University of Cincinnati: 07/11/2008 Arun Janarthanan Doctor of Philosophy Computer Engineering

UNIVERSITY OF CINCINNATI
Date: 07/11/2008
I, ARUN JANARTHANAN,
hereby submit this work as part of the requirements for the degree of:
DOCTOR OF PHILOSOPHY
in:
COMPUTER ENGINEERING
It is entitled:
NETWORKS-ON-CHIP BASED HIGH PERFORMANCE
COMMUNICATION ARCHITECTURES FOR FPGAS
This work and its defense approved by:
Chair: Dr. Ranga Vemuri
Dr. Karen Tomko
Dr. Harold Carter
Dr. Wen Ben Jone
Dr. S. Srinivasan
Networks-on-Chip based High Performance Communication
Architectures for FPGAs
A Dissertation submitted to the
Division of Research and Advanced Studies

of the University of Cincinnati
in partial fulfillment of the

requirements for the degree of
DOCTOR OF PHILOSOPHY
in the Department of
Electrical and Computer Engineering and Computer Science
of the College of Engineering.
by
Arun Janarthanan
B. E. (ECE), Sri Venkateswara College of Engineering, University of
Madras, Chennai, India, 2003
Dissertation Advisor: Dr. Karen Tomko
Committee Chair: Dr. Ranga Vemuri
Abstract
Networks-on-Chip is a recent solution paradigm adopted to increase the performance of
multi-core designs. The key idea is to interconnect various computation modules (IP
cores) in a network fashion and transport packets simultaneously across them, thereby
gaining performance. In addition to improving performance by having multiple packets in
flight, NoCs also present a host of other advantages including scalability, power
efficiency, and component re-use through modular design.
This work focuses on the design and development of high performance communication
architectures for FPGAs using NoCs. Once completely developed, this
methodology could be used to augment the current FPGA design flow for implementing
multi-core SoC applications. We design and implement an NoC framework for FPGAs,
Multi-Clock On-Chip Network for Reconfigurable Systems (MoCReS).
We enable the routers to function at independent clock frequencies that are dictated by
the FPGA place & route constraints, and yet follow a low latency virtual cut-through
flow control. With increasing design complexities, power trade-offs play a significant
role in FPGA design. We analyze the power consumed in the NoC framework that we
have developed on a Virtex-4 FPGA. Through experimental results, we study the various
components of power consumed in an FPGA based NoC.
We propose a novel micro-architecture for a hybrid two-layer router that supports both
packet-switched communications across its local and directional ports and time-
multiplexed circuit-switched communications among the multiple IP cores directly
connected to it. Results from placed-and-routed VHDL models of the advanced router
architecture show an average improvement of 20.4% in NoC bandwidth (maximum of
24%) compared to a traditional NoC. We parameterize the hybrid router model over the
number of ports, channel width and bRAM depth and develop a library of network
components (MoClib Library).
Synthesizing an NoC topology for FPGAs from the above library of network components
requires a complex trade-off among switch complexity, area available and bandwidth
capacity. We develop an algorithm and an application-generic design flow that includes
required bandwidth and area in the cost function and synthesizes the NoC topology for
FPGAs. For a set of real application and synthetic benchmarks, our approach shows an
average reduction of 21.6% in FPGA area (maximum of 26%) for equivalent bandwidth
constraints when compared with a baseline approach.
Interconnecting IP cores along with our NoC requires glue logic that can connect
different versions of the router to IPs. To accomplish this, we design a customizable
Network Interface that is compatible with our 2-layer hybrid router. To capture
real core implementation effects, we characterize a library of soft IP cores and implement
a typical image compression application on our FPGA. Through experiments, we
determine the area and power overhead of our on-chip network on an FPGA when
implemented along with a typical application. Further, by accurately modeling our on-chip
network for area, delay and power, we develop a platform that could be used to
floorplan a complete multi-processor application along with the NoC.
Acknowledgements
Firstly, I would like to express my gratitude towards my advisor, Prof. Tomko, for shaping
my graduate studies. Your strong direction and compassion have gone a long way in
helping me develop valuable academic and personal life qualities. Thanks for helping me
meet my deadlines and reviewing papers at very short notice. I consider myself very
fortunate to have taken courses with, and been constantly associated with, Prof. Vemuri
and his lab. The high standards you set in the courses and discussions remained a stable
platform for my research work. I also thank my other committee members, Prof. Carter,
Prof. Jone and Prof. Srinivasan, for reviewing my work and giving good feedback.
I consider myself very lucky to have had Prof. Srinivasan as my mentor at every stage
of my academic progress, from high school to declaring my dissertation complete. I
would like to acknowledge the support extended by Xilinx and Mentor Graphics through
their university programs. I would like to acknowledge our department staff: Rob Montjoy
for efficiently handling all computing and licensing issues, and Julie Muenchen for her
patience in ensuring that we conform to department regulations and formalities.
I am grateful to Intel and my supervisors for providing me with challenging
responsibilities during my internship. The break from graduate school went beyond a quality
work experience by going on to fund a large part of my research work.
Thanks to all my friends in graduate school. I thank Bala especially for helping
me develop strong interests in On-Chip Networks. I was inspired by the comprehensive
approach you had for problem solving. The late night/early morning discussions we
had in ERC deserve a special mention for strengthening my NoC background. Thanks to
Daniel for implementing the NI that is featured in the dissertation. Also, the discussions
Vijay & I had during the collaboration times will be very memorable. Thanks to Vijay
Sundaresan for always finding time to help out with any issue. Thanks to Jayanth for
imparting high levels of optimism in everything we talked about.
In a five year long PhD program it is extremely important to keep in touch with
a lot of people and I thank SABHA for serving as a wonderful medium to interact
with students and community. I carry a rich variety of learning from serving for two
consecutive years in the SABHA’s committee. I will cherish the experience for years to
come.
I am thankful for UC's relaxed ambience. It covered all of our expenses leaving us with
enough money for frequent travel. All the never-ending road trip memories that I share
with many in UC will be etched in my mind forever. UC's on-campus recreational and
housing facilities were excellent as well. I am indebted to my roommates Aravind and
Jagadish for making a wonderful home far away from home, for being great cooks
and, more importantly, for agreeing to share cooking turns with me. I know that I have made
some lifelong bonds of friendship while I was at UC: Prasanna, Raghav, Ramki, Payal
and others whom I have not mentioned.

Thanks to those everlasting VLSI projects. I found a good project and life partner.
I consider myself very lucky to find someone with whom I will share the rest of my life,
Anusha. I have taken you virtually through every up and down in my graduate school.
Thanks for being there for me always.

I am very fortunate to be a part of a caring large family that spans the whole of
the US and India. Amma, Appa, Anu and now Nive, I am happily placed in a circle of
affection because of you. I dedicate this contribution to all of you and to my beloved
paati, who for all her life prayed for her grandson's success and moved to a more peaceful
world when I was presenting this doctoral dissertation.
Contents
1 Introduction
1.1 Platform FPGAs
1.2 Device vs Interconnect Scaling
1.3 FPGA based NoCs
1.3.1 Scalability
1.3.2 Energy Efficiency
1.3.3 Design Re-use
1.3.4 Dynamic Reconfiguration
1.4 Research Approach and Overview
1.4.1 MoCReS
1.4.2 Power Analysis
1.4.3 Hybrid Two-Layer Router
1.4.4 Topology Synthesis for FPGAs
1.4.5 SoC Development
1.5 Thesis Outline
2 On-Chip Network Background
2.1 Alternatives in Communication Architecture
2.1.1 Dedicated FPGA Interconnects
2.1.2 Time-muxed Interconnects
2.1.3 Circuit-Switched Interconnects
2.1.4 Packet-Switched Networks
2.2 On-Chip Network Description
2.3 Network-on-Chip Aspects
2.3.1 Topology
2.3.2 Flow Control
2.3.3 Routing
2.3.4 Arbitration
2.3.5 Buffering
2.4 Current Research in NoC
2.4.1 Industrial Applications with NoCs
2.4.2 FPGA based NoC
3 MoCReS: NoC Framework
3.1 Introduction
3.2 Related Work
3.3 Design Goals
3.4 Network-on-Chip Aspects
3.4.1 Network Topology
3.4.2 Flow Control
3.4.3 Routing
3.4.4 Buffering
3.4.5 Arbiter
3.5 Router Micro-Architecture
3.5.1 Packet Description
3.5.2 Input Port
3.5.3 Cross-Point Matrix
3.5.4 Central Arbiter
3.6 Results: Functional Simulation
3.6.1 Common Clock Design
3.6.2 Multi-Clock Design
3.7 Results: Area-Performance Characterization
3.7.1 Router Area Analysis
3.7.2 FPGA NoC Resource Analysis
3.7.3 Performance Analysis
3.8 Conclusions
3.8.1 Limitations of our Packet Switched MoCReS Framework
4 NoC Power Analysis
4.1 Introduction
4.2 Related Work
4.3 MoCReS Power Consumption
4.3.1 Components of Power
4.3.2 Comparison with LiPaR
4.3.3 NoC Component Power
4.4 Conclusions
5 Hybrid Two-Layer Router Architecture
5.1 Introduction
5.2 Motivation
5.2.1 Packetization and Control Overheads
5.3 Related Work
5.4 Architecture Description
5.4.1 Cross-Point Matrix
5.4.2 Central Arbiter
5.4.3 NI Design
5.4.4 Design Parameters
5.5 Architectural Advantages
5.6 System-Level Router Model
5.7 Synthesis Results
5.8 Results: Performance Improvement
5.8.1 Design Issues
5.9 Conclusions
6 Hybrid NoC: Performance and Power Analysis
6.1 Performance Analysis
6.1.1 Packet Injection Rate Vs Average Latency
6.2 Power Analysis
6.2.1 Power Breakdown Results
6.2.2 Switch vs Link Power
7 Experimental Platform
7.1 Multi-Processor Benchmarks
7.2 Xilinx ISE Design Flow
7.2.1 FPGA Resource Characterization
7.3 NoC Design Flow
7.3.1 Bandwidth Requirement Vs No. Flits
8 FPGA Based NoC: CAD Flow
8.1 Introduction
8.2 Related Work
8.3 Problem Formulation
8.4 Topology Synthesis Flow
8.4.1 Clustering
8.4.2 Mesh Generation
8.4.3 Candidate Topology Selection
8.4.4 Area Optimization
8.5 Description of Algorithm
8.6 Complexity Analysis
8.7 Experimental Results and Analysis
8.7.1 Execution Time Results
8.7.2 Area Results
8.8 Conclusions
9 NoC based System-on-Chip Development
9.1 Motivation
9.2 IP Core Library
9.2.1 Core Abstraction
9.2.2 Xilinx [1] IP Support
9.2.3 IP Library Characterization
9.3 Network Interface Implementation
9.3.1 Primary Design Goals
9.3.2 Customized IP Library
9.4 NoC Floorplanning
9.4.1 Synthesis of a Predictable NoC
9.5 Image Compression Implementation
9.5.1 NoC Implementation
9.6 NoC Analysis
10 Summary of Contributions and Future Work
10.1 Contributions
10.1.1 NoC Framework: MoCReS
10.1.2 Power Characterization
10.1.3 Hybrid 2-Layer Architecture
10.1.4 Performance and Power Analysis
10.1.5 CAD Flow: Topology Synthesis
10.1.6 NoC Based SoC Development
10.2 Future Work
List of Tables
1.1 System Level Design Requirements [2]
3.1 Comparison of FPGA Router Designs
3.2 Standalone Router Area Results
4.1 Standalone: MoCReS (1VC+CC) Power Consumption
4.2 3 × 3 Mesh: MoCReS (1VC+CC) Power Consumption
4.3 Standalone: MoCReS (1VC+MC) vs LiPaR
4.4 Power Dissipation Across NoC Components
4.5 FPGA Resource Utilization: MoCReS (1VC+MC) vs LiPaR
5.1 Scaling of Area and Frequency with No. of C-Layer Ports
5.2 Scaling of Area and Frequency with No. of P-Layer Ports
6.1 Input Flits: Transition Activities
7.1 Application and Synthetic Benchmarks
7.2 Performance and Power Estimates of Routing Resources in XC4VLX100
7.3 MoCReS: FPGA Resource Utilization
7.4 Comparison between Communication Abstractions
8.1 Clustering Results for Benchmarks
8.2 Algorithm Execution Time
8.3 MPEG4 Area Improvement
List of Figures
1.1 Scaling of Global Interconnects [2]
1.2 Future of Networks-on-Chip [2]
3.1 A Multi-Clock 2 × 2 Mesh Based NoC
3.2 Packet Specification
3.3 Router Micro-Architecture (Input Port, Cross-Point, Central Arbiter)
3.4 Router Functional Simulation
3.5 Router Area Vs Channel Width
3.6 3 × 3 Mesh FPGA Utilization
3.7 Average Latency Vs Injection Rate of 3 × 3 Mesh
5.1 Hybrid Two-Layer Router Architecture
5.2 Modified Central Arbiter Model
5.3 SystemC Simulation
5.4 Design Parameters Vs Area
5.5 Design Parameters Vs Frequency
5.6 Area (Slices) Vs Avg. Bandwidth / Port
6.1 Baseline 3 × 2 MoCReS Mesh
6.2 Modified Hybrid Router 2 × 2 Mesh
6.3 Packet Statistics: No. Packets Injected from each IP
6.4 Results: Baseline 3 × 2 MoCReS Vs Hybrid Router Mesh
6.5 Dynamic Power Breakdown of an 8-port Hybrid Router
6.6 P-Layer Ports (Switch Size) Vs Dynamic Power (mW)
6.7 C-Layer Ports (Switch Size) Vs Dynamic Power (mW) @ 200 MHz
6.8 Power (mW): Baseline MoCReS Vs Hybrid Router Mesh
7.1 Nallatech [3] Bendata-V4 Platform FPGA
7.2 Sample XDL Description
7.3 Routing Resources: XC4VLX100
7.4 MoCReS Routing Resource Utilization
7.5 Router Configurations
7.6 Experimental Flow for NoC Topology Synthesis
8.1 NoC Design Dependency Showing Our Approach
8.2 Topology Synthesis Framework
8.3 Clustering Phase of Topology Synthesis
8.4 IP Mapping and Link Bandwidth Estimation
8.5 Application Benchmarks
8.6 NoC Benchmark Results
9.1 ITRS 2007 Showing IP Design Reuse Trends
9.2 Xilinx [1] MicroBlaze System Design and Architecture
9.3 IP Core Abstraction and NI Wrapper
9.4 Triple DES IP Interface
9.5 IP Properties and Customization Overhead
9.6 IP Customization Overhead: Area and Power
9.7 NoC Mesh Floorplan
9.8 NoC Mesh Routing
9.9 Routing Resources: Delay and Power
9.10 JPEG Application
9.11 NoC Implementation Alternatives
9.12 JPEG Configuration: Area and Power Overhead Analysis
Chapter 1
Introduction
Present platform FPGAs consist of a variety of embedded computation elements, in
addition to the programmable logic and interconnect. The increasing heterogeneity,
coupled with higher operating frequencies, enables FPGAs to replace ASICs in several high
performance applications. In this chapter, we discuss platform FPGAs from an SoC
perspective and present the inherent performance limitation due to scaling of device sizes.
We then introduce the current solutions proposed to handle the performance bottleneck,
with an emphasis on the Network-on-Chip paradigm.
1.1 Platform FPGAs
Traditional FPGAs comprise a large amount of programmable logic and interconnect
to implement user applications. Recently, FPGA companies have adopted a coarse-grained
approach that combines the fine-grained reconfigurable resources
with hard embedded cores that match their ASIC counterparts in performance, power
and area. FPGAs are presently utilized in various application domains, from mobile
portable devices to space devices. They increasingly replace ASICs as the target
technology of choice due to their increasing device sizes and operating frequency. The
following are some of the capabilities that a present platform FPGA offers:
• Up to 200,000 Logic Cells
• Abundant Embedded IP (PowerPC processors, Multipliers, bRAMs & DCMs)
• Up to 500 MHz Clocking
• Multiple Device Voltages
• Rich Flexible Interconnects
A System-on-Chip (SoC) design integrates processors, memory, and a variety of IPs
in a single design. Due to the FPGA capabilities listed above and high time-to-market
pressures, complex SoC designs are increasingly targeted to FPGAs.
1.2 Device vs Interconnect Scaling
FPGA device manufacturers have achieved large device sizes using 65 nm technology. At
this nanometer scale, interconnects scale poorly compared to transistors. As a
result, the global interconnect in a design incurs a high delay compared
to transistors. Figure 1.1 presents the difference between local and global interconnect
delay due to technology scaling [2].
This performance limitation is more pertinent to FPGAs (than to ASICs)
due to the programmable nature of their interconnects (and large device sizes). As a
result, interconnects in FPGAs account for a significant portion of total circuit
delay and power. As designs get bigger, it therefore becomes very difficult to maintain high
clock rates. Networks-on-Chip (NoC) is a design paradigm proposed to contend with
this inherent performance bottleneck. Figure 1.2 shows the projected trend for on-chip
network based design methods [2]. NoCs on FPGAs are an active area of research and
hold great promise for meeting present SoC communication needs.
Figure 1.1: Scaling of Global Interconnects [2]
1.3 FPGA based NoCs
Traditionally, cores in FPGAs are connected using bus-based architectures. NoCs are
proposed as an alternative to eliminate the inherent performance bottleneck in bus-based
architectures. In addition to an increase in performance, NoCs present a host of other
advantages, particularly for FPGA designs. In this section, we discuss these advantages
for implementing multi-core applications with NoCs.
1.3.1 Scalability
In NoCs, the IP cores are connected to the network through routers (network backbone).
As the communication between routers is standardized, the addition of more cores does not
have an impact on the rest of the design. Bus-based architectures, on the other hand,
scale poorly with the number of cores. With an increasing number of cores and
growing complexity in arbitration logic, the operating frequency of bus-based communication
degrades.
[Figure 1.2: roadmap chart covering 2005 to 2021 (65nm to 16nm technology nodes) that
tracks System Level Component Re-use and On-Chip Network Design Methods through the
stages Research Required, Development Underway, Qualification/Pre-Production and
Continuous Improvement]
Figure 1.2: Future of Networks-on-Chip [2]
1.3.2 Energy Efficiency
FPGAs are a suitable implementation platform for portable devices, and low power
operation is therefore a critical requirement. Compared to bus-based architectures, NoCs
consume less power due to lower switched capacitance (shorter lines). Further, due to a
high level of parallelism in communication, the overall energy requirement is comparatively low.
1.3.3 Design Re-use
Due to their hierarchical approach, NoCs bring a rapid reduction in design and verification
time. Further, development of IP cores can be independent from the
application and other parts of the design. Table 1.1 shows the future trend for design
re-use [2]. A steady increase is expected in the percentage of design components re-used,
which supports our choice of NoC as a design paradigm.
Table 1.1: System Level Design Requirements [2]

Year of       Design Reuse             SoC Reconfigurability
Production    (% to all logic size)    (total % of SoC reconfigurable)
2005          32%                      23%
2006          33%                      26%
2007          35%                      28%
2008          36%                      28%
2009          38%                      30%
2010          40%                      35%
2011          41%                      38%
2012          42%                      40%
2013          44%                      42%
2014          46%                      45%
2015          48%                      48%
2016          49%                      50%
2017          51%                      53%
2018          52%                      56%
2019          54%                      60%
2020          55%                      62%
1.3.4 Dynamic Reconfiguration
During dynamic reconfiguration, a computational element (core) is replaced by another
task without affecting the execution of other parts of the design. Table 1.1 presents the
expected reconfiguration trend [2]. A steady increase is expected in the percentage of
reconfigurable components in an SoC. A primary challenge in reconfigurable computing is
the concurrent design of the communication and computation subsystems. The NoC
approach to design in FPGAs inherently separates these two aspects of the design by
providing standard interfaces to the cores. Several current research efforts [4] [5] advocate
NoC design on FPGAs for efficient module replacement and design re-use.
1.4 Research Approach and Overview
In this research, we focus on networks-on-chip based high performance communication
architectures for FPGAs. Once completely developed, this NoC framework can efficiently
replace the traditional communication architecture.
1.4.1 MoCReS
An FPGA based on-chip network has a unique set of design goals that includes
satisfying the bandwidth requirements with minimal (limited) resource availability. We
implement a minimum area, high performance packet-switched router (MoCReS)
for FPGA based NoCs. Our 5-port virtual cut-through router has an area overhead of
only 282 Virtex-4 slices (a marginal 0.57% of the XC4VLX100) and operates at 357 MHz,
supporting a competitive data rate of 2.85 Gbit/s.
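The figures quoted above can be cross-checked with a short calculation. The sketch below is only an illustration: the XC4VLX100 slice count (49,152) is taken from the Virtex-4 family data sheet rather than from this section, and the 8-bit channel width is an assumption chosen because it is consistent with the quoted data rate.

```python
# Back-of-the-envelope check of the router figures quoted in the text.
XC4VLX100_SLICES = 49_152   # assumption: slice count from the Virtex-4 data sheet
ROUTER_SLICES = 282         # from the text
CLOCK_HZ = 357e6            # from the text
CHANNEL_WIDTH_BITS = 8      # assumption: channel width is not stated in this section


def area_overhead_percent(router_slices, device_slices):
    """Share of the device occupied by one router, in percent."""
    return 100.0 * router_slices / device_slices


def port_data_rate_gbps(clock_hz, width_bits):
    """Peak data rate of one port that moves one flit per clock cycle."""
    return clock_hz * width_bits / 1e9


overhead = area_overhead_percent(ROUTER_SLICES, XC4VLX100_SLICES)
rate = port_data_rate_gbps(CLOCK_HZ, CHANNEL_WIDTH_BITS)
print(f"area overhead: {overhead:.2f}%")        # area overhead: 0.57%
print(f"peak port rate: {rate:.3f} Gbit/s")     # peak port rate: 2.856 Gbit/s
```

Under these assumptions the computed overhead matches the quoted 0.57%, and the computed rate of 2.856 Gbit/s agrees with the quoted 2.85 Gbit/s after truncation.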
1.4.2 Power Analysis
The NoC communication architecture competes for resources with the user application
and also incurs a power overhead. We determine the power consumption of the NoC
framework on an FPGA. Further, we analyze the power trade-offs associated with our design
novelties by comparing our design with a baseline approach implemented on the same target
device.
1.4.3 Hybrid Two-Layer Router
A strict packet-switched NoC incurs a high serialization overhead. To offset this
performance overhead, we develop a novel hybrid two-layer router architecture that
supports packet-switching for inter-router transfers and time-multiplexed circuit-switching
for IP cores connected to the same router. The advanced router architecture achieves
an average improvement of 20.4% in NoC bandwidth (maximum of 24%) compared to a
traditional NoC. Furthermore, a thorough analysis of the power-performance trade-offs
shows our 2-layer router to be superior to the baseline approach.
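The division of labor between the two layers can be illustrated with a minimal sketch. It is not the router's actual arbitration logic: the `Core` type, the `select_layer` function and the example core names are hypothetical, and the only rule encoded is the one stated above (same router means circuit-switched, different routers means packet-switched).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    name: str
    router: tuple  # hypothetical (row, column) position of the attached router


def select_layer(src: Core, dst: Core) -> str:
    """Pick the layer that would carry a transfer from src to dst."""
    if src.router == dst.router:
        # Both cores hang off the same router: time-multiplexed circuit switching.
        return "circuit-switched"
    # Cores on different routers: the transfer crosses the packet-switched mesh.
    return "packet-switched"


# Hypothetical placement: two cores share router (0, 0), one sits on router (1, 0).
dct = Core("dct", (0, 0))
quant = Core("quant", (0, 0))
mem = Core("mem", (1, 0))

print(select_layer(dct, quant))  # circuit-switched
print(select_layer(dct, mem))    # packet-switched
```

Transfers between co-located cores avoid packetization entirely in this model, which is where the serialization savings described above would come from.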
1.4.4 Topology Synthesis for FPGAs
Traditional design flows for FPGAs rely on sophisticated CAD tools to achieve design
closure. However, multi-core designs do not have a standardized CAD flow for FPGAs.
Moreover, the vast heterogeneity of the FPGA device further complicates the design flow
for NoC based FPGA designs. Further, CAD solutions for multi-core applications cannot
be borrowed from the ASIC domain, due to the inherent differences in the underlying
architecture. As a part of this research, we design an algorithm to effectively automate the
NoC design cycle for FPGAs. For any given application expressed as a task graph, our integrated
synthesis framework determines a suitable NoC topology that satisfies the bandwidth
requirements, while optimizing for the area overhead.
1.4.5 SoC Development
Implementing a complete SoC development flow using FPGA based NoCs requires a
thorough characterization of IPs, an efficient Network Interface and methodologies for

floorplanning the complete NoC-NI-IP framework. In this dissertation, we present our
contributions in all the above three domains towards a complete implementation of a
multi-processor application on FPGA. Furthermore, to deeply study the area and power
overheads involved in this alternate communication architecture, an image compression

application is implemented as a case study.
1.5 Thesis Outline
The relevant NoC and FPGA background material required for this research is presented
in Chapter 2. Further, this chapter includes a survey of alternate approaches considered
in current research.
Chapter 3 describes the FPGA based MoCReS framework that is developed as a part
of this research. The chapter also presents the area/performance trade-offs for various
versions of the router. We utilize the NoC framework described in this chapter to conduct
all the experiments presented in the following chapters.
We analyze the power dissipated in FPGA based NoC framework and present its descrip-
tion and results in Chapter 4. Further, a comparison of our MoCReS framework with a
baseline NoC design in terms of power is featured in this chapter.
Chapter 5 presents a hybrid two-layer router architecture. The advantages behind the
novel architecture, along with the design issues involved are presented in this chapter.
The area and performance metrics of the hybrid router architecture are characterized and
stored along with parameterized designs in a MoClib NoC component library.
Chapter 6 presents a detailed analysis of performance and power of our novel NoC frame-
work. Through detailed traffic analysis and comparisons, we present the performance
gain obtained in our hybrid router framework. This chapter concludes with a component
based NoC power analysis and a comparison of switch and link power in FPGA based
NoCs.
The experimental platform for this thesis is outlined in Chapter 7, including the CAD
tools, software and hardware used. We also present the application and synthetic
benchmarks that we utilized to
extract the area/performance results in the experiments.
Chapter 8 formally presents the design and implementation of the CAD tool developed
to perform automatic topology synthesis. It presents the cost constraints and trade-offs
involved in the algorithm during design space exploration. The chapter also includes the
results obtained for a variety of application and synthetic benchmarks.
In Chapter 9, we present the IP implementation methodologies and characterize a library
of frequently used IP cores in FPGAs. The design goals behind a network interface
particularly suitable for our NoC are also presented. Furthermore, the chapter includes a
power-performance study into floorplanning our on-chip network in FPGAs.
Finally, Chapter 10 summarizes all the contributions made in this dissertation, and
outlines the future research directions.
Chapter 2
On-Chip Network Background
The motivation behind implementing NoC architectures in FPGAs is to remove the

performance bottleneck present in bus-based architectures. Even though we give an
emphasis to FPGA based on-chip networks, much of the background material presented
in this chapter is also applicable to ASIC networks. In this chapter, we first present
the alternatives in communication architecture, followed by a detailed description of a
Network-on-Chip. We conclude the chapter by discussing recent research in the NoC
area.
2.1 Alternatives in Communication Architecture
This section presents various ways of implementing communication architectures in FPGAs.
Some of the main trade-offs involved in choosing an interconnect mechanism are:
• Throughput Available
• Static Schedule Requirement
• Switch/Interconnect Area Overhead
• Signal Integrity
• Bandwidth Guarantee
Based on the above trade-offs, the interconnection mechanisms can be broadly clas-
sified into the following:
2.1.1 Dedicated FPGA Interconnects
Present FPGA devices support dedicated interconnects. These are the spatially dis-
tributed FPGA resources configured through programmable switches. The latency of
this type of communication is very low and there is guaranteed bandwidth to support
the communications. However, the interconnect utilization is extremely low, as the dedi-
cated connections are almost never time-multiplexed for a different communication. With
the limited resource available within the FPGA and with increasing design complexities,
it is challenging to preserve signal integrity with this form of interconnects. Present
Virtex [1] device architecture supports this type of interconnects.
2.1.2 Time-muxed Interconnects
Time-muxed interconnects offer high-throughput connections with very high interconnect
utilization. This approach requires all the communication schedules to be known off-line.
With an increasing number of core communications, the area requirement for context
memory offsets the gain achieved in throughput and interconnect utilization. Research
in [6] explores use of time-muxed interconnects for FPGAs.
2.1.3 Circuit-Switched Interconnects
In a circuit-switched mechanism, the resources utilized for the communication architecture
can be time-multiplexed. The latency of such a network is minimal and the
communication schedules can be determined online. However, preserving signal integrity
for larger designs is a big challenge as this circuit-switched connection, once established,
follows a synchronous scheme. PNoC [7] advocates this type of interconnect for FPGAs.
2.1.4 Packet-Switched Networks
The central idea with this structure is to transport data across modules in the form of
packets. Multiple packets are in flight between several (source, destination) pairs, thereby
increasing the overall performance. Similar to circuit-switched interconnects, this form
of communication is established online and does not require static scheduling. The
packetization/serialization overhead involved along with the lack of bandwidth guarantee

are the main drawbacks of this approach. A detailed comparison of packet-switched
networks with circuit switched networks is presented in [8].
2.2 On-Chip Network Description
The main modules present in an on-chip network are the IP cores (computation units),
the Network Interface (NI) and the NoC backbone.

IP Cores: The Intellectual Property (IP) cores form the computation elements of the
application. These cores are developed independently and often re-used in different parts
of the application. To handle current design complexities, there is a shift in the design
trend to a more modular IP based approach.

Network Interface (NI): The NI forms the glue logic between the IP cores and the
network backbone. It standardizes the interface between the IPs present in a network
and the routers. Further for packet-switched networks, it also packetizes the outgoing
data from the IP and injects it to the network. When an IP receives data from the
network, the NI grants the request of the downstream router to receive the packets. The
NI is customized to suit the requirements of a particular network backbone. Research is
underway to standardize these network interfaces and IP communication protocols [9].
Network-on-Chip Component: The router forms the heart of the NoC backbone. It
is responsible for transporting packets that originate from the IP cores. Traditionally,
a router in a mesh network has four directional ports (North, East, South and West)
to communicate with the neighboring routers. Further, it has at least one local port
through which an IP core is interfaced to the network. Upon receiving a packet, the
router decodes, buffers and routes it in the appropriate direction based on the destination
node. Throughout this chapter we use the terms router and switch interchangeably.
An on-chip interconnection network can be described by a set of design choices called
network aspects. We describe network aspects in the following section.
2.3 Network-on-Chip Aspects
The choice of suitable network aspects has a large impact on the design metrics of the NoC.
Topology, Flow Control, Arbitration, Buffering and Routing are the main aspects of a
network.
2.3.1 Topology
Network topology comprises the arrangement and connectivity of the routers. 2D Mesh,

Ring and Star are some of the popular network topologies. The quality of a network can
be defined in terms of some of its characteristics, namely, bisection bandwidth, degree

and diameter. Choice of an appropriate topology has an impact on performance, area
utilized and power consumed. Based on the interconnect requirement, area and power,
2D-mesh networks have been found suitable for our target device [10]. In addition to
this, mesh networks can use a simple, deadlock free XY routing mechanism.
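To make two of the topology characteristics named above concrete, the following sketch computes diameter and maximum router degree for the three topologies mentioned. It is only illustrative: the function names are hypothetical, the formulas are the standard textbook ones rather than results from this work, and degree counts router-to-router links only (the local port is excluded).

```python
def mesh_metrics(rows: int, cols: int) -> dict:
    """Diameter and maximum router degree of a rows x cols 2D mesh."""
    # The longest shortest path runs from one corner to the opposite corner.
    diameter = (rows - 1) + (cols - 1)
    # A router has at most two neighbours per dimension (fewer in tiny meshes).
    max_degree = min(2, rows - 1) + min(2, cols - 1)
    return {"diameter": diameter, "max_degree": max_degree}


def ring_metrics(n: int) -> dict:
    """Bidirectional ring with n >= 3 routers."""
    # The farthest router is halfway around the ring.
    return {"diameter": n // 2, "max_degree": 2}


def star_metrics(n: int) -> dict:
    """Star with one hub and n - 1 leaf routers (n >= 3)."""
    # Any leaf reaches any other leaf through the hub in two hops.
    return {"diameter": 2, "max_degree": n - 1}


if __name__ == "__main__":
    for name, metrics in (
        ("4x4 mesh", mesh_metrics(4, 4)),
        ("16-node ring", ring_metrics(16)),
        ("16-node star", star_metrics(16)),
    ):
        print(name, metrics)
```

For 16 routers, the star has the smallest diameter but concentrates all links on one hub, while the mesh keeps the router degree bounded at four regardless of network size.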
2.3.2 Flow Control
The three main alternatives in flow control of an NoC are Store and Forward, Virtual
Cut-Through, and Wormhole. In the Store & Forward technique, the out-going packet
is completely buffered at the downstream router before making the next hop. This tech-
nique used in [11] sustains a high packet latency (directly proportional to packet size).
In contrast to this, Virtual Cut-Through technique has low latency as the head flit pro-
gresses without waiting for the rest of the packet. However, the buffer requirements are
in terms of multiples of packet sizes for both the above approaches. The wormhole tech-
nique, which operates on flits, sustains low latency with minimum buffer requirements,
with a packet residing across multiple nodes, thereby increasing the complexity of the
switch.
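The latency difference between these schemes can be illustrated with a first-order, contention-free model. This is a common textbook approximation and not a measurement from this work; the per-hop router delay, the one-flit-per-cycle link rate and the function names are assumptions made for the example.

```python
def store_and_forward_latency(hops: int, packet_flits: int, router_delay: int = 1) -> int:
    """Cycles for a packet under Store and Forward, with no contention.

    Every hop waits for the whole packet, so the serialization time
    (packet_flits cycles at one flit per cycle) is paid once per hop.
    """
    return hops * (router_delay + packet_flits)


def cut_through_latency(hops: int, packet_flits: int, router_delay: int = 1) -> int:
    """Cycles for Virtual Cut-Through or Wormhole, with no contention.

    The head flit advances as soon as it is routed, so the serialization
    time is paid only once, overlapped with the traversal of the path.
    """
    return hops * router_delay + packet_flits


if __name__ == "__main__":
    hops, flits = 4, 16
    print("store and forward:", store_and_forward_latency(hops, flits))
    print("cut-through      :", cut_through_latency(hops, flits))
```

Under this model the Store and Forward latency grows with the product of path length and packet size, whereas the cut-through latency grows with their sum, which is why the gap widens for larger packets.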
2.3.3 Routing
The routing mechanism determines the decisions taken when a packet is in flight. The
route established for a packet-based communication can be determined either at the
source node (Source Routing) or independently across the routers of the network (Dis-
tributed Routing). In the case of source routing, the first flit in a packet is the header,
that contains the entire route of the packet. Upon decoding the header, the router passes
the packet to the appropriate downstream router. In the case of distributed routing, the
route decisions can be made throughout the network by the routers (depending on con-
gestion and other network conditions).
Another classification in the routing mechanism is based on adaptivity to network con-
ditions. In Deterministic routing, the route between any (source,destination) pair is
always the same. XY routing is a widely used low complexity routing mechanism that
is deadlock free. On the flip side, a deterministic routing mechanism cannot alter the
routes based on network traffic conditions. Paths taken using Adaptive routing can vary
and be non-minimal. However, the switch incurs additional complexity to support this
mechanism.
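A minimal sketch of deterministic XY routing on a 2D mesh is given below. The coordinate convention (x grows towards East, y grows towards North), the port letters and the function names are assumptions made for illustration; the sketch only shows the decision rule, not any particular hardware decoder.

```python
def xy_next_port(current: tuple, destination: tuple) -> str:
    """Return the output port chosen at `current` for a packet going to `destination`.

    Assumed convention: coordinates are (x, y), x grows to the East and
    y grows to the North. 'L' is the local port of the destination router.
    """
    cx, cy = current
    dx, dy = destination
    if dx > cx:            # correct the X offset first
        return "E"
    if dx < cx:
        return "W"
    if dy > cy:            # X is resolved, now correct the Y offset
        return "N"
    if dy < cy:
        return "S"
    return "L"             # arrived: deliver to the local core


def xy_route(source: tuple, destination: tuple) -> list:
    """List of output ports taken from source to destination (ending in 'L')."""
    step = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}
    ports, node = [], source
    while True:
        port = xy_next_port(node, destination)
        ports.append(port)
        if port == "L":
            return ports
        node = (node[0] + step[port][0], node[1] + step[port][1])


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

Because the route depends only on the source and destination coordinates, every packet between the same pair of nodes follows the same path, which is the deterministic property discussed above.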
2.3.4 Arbitration
Within a router, arbitration is required while servicing conflicting requests. When mul-
tiple input ports request a common output port, the requested output port determines
the order in which requests are acknowledged. The scheduling can be either static (pre-
determined) or dynamic. Round-Robin arbitration is a popular technique that provides
a fair dynamic granting scheme. Additionally, Quality-of-Service (QoS) guarantees can
be provided to certain classes of applications by augmenting this arbitration scheme with

a priority policy.
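The round-robin granting scheme can be sketched in software as follows. This is a behavioural illustration only; the class name and interface are hypothetical, and a hardware arbiter would implement the same rotation with a priority pointer and combinational logic.

```python
from typing import List, Optional


class RoundRobinArbiter:
    """Grants one of several requesters, rotating priority after each grant."""

    def __init__(self, num_ports: int) -> None:
        self.num_ports = num_ports
        self.next_priority = 0  # index of the port that is checked first

    def grant(self, requests: List[bool]) -> Optional[int]:
        """Return the granted port index, or None if nobody is requesting."""
        for offset in range(self.num_ports):
            port = (self.next_priority + offset) % self.num_ports
            if requests[port]:
                # The served port becomes the lowest priority next time.
                self.next_priority = (port + 1) % self.num_ports
                return port
        return None


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(5)
    requests = [True, False, True, False, False]  # ports 0 and 2 compete
    print([arbiter.grant(requests) for _ in range(4)])
```

With two ports requesting continuously, the grants alternate between them, so neither requester can starve the other.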
2.3.5 Buffering
At the intermediate nodes, packets need to be temporarily stored, while waiting for a
channel access. Based on where the buffers are placed in the routers, Input Buffering and
Output Buffering are the two main classifications. While buffering a packet at the input
of a router, additional requests from upstream routers are not granted. This situation
is called Head-of-Line (HOL) blocking. Using output buffering, the HOL blocking is
avoided, thereby decreasing the average latency of the packets.
2.4 Current Research in NoC
In this section, we summarize the recent research work in the area of NoCs. A brief
account of current NoC applications in industry is also presented.
The International Technology Roadmap for Semiconductors 2005 [2] is the first to
address the inherent performance bottleneck due to poor interconnect scaling. The con-
cept of routing packets instead of wires to gain in performance was introduced by Dally
et al. [12] and was later formally presented as a solution paradigm by Benini et al. [13].
The first proof-of-concept was presented by Kumar et al. [14] by implementing a complete
NoC framework. Since then, there have been several academic and industrial advancements
in the design of NoCs.
2.4.1 Industrial Applications with NoCs
The NoC framework [15] by Philips is one of the first reported industry implementations
of an NoC. Sony has partnered with IBM [16] on the network-on-chip implementation of the
multi-core design for the PlayStation 3 (PS3). Further, Sonics [17] has been
involved in the development of interconnect structures based on NoCs. Arteris [18] and
Silistix [19] are other recent startup companies that provide NoC solutions to applications.
2.4.2 FPGA based NoC
Marescaux et al. [4] have applied the NoC paradigm on a Virtex device to enable multi-
tasking by tile-based reconfiguration. The Hermes [20] NoC platform was developed par-
ticularly for FPGAs to enable dynamic reconfiguration. Until recently the potential of
NoCs to address the performance issues in FPGAs was left unexplored. Bartic et al. [21]
present an adaptive NoC design for FPGAs and analyze its implementation issues. Sal-
dana et al. [10] address multi-processor designs in FPGAs by considering various NoC
topologies by scaling the number of IP cores. Hilton et al. [7] design a flexible circuit-
switched NoC for FPGAs. Kapre et al. [22] compare the suitability of packet and circuit
switching for FPGAs.
In addition to the current research presented in this chapter, we also refer to related
work alongside the contributions presented in the subsequent chapters.
Chapter 3
MoCReS: NoC Framework
3.1 Introduction
Design of a high performance, flexible on-FPGA communication architecture with mini-

mum area overhead presents a great challenge. In this chapter, we present the design and
implementation of MoCReS: Multi-Clock On-Chip Network for Reconfigurable Systems.
The key idea is to implement a low area and high performance packet-switched NoC
framework for FPGAs. The central component of the NoC (router) can support inde-
pendent operating frequencies, dictated by placement and routing constraints in FPGA.
Moreover, the router supports a low latency virtual cut-through flow control for vari-
able packet sizes. Our 5-port router has an area overhead of only 282 Virtex-4 slices (a
marginal 0.57% of logic resources of an XC4VLX100 device) and can operate as high
as 357 MHz supporting a competitive data rate of 2.85 Gbit/s. We gain in router area
and performance by reducing the logic depth of the central arbiter and cross point ma-
trix. We utilize our router to construct a mesh based multi-clock on-FPGA NoC. We also
demonstrate its functionality and characterize performance, area and power of several
versions of the router.
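The headline figures quoted above can be cross-checked with simple arithmetic. The sketch below assumes one 8-bit flit is transferred per clock cycle on a channel, and it takes the XC4VLX100 slice count (49,152) from the Xilinx Virtex-4 family data sheet; neither assumption is stated explicitly in this section.

```python
FLIT_WIDTH_BITS = 8        # channel width of the basic router, from the text
ROUTER_CLOCK_HZ = 357e6    # maximum operating frequency, from the text
ROUTER_SLICES = 282        # basic router area, from the text
DEVICE_SLICES = 49_152     # assumption: XC4VLX100 slice count per the Xilinx data sheet

# Assumption: one flit crosses a channel every clock cycle.
data_rate_gbps = FLIT_WIDTH_BITS * ROUTER_CLOCK_HZ / 1e9
area_percent = 100 * ROUTER_SLICES / DEVICE_SLICES

if __name__ == "__main__":
    print(f"channel data rate: {data_rate_gbps:.3f} Gbit/s")
    print(f"router area      : {area_percent:.2f}% of device slices")
```

Under these assumptions the product of channel width and clock frequency agrees with the quoted data rate of about 2.85 Gbit/s, and the slice ratio agrees with the quoted 0.57%.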
3.2 Related Work
We target our light-weight multi-clock NoC framework for reconfigurable computing

platforms. The requirements of FPGA based SoC design demand that the network have
minimum area overhead, maximum operating frequency and low latency of operation.
In this section, we compare our router to other proposed FPGA based NoC routers [11]
[23] [24]. In [11], the authors present a light-weight FPGA based parallel router that
uses store and forward flow control. This router has the disadvantage of high latency
(directly proportional to packet size) and it supports only fixed packet sizes. The modified
header in our virtual cut-through router overcomes the above mentioned disadvantages
by encoding the packet size as a fraction of the required FIFO depth. This technique
ensures low latency of operation and improved buffer utilization as a result of supporting
variable packet sizes. Research in [23] [24] [4] [25] presents wormhole based routers using
XY deterministic routing. Though the wormhole routing limits the buffer requirements,
it increases the area consumed due to its complexity thus limiting the logic available for
IP implementation in FPGAs. The 5 port wormhole router presented in [23] consumes
1832 Virtex II slices and operates at 66 MHz. Such an increase in router area degrades its
performance and increases the power consumed. Further, [24] [4] support variable packet
sizes, but with an additional header flit overhead as compared to our router. We will
show that our router consumes fewer resources than the above designs when implemented
in a similar target device. Table 3.1 presents a comparison of MoCReS with alternate
designs in terms of area (in slices) and operating frequency (MHz). The number of slices
utilized is a standard metric to compare FPGA area. In terms of logic, a Virtex-II slice is
equivalent to a Virtex-4 slice. The high operating frequency obtained through MoCReS is
also attributed to the low interconnect delays in Virtex-4 (advanced) architectures. The
operating frequency of our router in a comparable Virtex-II pro device was 172 MHz.
Table 3.1: Comparison of FPGA Router Designs

Router Design    Area (Slices)   Channel Width   Frequency (MHz)   Flow Control   Target Device
LiPaR [5]        352             8               33                STF            XC2VP30
Moraes [7]       316             8               50                WHR            XC2V1000
RASoC [14]       1506 [5]        8               56                WHR            XC2VP30
HRSoC [6]        1832            8               66                WHR            Virtex-II
Marescaux [8]    446             16              40                WHR            XC2V6000
MoCReS           282             8               357               VCut           XC4VLX100

Maximum operating frequency of the router varies greatly when implemented in
FPGA due to switch complexity and place & route constraints. The slowest router
implemented in an FPGA NoC determines the overall network operating frequency and
degrades its performance [25] [10]. We overcome this limitation by enabling the router to
support a multi-clock framework. Kim et al. [26] propose to interface the local cores op-
erating on individual frequencies with the network using asynchronous FIFOs for ASICs.
If implemented on FPGA, the slowest router would still dictate the operating frequency
of the network. Moreover, [26] buffers the entire packet from the local core before for-
warding it to the network. On the other hand, our router follows a modified multi-clock
virtual cut-through approach hence sustaining low latency.
3.3 Design Goals
The key design objective is to minimize the area consumed by the router, which is the
central component of a network. Reducing the logic ensures sufficient resources for SoC
design in FPGA and also minimizes power overhead. Secondly, we target to increase
the operating frequency of the router keeping the network latency to a minimum. It
is essential to improve the network bandwidth and avoid the bottleneck present in bus
based architectures. The final objective is to operate multiple routers on independent
clock frequencies thereby preventing the slowest router from restricting the operating
frequency of the network. Figure 3.1 presents a multi-clock framework with routers
functioning at individual frequencies. Dual ported input buffers are used to cross clock
domain boundaries, as shown in Figure 3.1.
Figure 3.1: A Multi-Clock 2 × 2 Mesh Based NoC
3.4 Network-on-Chip Aspects
The network topology along with the flow control, routing, buffering and arbitration
schemes describe an interconnection network. The choice of appropriate network aspects
has a significant impact on area and performance of the communication architecture.
3.4.1 Network Topology
We choose a mesh topology for our light-weight network. Mesh networks have a minimum
area overhead [10] (reduced number of nets) and low power consumption. In addition,
area scales linearly with the number of nodes and channel width in a mesh. A mesh also
maps well to the underlying routing structure of FPGA. Hence, choosing mesh networks
reduces the congestion in FPGA logic and routing which minimizes power consumption.
3.4.2 Flow Control
Virtual cut-through and wormhole techniques (unlike Store and Forward) have a packet
latency that is only proportional to the path length. However, the complexity of a
wormhole router makes it less suitable than a virtual cut-through router for light-weight
implementation. We have chosen a virtual cut-through flow control mechanism for
our router. This scheme supports higher throughput than wormhole routing by efficiently
releasing the upstream buffers during blockages. Furthermore, virtual cut-through flow
control supports high channel utilization with low latency and does not reserve physical
channels.
3.4.3 Routing
We choose the deadlock free XY routing for our switch. The simplicity of the XY
routing adds little overhead to the header decoding logic. Hence, XY routing is suitable
for implementing our area efficient router on FPGA.
3.4.4 Buffering
We buffer incoming packets only at the input ports. Although the input buffering intro-
duces the head-of-line problem, it leads to a low area overhead. In addition to buffering
the incoming flits, the input buffers also provide a framework to implement a multi-clock
network.
3.4.5 Arbiter
To ensure fairness, the competing input ports are served using a simple round
robin approach. The last served/denied port is placed at the end of the
priority queue. The FIFO virtual channels also follow a round robin approach when switching
packets downstream.
3.5 Router Micro-Architecture
The MoCReS router consists of five Input ports, Crosspoint matrix and Central arbiter.
Except for the header decoding logic, the five input ports are identical. As we adopt
a flow control with virtual channels (VCs), the input port contains arbitration logic
for multiple VCs. The input port also contains input buffers to store the incoming
packet. The MoCReS architecture has been developed in collaboration with another
colleague [27]. A more detailed description of the architecture can be found in the above
thesis [27].
3.5.1 Packet Description
For a FIFO depth of 16 (utilizing the Xilinx block RAM), the packet size can vary
between 24 bits and 128 bits with a header overhead of only 1 flit per packet. The flit
size is fixed at 8 bits. The header contains the address of the destination router, flit type
and packet fraction. Our virtual cut-through flow control needs one bit to specify the
flit type. The tail bit is set on the flit prior to the last flit to send the terminate signal
without wasting a clock cycle.
The router supports variable packet sizes by encoding the packet size as a fraction
of the required blockRAM (bRAM) depth (packet fraction) in its header. The Network
Interface (NI) in each IP core takes the onus of storing the fraction in the packet's header.
The fraction bits are complemented before storing to enable efficient comparison against
write count of the FIFOs. The number of fraction bits and flit width can be increased
if a higher packet granularity is desired. Increasing the number of fraction bits also
improves the buffer utilization with a marginal area overhead. The remaining bits in the
header are reserved to implement priority based flow control. An advanced version of the
router could utilize the remaining header bits to incorporate Quality-of-Service (QoS)
and offset the impact of contention latency.
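One plausible reading of the packet-fraction scheme is sketched below. The FIFO depth of 16 and the 8-bit flit come from the text; the choice of two fraction bits, the rounding rule and the exact comparison are assumptions made for illustration, since the bit-level details are not given here. Under these assumptions, complementing the fraction turns it directly into the largest write count at which the packet still fits, which is one way the comparison against the write count could become inexpensive.

```python
FIFO_DEPTH = 16                          # flits, from the text
FLIT_BITS = 8                            # from the text
FRACTION_BITS = 2                        # assumption: quarter-of-FIFO granularity
QUANTUM = FIFO_DEPTH >> FRACTION_BITS    # flits represented by one fraction step
MASK = (1 << FRACTION_BITS) - 1


def encode_fraction(packet_flits: int) -> int:
    """Complemented packet fraction as the NI might store it in the header (assumed format)."""
    if not 1 <= packet_flits <= FIFO_DEPTH:
        raise ValueError("packet does not fit in the FIFO")
    steps = -(-packet_flits // QUANTUM)  # ceiling division: quanta needed
    code = steps - 1                     # 0 means one quantum, MASK means the full FIFO
    return ~code & MASK                  # complemented before storing


def fits(write_count: int, stored_fraction: int) -> bool:
    """Assumed admission check: is there room for the packet in this FIFO?

    The complemented fraction, scaled by QUANTUM, equals the highest
    occupancy at which the required number of quanta is still free.
    """
    return write_count <= stored_fraction * QUANTUM


if __name__ == "__main__":
    for bits in (24, 64, 128):
        flits = bits // FLIT_BITS
        stored = encode_fraction(flits)
        print(f"{bits:3d}-bit packet -> stored fraction {stored:02b}, "
              f"admitted up to write count {stored * QUANTUM}")
```

Increasing FRACTION_BITS in this sketch shrinks QUANTUM, so less buffer space is reserved beyond what a packet actually needs, which mirrors the remark above that more fraction bits improve buffer utilization.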
Figure 3.2: Packet Specification (header fields: Head/Tail bit, Destination X and Y, Packet Fraction)
Figure 3.3: Router Micro-Architecture (Input Port, Cross-Point, Central Arbiter)
3.5.2 Input Port
The router has a set of input ports, namely Local (L), North (N), East (E), South (S) and
West (W), to communicate with the local core and neighboring routers. Each input port
can support multiplexed virtual channels, associated arbiters, and header decoding logic
to make routing decisions. The three main components of the input port (Figure 3.3)
are the Virtual Channel Selector, the FIFO bRAMs and the bRAM arbiter.
A. Virtual Channel Selector (VC Selector): The decision on the availability of
space in input buffers is made by the VC selector. Along with the header, it receives the
size of the packet in a coded form. Upon receiving this data from the upstream router
or core, the VC selector compares the size of the packet with the available space in the
least occupied among the virtual channels. This is done by comparing the incoming size
against the FIFO's write count. If adequate space exists, the VC selector acknowledges
the request back upstream. The VC selector, in the process, also sets the input de-multiplexer.
The input de-multiplexer is used to route the packet from the input channel
to the appropriate multiplexed input buffer.

B. FIFO bRAMs: The buffer depth is parameterized in our router and we have set
a depth of 16 for our experiments. The buffers are implemented as bRAM First In First
Out (FIFO) memories and perform the following tasks:
1. Buffer the incoming packet partially or fully and when the downstream switch is
available, forward the head and subsequent flits.
2. Demarcate the router to core and router to router frequencies, hence supporting a
multi-clock network design.
3. Support variable packet sizes by enabling the arbiter to monitor the write count
information.
Our router hence does not restrict the size of the packet which might lead to inefficient
transfer of data. The variable packet size capability comes at a marginal (less than 5%)
increase in the area of the switch. This overhead is acceptable, considering the inefficiency
in performance and power, due to the padding of smaller packets with empty flits.
An efficient way of implementing full/empty logic is required in the case of having
buffers to separate clock domains. We synchronize the control signals by a) sending the
granularity of the packet as a fraction of FIFO size and b) setting the tail bit on the flit
before the last to terminate the connection without wasting a clock cycle.
Since the minimum supported FIFO depth in our target FPGA is 16, it is appropriate
to buffer the entire packet during contention (virtual cut-through). The write count,
empty and full status signals are integrated into the FIFO with minimum additional
logic. The capability to store flits from multiple packets in one buffer improves the
FIFO utilization. Also, we gain in network throughput as the successive flits release the
upstream buffer as soon as the head advances.
C. bRAM Arbiter: The input port also contains the control logic to make arbitra-
tion decisions. A simple round-robin approach is followed when choosing a non-empty
bRAM. Upon choosing a bRAM, the FSM pops the head flit, decodes its destination
and sends appropriate requests. XY routing is adopted in our router which simplifies the
decoder logic significantly. The number of outgoing request lines are reduced according
to the connections that XY routing permits.
Head Decoder: In XY routing, the head flit travels in the X direction and once it
reaches the destination X, it travels in the Y direction. All subsequent flits follow the
header flit in a pipelined fashion. Due to packets traveling in the X direction completely
before Y, a request is never sent from the North and South ports to the downstream
East and West ports. This nature of XY routing is used to reduce the amount of logic
in a) the Header Decoder b) the Cross Point Matrix and c) the Central Arbiter. The
above simplification of the logic translates into significant FPGA slice reduction.
3.5.3 Cross-Point Matrix
We design a multiplexer based cross point matrix to minimize the area. An alternative
would be to support cross point connections for each de-multiplexed virtual channel. The
latter approach, which produces high network throughput, adds a significant complexity
to the switch. The cross point supports parallel connections between exclusive input and
output ports and is used by the central arbiter to support simultaneous requests. Not
all cross point connections are utilized by the XY routing. After optimizing the logic, we
implement the switch with simple 4 and 2-input multiplexers (for L, N, S and E,W ports
respectively). The above optimization reduces the cross point area to only 32 slices and
hence, we gain in router area significantly.
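The multiplexer sizes quoted above follow from the turns that XY routing permits. The sketch below enumerates the allowed input-to-output connections and counts the fan-in of each output port. The port naming (a port is named after the side of the router it sits on) and the rule set are an interpretation of the description above, not the actual router source.

```python
PORTS = ("L", "N", "E", "S", "W")


def xy_allows(in_port: str, out_port: str) -> bool:
    """Can a packet that entered on `in_port` leave on `out_port` under XY routing?"""
    if in_port == out_port:
        return False   # no U-turn to the same neighbour, no local loopback
    if in_port in ("N", "S") and out_port in ("E", "W"):
        return False   # a packet already moving in Y never turns back into X
    return True


# For every output port, list the input ports that may request it.
sources = {out: [inp for inp in PORTS if xy_allows(inp, out)] for out in PORTS}
fan_in = {out: len(inputs) for out, inputs in sources.items()}

if __name__ == "__main__":
    for out in PORTS:
        print(f"output {out}: {fan_in[out]}:1 multiplexer fed by {sources[out]}")
```

Under these rules the East and West outputs can only be requested by the opposite X-direction input and the Local input, so a 2-input multiplexer suffices for them, while the Local, North and South outputs each need a 4-input multiplexer.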
3.5.4 Central Arbiter
To ensure fairness, the competing input ports are allocated based on a simple round
robin approach. The last served/denied port is given the lowest priority and placed at the end of
the queue. We gain in router performance by a) reducing the logic in the central arbiter,
as we determined that it lies on the critical path of our router and b) centralizing
the arbiter which reduces the number of req/grant signals required. Reducing the logic
and routing also minimizes the congestion in the design. This translates into gain in
performance and power.
We minimize the logic in the critical path by reducing the number of service states
in the central arbiter for the requesting downstream input. The East input port will
be requested only by the West and Local ports and similarly, West input port only
by East and Local ports. Also, it is sufficient if only one grant reaches the requesting
input port for all its requests. This reduces the number of nets to be routed hence
minimizing the congestion. We achieve an operating frequency of 357 MHz without
queuing the simultaneous requests for downstream input ports, i.e., we enable the arbiter
to handle multiple requests simultaneously. Upon granting an input port, the central
arbiter configures the multiplexers in the cross point matrix to establish a connection.
3.6 Results: Functional Simulation
In this section, we present the functional simulation results of our design in an XC4VLX100-11
device [1], on a Nallatech BenDATA™ [3] development board. We use Xilinx ISE
8.2i to synthesize, place and route our design. We validate the functionality of our design
using Modelsim 6.1c [28] and present the results below. We functionally simulate two
versions of MoCReS: a common clock (synchronous) version and a multi-clock version.
Figure 3.4: Router Functional Simulation: (a) Common Clock, (b) Independent Clocks
3.6.1 Common Clock Design
The router follows virtual cut-through flow control, based on a simple request/acknowledge
protocol. Figure 3.4(a) presents the simulation results of our standalone router operating
on a common clock. Our central arbiter and cross point are capable of establishing par-
allel input port connections without clock penalty. It can be seen that the flits coming in
through the five input ports are simultaneously switched in the appropriate directions.
Table 3.2: Standalone Router Area Results (Slices)

Component            Common Clock (CC)    Multiple Clock (MC)
Router               282                  302
Input Port           32                   55
VC Selector          4                    4
bRAM FIFO            20                   42
bRAM Arbiter         11                   11
Central Arbiter      115                  115
Cross Point Matrix   32                   32
3.6.2 Multi-Clock Design
Figure 3.4(b) shows the operation of our router when each input port receives data
at different frequencies. Once the empty signal of a FIFO is pulled low, the bRAM
arbiter decodes the header at the router’s read frequency. It can be seen that outgoing
packets are synchronized with the read clock and follow an order similar to their incoming
frequencies.
3.7 Results: Area-Performance Characterization
We choose area and performance as the two design metrics to be characterized for our
MoCReS design. A brief description of the experimental platform constructed for the
area-performance study is also presented below along with the results obtained.
[Figure: plot of Area (Slices), 0 to 700, against Channel Width (bits), 0 to 70, for four
router versions: Basic (1VC+CC), 1VC+MC, 2VC+CC and 2VC+MC.]
Figure 3.5: Router Area Vs Channel Width
3.7.1 Router Area Analysis
We synthesize, place and route the structural VHDL model of our router and present an
analysis of its FPGA resource utilization in this section. Upon tightly constraining the
area using the Xilinx PACE tool [1], the basic version (1 Virtual channel + Common Clock)
of our router consumes 282 Virtex-4 slices (558 LUTs, 289 Slice FFs), which corresponds
to a marginal 0.57% of our target FPGA device (XC4VLX100). Table 3.2 presents the
results from synthesis of the basic version of our router, identified as 1VC + CC.
To characterize our common and multi-clock (MC) router for area, we develop three
more versions of it by varying the number of virtual channels (VC). Figure 3.5 presents
the scaling of area versus channel width for various versions of the router. An increase
in the channel width increases the area of the router, mainly due to scaling of the
cross point matrix. This increase could be significant if the cross point occupies a
large share of the router area, as in most designs. However, this disadvantage is
reduced in our router design, due to the area optimizations applied in the cross point
matrix. From Figure 3.5 we observe that even for an 8× increase in channel width, the
router area increases by at most 2×.
[Figure: plot of Logic Overhead (LUTs) and Routing Overhead (Nets), on a vertical axis
scaled by 10^4 (0 to 2), against Channel Width (Bits), 0 to 70.]
Figure 3.6: 3 × 3 Mesh FPGA Utilization
3.7.2 FPGA NoC Resource Analysis
For this analysis we implemented a 3 × 3 mesh topology of MoCReS routers (with 1 VC
+ CC) using the standard Xilinx ISE design flow. Our 3 × 3 mesh framework consumes
only 6.1% of the available FPGA device area, leaving the remaining logic to efficiently
implement the IPs. Figure 3.6 presents the scaling of logic and routing utilization with
channel width in a mesh network. The linear scaling of routing resources utilized with
increasing channel width demonstrates the suitability of the mesh topology for FPGAs.
3.7.3 Performance Analysis
We make use of Xilinx PACE [1] to tightly constrain the critical path and estimate the
post place and route delay. As the routers will be internally connected to the cores,
we need not consider the pad to pad delays while estimating the maximum frequency.
The design is implemented with a Router Functional Module (RFM) [29] wrapper to
estimate the accurate operating frequency. The standalone version of our basic router
can operate at 357 MHz. Therefore, our 8 bits/channel router has a maximum throughput
of 2.85 Gbits/s.
[Figure: plot of Avg. Latency (cycles), 25 to 125, against Injection Rate
(Flits/Cycle/Node), 0.2 to 1.0, for the Common Clock and Multiple Clock networks.]
Figure 3.7: Average Latency Vs Injection Rate of 3 × 3 Mesh
In our router, the head flit advances to the next node while the remaining flits flow
in a pipelined fashion. In this scheme the network latency depends only on path length
H (number of hops). In the absence of channel contention, the latency of our network L
can be expressed as:
L = 7 × H + B/w (3.1)
where, B is the number of bytes in the packet and w is the number of bytes switched
per clock cycle. The factor 7 in the expression is the setup latency incurred at every
router hop in the MoCReS router. This latency is due to the decoding & arbitration
performed based on the header flit in a packet. If $L_i$ denotes the latency of the $i$th
packet in cycles, then the average latency of the common and multiple clock networks can
be expressed as

\[
L_{cc\,avg} = \frac{\sum_{i=1}^{N} L_i}{N}
\quad \text{and} \quad
L_{mc\,avg} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{H_i} \frac{f_j}{f_{worst}}}{N}
\tag{3.2}
\]

where $f_j$ and $f_{worst}$ represent the frequencies of the $j$th router and the slowest
router, respectively, and $L_{mc\,avg}$ is given in cycles of frequency $f_{worst}$.
Multi-Clock Experimental Platform: In order to evaluate the performance of the
proposed multi-clock framework, we utilize our VHDL model of the router and simulate a
3 × 3 mesh. We implement a wrapper to generate packets at the local core frequency. Router
frequency values are extracted after the topology synthesis, placement and routing stages.
For varying configurations (resource availability, inter-router distance, bRAM/dRAM
FIFO versions), the router frequency can degrade by up to 18% [29]. Figure 3.7 shows the
latency versus injection rate curve for the common clock and multi-clock versions. For
the common clock case, the network frequency was 286 MHz and for the multiple clock
case, the frequency ranged from 286 MHz to 357 MHz. The X-axis represents the injection
rate expressed as the number of flits injected from every node in one cycle. The Y-axis
plots the measured average latency of packets in each case. It can be seen that the
increase in performance of the proposed framework significantly delays network saturation.
Reducing Setup Latency
Connection between the upstream and downstream routers is established through a
req/grant protocol. Upon receiving the first flit, the bRAM arbiter in the downstream
router:
• arbitrates the bRAMs to decide which packet progresses forward in that cycle
• pops the head flit from the bRAM FIFO
• decodes the header flit and sends appropriate request to the central arbiter
The setup latency of our router (to accomplish the above sequence of operations), as
shown in equation 3.1, is 7 cycles. This latency is a constant for all communications in
the network, i.e., for every router hop, a latency of 7 cycles is incurred.
However, by utilizing the First-Word Fall-Through (FWFT) capability available in
Xilinx FIFOs, this latency can be reduced by one cycle. Here, the head flit, when pushed
into the FIFO, appears at the output bus in the same clock cycle, thereby feeding itself
as input to the bRAM arbiter to initiate the request process to the Central Arbiter. The
latency of our network L can now be expressed as:
L = 6 × H + B/w (3.3)
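Equations 3.1 and 3.3 differ only in the per-hop setup term, so a short calculation makes the saving concrete. The sketch below is illustrative and not part of the MoCReS tool flow: the function name, its parameters and the example packet (16 bytes over 4 hops with 1 byte switched per cycle) are assumptions chosen for the example.

```python
def network_latency_cycles(hops, packet_bytes, bytes_per_cycle, setup_cycles=7):
    """Contention-free latency L = setup_cycles * H + B / w, in clock cycles.

    hops            -- H, number of router hops on the path
    packet_bytes    -- B, number of bytes in the packet
    bytes_per_cycle -- w, bytes switched per clock cycle
    setup_cycles    -- per-hop setup latency (7 in Eq. 3.1, 6 in Eq. 3.3)
    """
    if hops < 0 or packet_bytes < 0 or bytes_per_cycle <= 0:
        raise ValueError("hops and packet_bytes must be >= 0, bytes_per_cycle > 0")
    return setup_cycles * hops + packet_bytes / bytes_per_cycle


# Hypothetical example: 16-byte packet, 4 hops, 8-bit channel (1 byte per cycle).
baseline = network_latency_cycles(4, 16, 1, setup_cycles=7)  # Equation 3.1
with_fwft = network_latency_cycles(4, 16, 1, setup_cycles=6)  # Equation 3.3
print(baseline, with_fwft, baseline - with_fwft)  # 44.0 40.0 4.0
```

Under these assumed values the saving equals one cycle per hop, so it grows with path length and is independent of packet size.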
3.8 Conclusions
We present an area-efficient, multi-clock, on-FPGA virtual cut-through router that has
low area and high performance compared to previously reported designs. We
introduce optimizations in the central arbiter and cross point implementation to gain
in area and performance. We extend the router to a multi-clock NoC framework with
routers functioning at independent frequencies. We validate the functioning of a
standalone router and a 3 × 3 mesh framework and characterize the network for area and
performance. We use this framework as the baseline design to perform the experiments
presented in the subsequent chapters.
3.8.1 Limitations of our Packet Switched MoCReS Framework
The MoCReS framework for FPGA based NoC design is an important contribution of
this thesis. It has low area and high clock rate in comparison to other proposed FPGA
NoCs. However, the performance improvement derived by this NoC is limited by its
strict packet-switched nature. There is a significant performance overhead involved in
converting data to flits, encoding the headers, and serializing them into packets. This
overhead is particularly prominent for IP cores placed close to each other. Overcoming
this overhead is our motivation for designing a novel architecture that is presented in
Chapter 5 which supports an additional time-multiplexed circuit-switched layer for a
high throughput data transfer between nearby IP cores.

Chapter 4
NoC Power Analysis
4.1 Introduction
With increasing device sizes and capabilities, power dissipation in FPGAs has become a
primary concern. Further, FPGA implementations for portable hand-held devices demand
low power consumption to extend the battery life. Therefore, it is essential to attain a
balance between power dissipation and performance in an FPGA based NoC. In this
chapter, we analyze power consumption in our FPGA based implementation of an NoC.
The power dissipated across various components of the NoC is presented. Further, we
discuss the power consumption from the FPGA resource utilization perspective. Once
fully developed, the model could be used to enhance the traditional FPGA design flow
for multiprocessor applications.
Further, we feature a comparison of MoCReS with an alternate approach (LiPaR [11])
in this chapter. The alternate router does not support independent clock frequencies,
and thereby sustains a performance bottleneck. Experimental evidence indicates that the
multi-clock novelty has a marginal power overhead due to the additional clock resource
utilized in the FPGA implementation. Moreover, to understand the impact of power
optimization techniques on the NoC, we determine the power consumption share of NoC
when a typical image compression application is implemented on our target device and
report the same in this chapter. It is shown experimentally that NoCs in FPGA consume
around 45% of total power dissipated in the chosen application.

The above NoC overhead in power is due to significant utilization of programmable
logic and routing resources in the FPGA. In an attempt to model the power overhead of
the two routers, we present the resource utilization results extracted using the approach
described in Section 7.2.1.
4.2 Related Work
Research in [30] addresses power-performance modelling of NoCs in the ASIC domain. It
characterizes a 4 × 4 mesh NoC for total power consumption using a cycle-accurate RTL
model. Hu et al. [31] present a mapping technique to reduce dynamic power consumption
in ASICs.
Vestias et al. [32] propose an approach to explore the design space of an SoC implemented
with an NoC backbone. The authors validate their technique by mapping a JPEG
encoder application and optimizing the design for performance. However, the authors
do not characterize the power consumption in their design. The main drawback of the
router that the authors implemented in [32] is that it utilizes a store-and-forward flow
control mechanism. The latency of a packet is very high, as it is directly proportional to
the packet size. Moreover, this implementation is not suitable for power-efficient NoCs
because of its high buffer requirements. The buffers, in addition to increasing the area
consumed, also switch continuously, thus contributing a significant portion of the
power consumed.
To the best of our knowledge this is the first work to address power characterization
of an NoC targeted for reconfigurable platforms. We also determine the NoC share of
total power consumption using a typical image processing application and report the
results in this chapter.
4.3 MoCReS Power Consumption
In this section we analyze the power consumed in the MoCReS router architecture
proposed in Chapter 3. The resource utilization of the MoCReS router is determined
through place & route using the Xilinx ISE tool. To estimate the power consumed,
the XPower utility of the ISE suite is used.
4.3.1 Components of Power
The central component of an NoC is the router, and its power consumption has a
significant impact on the total power consumed by the NoC. The two main components of
FPGA power consumption are the static power (referred to as quiescent power) and the
dynamic power. Even though current FPGAs are fabricated in technologies down to 65 nm,
the dynamic power tends to dominate the total power in FPGAs, due to high operating
frequencies (toggle rates). This trend is in contrast to ASICs, where at current
technology nodes the leakage power dominates. Therefore it is important for an estimation
methodology to account for both components of power.
The amount of quiescent power is largely dependent on the target device (technology
and preset operating voltages) and the amount of logic utilized by the design. On the
other hand, the dynamic component depends on,
• Switched Capacitance (Resource Utilization)
• Transition Activity
• Operating Voltage
Switched capacitance varies with the type of logic/routing resource utilized for the
design. Transition activity of every signal in the design can be expressed in terms
of the clock toggle rate. In our target Virtex-4 FPGAs [1], the logic (CLBs), block
RAMs, clock tree and entire routing resources operate under the same voltage source
(VCCINT = 1.2 V). The Input/Output blocks in our target FPGA operate at a higher
voltage level (2.5 V).

Table 4.1: Standalone: MoCReS (1VC+CC) Power Consumption

Power Component    Operating Voltage (V)    Power (mW)
Static Power       1.2                      402.56
Dynamic Power      1.2                      43.62
Clock Power        1.2                      14.0
Logic Power        1.2                      12.0
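The three factors listed in this section (switched capacitance, transition activity and operating voltage) are commonly combined in the standard first-order CMOS dynamic power relation. The expression below is that textbook form and is not taken from this chapter; the symbols $\alpha_i$ (transition activity of net $i$ relative to the clock), $C_i$ (its switched capacitance) and $f_{clk}$ are introduced here for illustration only, and the constant of proportionality depends on how activity is defined.

\[
P_{dyn} \;\propto\; V^{2} \, f_{clk} \sum_{i} \alpha_i \, C_i
\]

Read this way, fixing the voltage and the activity across two designs, as in the comparison of Section 4.3.2, leaves resource utilization (through $C_i$) and clock frequency as the remaining sources of difference in dynamic power.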
For a stand alone router, we estimate the static and dynamic power contributed
by the design. Clock lines in FPGA design contribute to a significant portion of the
dynamic power consumed. We report the standalone power consumption of MoCReS in
Table 4.1. The dynamic power dissipated under 2.5V category is very marginal due to
the Router Functional Module (RFM) that was wrapped around the router for accurate
estimation of power. The RFM restricts the number of Input/Output blocks that will be
utilized by the router. In this section, we also obtain the router models of an alternate
design, LiPaR [11] and compare its power overheads with our modified multi-clock router
framework.
Even though the dynamic component of the standalone router is a small percentage
of the total power, for a large NoC topology implementation with several instances of the
router (and high switching activity), the dynamic component will equally dominate the
total power. Table 4.2 presents the power dissipated in a 3 × 3 mesh implementation
of MoCReS (1VC+CC).
Table 4.2: 3 × 3 Mesh: MoCReS (1VC+CC) Power Consumption

Power Component    Operating Voltage (V)    Power (mW)
Clock Power        1.2                      124.96
Logic Power        1.2                      97.59
4.3.2 Comparison with LiPaR
In this section, we compare the power consumed by our router architecture with an
alternate router design, LiPaR [11]. Our MoCReS router supports independent clock
frequencies as opposed to LiPaR [11], thereby resulting in increased performance. In
order to evaluate the power trade-offs involved in our design novelty, we compare the
two router designs keeping the following parameters constant:
• Same target device (XC4VLX100 [1])
• 16 Flit Buffer Size
• Channel Width of 8 bits
• bRAM FIFO Implementation
• Number of (Ports,VCs) : (5,1)
• Identical transition activity
In addition to the above parameters, the operating frequency for the power experiments
was set at 100 MHz, which corresponds to the critical path delay of LiPaR.
Standalone Version: Activity data for power estimation (.vcd) is obtained by simulating
the post place & route model of the two designs (with random inputs). XPower [1]
takes the design description (resource utilization) as a .ncd file along with the .vcd
generated above. Table 4.3 compares the dynamic and quiescent power consumed by the
two approaches on the same target FPGA device. Because MoCReS utilizes fewer resources,
there is a gain in static power and in the dynamic power contributed by the logic
resources. This is due to the logic optimizations in the cross-point matrix and central
arbiter, which led to lower FPGA resource utilization. Due to the increase in the dynamic
component, the power overhead in our approach is marginally (11.53%) more than the
alternate design presented. It is important to note that the equivalent router for LiPaR
is the basic version of MoCReS (1VC+CC), as LiPaR does not support virtual channels and
operates on a single clock. Table 4.1 presents the power consumed by that version of
MoCReS.

Table 4.3: Standalone: MoCReS (1VC+MC) vs LiPaR

Power Component    Operating Voltage (V)    MoCReS Power (mW)    LiPaR [11] Power (mW)
Static Power       1.2                      489.06               496.12
Dynamic Power      1.2                      57.16                51.25
Static Power       2.5                      714.15               719.08
Dynamic Power      2.5                      14.06                17.82
Clock Power        1.2                      28.57                21.26
Logic Power        1.2                      22.45                25.19
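The 11.53% figure quoted above can be reproduced from the 1.2 V dynamic power entries of Table 4.3. The sketch below is a plain arithmetic check; the helper name `relative_change_pct` is an assumption introduced for illustration, and the numbers are copied from the table.

```python
def relative_change_pct(value, reference):
    """Percentage by which `value` exceeds (positive) or undercuts (negative) `reference`."""
    return 100.0 * (value - reference) / reference


# Entries copied from Table 4.3 (1.2 V rows), in mW: (MoCReS, LiPaR).
TABLE_4_3 = {
    "static": (489.06, 496.12),
    "dynamic": (57.16, 51.25),
    "clock": (28.57, 21.26),
    "logic": (22.45, 25.19),
}

for component, (mocres_mw, lipar_mw) in TABLE_4_3.items():
    change = relative_change_pct(mocres_mw, lipar_mw)
    print(f"{component:8s} MoCReS vs LiPaR: {change:+.2f}%")

dynamic_overhead = relative_change_pct(*TABLE_4_3["dynamic"])
print(f"{dynamic_overhead:.2f}")  # 11.53
```

The same helper shows the sign of each component: static and logic power come out negative (a gain for MoCReS), while clock power and total dynamic power come out positive.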
Multi-Clock Feature: Our modified MoCReS architecture supports independent clock
frequencies for router instances, thereby allowing each router to function at the highest
individual clock rate that the placement, routing and switch complexity constraints
dictate. Therefore the number of clock nets utilized could be higher and can cause
an increase in power consumed. With clock lines typically having higher fan-outs, the
switched capacitance and therefore the dynamic power associated with the clock nets
can be significant. The Xilinx power estimation framework permits estimating the power
consumed by the clock lines independently for both designs. It can be seen that
there is only a marginal 11.53% additional power overhead in our MoCReS approach,
which gives a significant performance improvement.

Table 4.4: Power Dissipation Across NoC Components

S.No   NoC Component        Dynamic Power (mW)
1.     Input Port           7.88
         FIFO bRAMs         6.02
         VC Arbiter         0.35
         bRAM Arbiter       1.72
2.     Cross-Point          12.11
3.     Central Arbiter      15.69
4.3.3 NoC Component Power
To effectively characterize the FPGA NoC implementation for power, we determine the
dynamic and quiescent power of each major component in the NoC. The experimental
platform involves an incremental place and route of the NoC. During every stage we
retain all the components of it and stimulate the part of design under investigation. We
extract dynamic and quiescent power for identical activity rates.
In order to better understand the contributions to dynamic power of each of the
router components, we activated parts of the MoCReS design with random input vectors
(following a uniform distribution) and observed the dynamic power consumed. It can
be seen that the buffers of the router consume the highest dynamic power. The results
are in agreement with those extracted for ASICs [30]. It is to be noted that there are
five instances of the Input Port component. We believe these results will be useful in
developing future NoC designs targeted for FPGAs with optimized power consumption.
Stimulus is applied to the primary inputs. We use Modelsim 6.1c with TCL scripts
to extract the transition activity of internal nodes. The activity generated for a
component alone is applied to it after an incremental place and route, with the
remaining components' contribution eliminated. We use the technique developed by
Arole [33] in his power profiler. We exhaustively analyze the power consumed in every
resource of the NoC by this methodology.
Resource Utilization and Dynamic Power: Based on the dynamic power consumed
by every component and the resources it utilizes, which are extracted using ncd2xdl,
we model the dynamic power across every resource. Table 4.5 compares the resource
utilization between our router and the baseline design. A significant reduction in the
number of nets and in the routing and logic resources used contributes to the gain in
dynamic power.

Table 4.5: FPGA Resource Utilization: MoCReS (1VC+MC) vs LiPaR

                            Routing Resources
Router    Logic (Slices)    Single    Double    Hex    Long    Nets
MoCReS    302               5633      1402      439    5       902
LiPaR     352               8504      2314      740    8       1177
4.4 Conclusions
In this chapter, we discuss the power consumed by our NoC framework on FPGA. We
determine the various power components in the standalone design and a 3 × 3 mesh NoC.
Further, to determine the power overhead incurred due to supporting the multi-clock
feature, we compare its power consumption with an alternate design that supports only
one frequency. Results show a marginal 11.53% increase in dynamic power compared to
the baseline approach. Further, we determine the power contributed by various
components of the router. Results show that the buffers consume the majority of the power
in the NoC design. An account of the FPGA resources utilized by MoCReS in comparison
with the alternate design is also presented in this chapter.
Chapter 5
Hybrid Two-Layer Router Architecture
5.1 Introduction
The two main concerns with NoC designs that are strictly packet-switched are the control
and serialization overheads involved in transferring data between IP cores that are placed
close to each other in the FPGA. In order to ensure high throughput between these cores,
we advocate time-multiplexed circuit-switched connections. In addition to this mode of
transfer, the router also preserves the online nature of communication between more
distant cores through the packet-switched layer. The area-efficient MoCReS architecture
presented in Chapter 3 is modified to support both of the above-mentioned layers of
operation. The design goals and issues involved in the hybrid two-layer architecture are
presented in this chapter. We also develop a SystemC model of our router, both to
functionally verify the design and to vary its specifications and obtain performance
results rapidly through simulation. We present the results and analysis of the novel
router architecture in this chapter.
5.2 Motivation
Packet-switching performs online scheduling by dynamically negotiating communication
between the cores. An alternate technique, namely circuit-switching, offers
high-throughput dedicated connections to overcome the performance drawbacks of
packet-switching by scheduling time-multiplexed communication across the cores. Even
though this static scheduling requires all the communication patterns to be known
beforehand, it can provide a very high throughput with marginal area overhead (for
storing schedules). We propose a modified router architecture which interfaces multiple
IP cores to the router and supports packet-switching for inter-router transfers and
time-multiplexed circuit-switching for IP cores connected to the same router. This
technique also eliminates the req/grant protocol latency and the serialization and
control overheads for data transfers between cores placed close to each other in FPGAs
and mapped to the same router.
5.2.1 Packetization and Control Overheads
In this section, we quantify the overheads associated with the existing baseline approach
(MoCReS). Control and Packetization are the two main overheads associated with the
MoCReS framework.
1. Control Overhead: In MoCReS, connections between various ports are established
through a req/grant protocol, which involves round-robin arbitration in the case
of requests for a common port (conflicts). From Chapter 3, we see that it takes at least 6
cycles for the data at the input port to appear at the output of a router (as input to the
downstream router/local IP). This setup latency is a fixed overhead in addition to the
delays due to network congestion.

2. Packetization Overhead: Due to the nature of the interconnection network, the
channel width between ports/routers is limited to a fixed size (8 bits in the baseline
version of MoCReS). Due to this fixed channel width, the communication data that is to be
sent over the network must be quantized into flits. A variable number of flits constitutes
a packet. If F is the number of flits in a packet and b is the channel width, then F/b is
the serialization latency associated with the communication.
5.3 Related Work
We target our proposed NoC framework for reconfigurable computing platforms and
therefore we restrict our discussions in this section primarily to existing FPGA based
NoCs. NoCs were introduced into the FPGA domain mainly to simplify tile-based
reconfiguration [4] [5], and their potential as an effective communication architecture is
largely unexplored [34]. Research in [10] [21] addresses the capabilities of FPGAs to
support NoC based multi-processor applications. Hilton et al. [7] incorporate flexibility
into their design for FPGA based circuit-switched NoCs. However, their strictly
circuit-switched router suffers from signal integrity and path reservation issues, which
we overcome in our design. SoCBUS [35] proposes a circuit-switched router with a packet
based setup. Here, control packets are responsible for setting up strict circuit-switched
connections, which is different from our two-layer approach. Research in [36] [7] [6]
also presents FPGA based NoCs. The above designs ignore implementation-level
area-performance trade-offs while proposing the architecture, thereby limiting themselves
to a system-level performance analysis.

To the best of our knowledge, this is the first work to propose an FPGA-suitable
hybrid router architecture integrated with an automatic topology synthesis framework
that satisfies the bandwidth requirements of an application while optimizing its area
overhead.
5.4 Architecture Description
In this section, we first present the modified router micro-architecture, followed by its
architectural advantages and the design issues involved. The network topology and the
flow control for the packet-switched layer are kept the same as presented in Chapter 3 [37].
[Figure: block diagram of the router with directional ports N, E, S and W, four local
IPs (IP0 to IP3) each attached through a network interface (NI), a Packet-Switched Layer
and a Circuit-Switched Layer.]
Figure 5.1: Hybrid Two-Layer Router Architecture
Network Topology: Mesh networks have minimum area overhead (reduced long lines) [10] [37],
low power consumption and map well to the underlying routing structure of FPGAs.
Hence, we choose a mesh topology to optimize logic and routing in FPGAs, and to
provide sufficient resources for the IP cores.
Flow Control: Our router supports multi-clock virtual cut-through flow control with a
deadlock-free XY routing. The switch complexity involved in the above choice is more
suitable for a light-weight implementation [37].
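Dimension-ordered XY routing, as named above, resolves the X offset completely before moving in Y, which is what rules out cyclic channel dependencies in a mesh. The sketch below illustrates the per-hop decision only; the coordinate convention (X grows towards East, Y grows towards North), the port letters and the function names are assumptions made for the example and are not taken from the MoCReS VHDL model.

```python
# Assumed convention: moving East increases x, moving North increases y.
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def xy_next_port(current, destination):
    """Output port chosen at router `current` for a packet bound for `destination`."""
    cx, cy = current
    dx, dy = destination
    if dx > cx:
        return "E"
    if dx < cx:
        return "W"
    # X offset is resolved; only now may the packet move in Y.
    if dy > cy:
        return "N"
    if dy < cy:
        return "S"
    return "L"  # arrived: deliver to the local port


def xy_route(source, destination):
    """List of output ports taken from source to destination, ending with 'L'."""
    ports, current = [], source
    while True:
        port = xy_next_port(current, destination)
        ports.append(port)
        if port == "L":
            return ports
        current = (current[0] + STEP[port][0], current[1] + STEP[port][1])


print(xy_route((0, 0), (2, 1)))  # ['E', 'E', 'N', 'L']
```

The number of non-local entries in the returned list is the hop count H that enters the latency expressions of Chapter 3.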
5.4.1 Cross-Point Matrix
Architecture Modifications: The modified switch is comprised of two layers of
operation: a high throughput time-multiplexed circuit-switched layer (C-layer) and a
multi-clock packet-switched layer (P-layer). A variable number of IP cores connected to
the switch participate in the C-layer, thereby achieving guaranteed throughput and more
predictable latencies between IP cores placed close to each other in the FPGA.
Figure 5.1 presents the novel two-layer hybrid router architecture. This modified
router has four local IP ports, in addition to the four directional ports. Further, in
this case two of the four local IPs (IP0, IP3) participate in the time-multiplexed
circuit-switched layer. Using the packet-switched layer, all four IPs can communicate
with the neighbouring routers through the directional ports.

The cross-point matrix is multiplexer based, as opposed to providing connections for
each virtual channel. The following are the design issues involved with the cross-point.
Packet-Switched Cross-Point: In the packet-switched layer, the directional input
ports (N, E, S, W) are multiplexed to every local port. Therefore cross-point connections
are introduced to support these additional local ports. However, all the connections
between the local ports in this layer are removed, as they are connected in the
circuit-switched layer. The ports connected through the C-Layer (IP0, IP3) cannot
participate in the P-Layer to transfer data between themselves. This translates into a
gain in area, which we utilize to increase the bandwidth available.

Circuit-Switched Cross-Point: Let Li be the total number of local IPs and Pi
be the number of ports participating in the circuit-switched layer. The bus width of this
cross-point is currently set to 32 bits in order to support a very high bandwidth. Further,
this cross-point can handle a maximum of Pi high throughput parallel connections. The
scheduling memory configures this cross-point during various time slots.
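The role of the scheduling memory can be pictured as a small table that is read once per time slot and yields the multiplexer selection for every C-layer output. The sketch below is a behavioural illustration under stated assumptions: the table layout (one dictionary per slot, mapping a source port to its list of destinations), the port names L0, L1, L3 and the function name are hypothetical and do not describe the actual memory format used in the router.

```python
def crosspoint_config(schedule, cycle):
    """Return {destination_port: source_port} for the slot active in `cycle`.

    `schedule` is a list of slots; the slots repeat cyclically. Each slot maps
    a source port to the list of ports it drives (more than one = multi-cast).
    """
    slot = schedule[cycle % len(schedule)]
    config = {}
    for source, destinations in slot.items():
        for dest in destinations:
            if dest == source:
                raise ValueError(f"{source} cannot be connected to itself")
            if dest in config:
                raise ValueError(f"{dest} is driven by two sources in one slot")
            config[dest] = source
    return config


# Hypothetical three-slot schedule for local ports participating in the C-layer.
SCHEDULE = [
    {"L0": ["L3"]},        # slot 0: L0 -> L3
    {"L3": ["L0"]},        # slot 1: L3 -> L0
    {"L0": ["L1", "L3"]},  # slot 2: L0 multi-casts to L1 and L3
]

for cycle in range(4):
    print(cycle, crosspoint_config(SCHEDULE, cycle))
```

Because each output has exactly one source per slot, no request/grant exchange is needed at run time; conflicts are excluded when the schedule is built.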
Router Channel Widths: Due to the high throughput requirement between the cores
participating in the circuit-switched layer, we set the channel width to 32 bits
(corresponding to the data width of the MicroBlaze soft processor). In the packet-switched
layer, we retain the bus width of MoCReS (8 bits/channel). However, the choice of an
appropriate channel width is a trade-off between resources available and bandwidth required.
[Figure: block diagram of the Cross Point Matrix (blocks labelled 8:1 and 4:1,
Directional Pkt S/W and Local Ckt S/W) with 3-bit select lines Msel_L0, Msel_L3, Msel_N,
Msel_E, Msel_S and Msel_W driven by the Central Arbiter; request/grant pairs
Req_in/Grnt_in for ports W, L0, S, L3, N and E; and a VHDL process Proc_L0, sensitive to
FROM_L0, whose case statement on From_L0 lists only the states L0_Last_N, L0_Last_E,
L0_Last_S and L0_Last_W, annotated "N, E, S, W Req Check Only" and "No L1-L3 Last Grant
States".]
Figure 5.2: Modified Central Arbiter Model
5.4.2 Central Arbiter
The Central Arbiter is responsible for configuring the simultaneous connections by setting
the cross-point in the P-Layer. We run parallel FSMs to ensure that no queuing takes place
between requests. As long as the participating IPs request mutually exclusive ports, the
connections are established in parallel. In case of queuing/conflicts, the arbitration is
performed through the round-robin approach. The IPs that participate in the C-Layer will
not need arbitration between themselves in the P-Layer. We perform state reduction in the
FSMs corresponding to those inter-local port connections, i.e., in correspondence with the
inter-local IP connections that are removed (Section 5.4.1) in the packet-switched layer.
The Central Arbiter is also customized to not support states for these connections. The
simplicity of round-robin arbitration coupled with the above state reduction translates
into significant area savings. Figure 5.2 shows the modified central arbiter model.
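The arbitration behaviour described above (independent decisions per output port, round-robin only when several inputs want the same output, and no states for removed local-to-local connections) can be sketched behaviourally. This is not the VHDL FSM of the router: the class and function names, and the particular requester lists given to each output, are assumptions chosen to mirror the text.

```python
class RoundRobinArbiter:
    """Round-robin arbiter for one output port over a fixed list of requesters."""

    def __init__(self, requesters):
        self.requesters = list(requesters)
        # Start so that the first search begins at index 0.
        self.last = len(self.requesters) - 1

    def grant(self, requesting):
        """Grant the first requesting input after the previously granted one."""
        n = len(self.requesters)
        for offset in range(1, n + 1):
            idx = (self.last + offset) % n
            if self.requesters[idx] in requesting:
                self.last = idx
                return self.requesters[idx]
        return None


def arbitrate(arbiters, requests):
    """requests: {input_port: wanted_output}. Returns {output_port: granted_input}.

    Every output is decided independently, so requests for mutually exclusive
    outputs are all granted in the same step.
    """
    grants = {}
    for output, arbiter in arbiters.items():
        contenders = {inp for inp, out in requests.items() if out == output}
        winner = arbiter.grant(contenders)
        if winner is not None:
            grants[output] = winner
    return grants


# Assumed requester lists: local output L0 is reachable only from N, E, S, W
# (no local-to-local states); output E only from W and the local ports.
arbiters = {
    "L0": RoundRobinArbiter(["N", "E", "S", "W"]),
    "E": RoundRobinArbiter(["W", "L0", "L3"]),
}
requests = {"N": "L0", "S": "L0", "W": "E"}
first = arbitrate(arbiters, requests)
second = arbitrate(arbiters, requests)
print(first)   # {'L0': 'N', 'E': 'W'}
print(second)  # {'L0': 'S', 'E': 'W'}
```

Leaving the local ports out of the L0 requester list is the software analogue of the state reduction: the arbiter simply has fewer last-grant states to track.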
5.4.3 NI Design
The network interface arbitrates the choice of packet/circuit-switched layer and is also
responsible for supporting variable size packets.
Mode Switching: Upon receiving the target IP co-ordinates, the network interface triggers
the mode signal to decide whether the packet will be decoded to leave the router or the
cross point is triggered in circuit-switched mode.
Variable Packet Sizes: As mentioned in Section 3.5.1, during packet-switched transfer,
the network interface is also responsible for encoding the header with:
1. Packet Size (As a fraction of bRAM depth)
2. X co-ordinate of destination IP
3. Y co-ordinate of destination IP
The packets transferred through the network can be broadly classified as control (fewer
flits) or data. Therefore, the packets will be of varied sizes. The NI encodes
the packet size as a fraction of the total bRAM depth along with the header. This novelty
improves buffer utilization, thereby increasing the performance of the NoC.
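To make the three header fields concrete, the sketch below packs them into a single 8-bit header flit. The bit layout is purely hypothetical: the text does not give field widths, so the split into 2 bits of size code and 3 bits each for the X and Y co-ordinates, the meaning of the size code (quarters of the buffer depth) and all function names are assumptions for illustration. Only the 8-bit channel width and the 16-flit buffer size come from the document.

```python
# Hypothetical field widths for an 8-bit header flit (not from the source).
SIZE_BITS, X_BITS, Y_BITS = 2, 3, 3


def encode_header(size_code, x, y):
    """Pack (size_code, x, y) into one header flit: [size | x | y]."""
    for name, value, bits in (("size_code", size_code, SIZE_BITS),
                              ("x", x, X_BITS), ("y", y, Y_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return (size_code << (X_BITS + Y_BITS)) | (x << Y_BITS) | y


def decode_header(header):
    """Inverse of encode_header: return (size_code, x, y)."""
    y = header & ((1 << Y_BITS) - 1)
    x = (header >> Y_BITS) & ((1 << X_BITS) - 1)
    size_code = header >> (X_BITS + Y_BITS)
    return size_code, x, y


def packet_flits(size_code, bram_depth=16):
    """Assumed meaning of the size code: (size_code + 1) quarters of the buffer depth."""
    return (size_code + 1) * bram_depth // (1 << SIZE_BITS)


header = encode_header(size_code=1, x=2, y=1)
print(hex(header), decode_header(header), packet_flits(1))  # 0x51 (1, 2, 1) 8
```

Encoding the size as a fraction rather than an absolute count keeps the field short, at the cost of rounding packet lengths up to the next fraction of the buffer.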
5.4.4 Design Parameters
In order to quickly explore the NoC design space, we have parameterized the structural
VHDL model of our router for:
1. Total number of ports
2. Channel width
3. Virtual Channels/port
4. Number of ports participating in the C-Layer

By varying the above parameters, we develop a component library, MoClib, which we
use to characterize variants of the router for area and operating frequency.
5.5 Architectural Advantages
Bandwidth Increase: Bandwidth available in a switch is the product of the number of
ports, operating frequency and channel width. The C-layer has minimum logic
overhead with no buffering and can operate at a clock rate significantly higher than the
P-layer. Furthermore, increasing the number of ports also scales the available bandwidth
in a switch. Moreover, the absence of control/serialization overheads (req/grant)
also increases the throughput.
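The product described above can be written down directly. The sketch below is only an
illustration: the function name switch_bandwidth_mbytes is hypothetical, 1 MB is taken as
10^6 bytes, and the example values (5 ports, 324 MHz, 16-bit channel) are chosen for the
arithmetic and are not a configuration reported in this chapter.

```python
def switch_bandwidth_mbytes(ports, freq_mhz, channel_bits):
    """Aggregate switch bandwidth in MB/s: ports x frequency x channel width."""
    bytes_per_cycle = channel_bits / 8
    # MHz (10^6 cycles/s) times bytes/cycle gives MB/s with 1 MB = 10^6 bytes.
    return ports * freq_mhz * bytes_per_cycle


# Illustrative values only: 5 ports, 324 MHz, 16-bit channel.
total = switch_bandwidth_mbytes(5, 324, 16)
per_port = total / 5
```

The same expression shows why a faster C-layer clock or an extra port raises the
available bandwidth even when the channel width is unchanged.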
Power Savings: The amount of logic required for the NoC reduces with router count,
thereby saving static power. Further, with increasing number of ports within a router, the
average packet latency is also reduced [36]. Therefore dynamic power drops considerably
with reduction of router hops.
Guaranteed Throughput: The time-multiplexed nature of the C-Layer scheduling
provides good Quality of Service (QoS) to the application, particularly, between cores
placed close to each other. Otherwise, the NoC would have to support area expensive
QoS protocols to ensure the required bandwidth.

Inherent Multi-Cast Capability: The cross-point in the C-layer can be configured
simultaneously for a multi-cast (one to many destinations) operation among IPs con-
nected to the same router without any penalty in performance. Further, this capability
also optimizes the area required for storing the schedules (with fewer bits required to
encode the configuration data of the circuit-switched network).
Figure 5.3: SystemC Simulation (waveform of P_CLK, the P-Layer ports N/E/S/W/L2 in and out, C_CLK, and the C-Layer ports L0_C, L1_C and L3_C over 100 ns, annotated with the setup latency and a multi-cast operation)
5.6 System-Level Router Model
With increasing design complexities, there is a need for rapid design space exploration
that makes use of a set of specifications. We model our NoC router framework using
SystemC. By doing so, we functionally verify the model as well as set up a platform to
estimate the advantages of this architecture over the baseline approach.
SystemC is a description language that abstracts the computation elements of a design
by behaviors (or processes) and simplifies the communication between the cores using
transaction level modelling. The framework has a set of library routines and macros
implemented using C++. The behavior of the hardware to be modeled is captured by
simulating concurrent processes coded in C++.
SystemC Tool Flow: Every component in the router is modeled in C++ as a process.
These .cpp files are compiled and executed with the SystemC engine, which is itself written
in C++. We use the open-source SystemC version 2.1 to compile our router design. The
set of .cpp files is first compiled with the appropriate command options. Then, an
executable is created to run the simulation. We dump a Value Change Dump (VCD)
file from the engine.
The .VCD file of the router model can be used as follows:
• Applied to a standard simulation tool to verify the functionality of the model by
viewing the waveform
• Used to estimate the preliminary power consumed by the implementation on FPGAs, by
using XPower and the architecture information (Virtex-4)
Figure 5.4: Design Parameters Vs Area (axes: # Circuit s/w Ports, # Packet s/w Ports, Area (Slices))
5.7 Synthesis Results
In this section we present the area and synthesis results for our modified router
implemented on Xilinx Virtex-4 [1].
The additional bandwidth offered by the proposed router comes with an increase
in switch complexity. The amount of FPGA logic and routing resources consumed by
a router instance depends on its complexity. Figure 5.4 presents this variation in
switch area with the number of ports (C- & P-Layer) it supports. Further, the operating
frequency of the router instances varies greatly due to different critical path lengths.
Also, with increasing number of ports participating in the circuit-switched layer, the
routing resources deplete rapidly (due to increased channel widths). This degradation in
performance in turn affects the bandwidth the switch can offer. Figure 5.5 presents the
variation in switch operating frequency with the number of ports in both layers. The
above area and frequency estimates are obtained by varying the parameters in the VHDL
model of the router and by implementing them on the target device.
Figure 5.5: Design Parameters Vs Frequency (axes: # Circuit s/w Ports, # Packet s/w Ports, Max. Frequency (MHz))
Table 5.1: Scaling of Area and Frequency with No. of C-Layer Ports
MoClib Component   Area (Slices)   Frequency (MHz)
MC (4,2,2)         314             336
MC (5,3,2)         326             318
MC (5,2,3)         341             303
MC (6,3,3)         394             240
MC (6,2,4)         382             258
MC (7,3,4)         440             221
Furthermore, to perform automatic topology synthesis, we estimate the increase/decrease
in switch area with exclusive variations in the number of P-Layer ports and C-Layer ports
independently. When NoC area is in the cost function, the above data will aid rapid
design space exploration. Tables 5.1 and 5.2 present the scaling of area & frequency
with increasing C-Layer and P-Layer ports respectively. In the tables, MC(x,y,z) denotes
an instance of the MoClib library, where y is the total number of C-Layer ports, z is
the total number of P-Layer ports and x is the sum of the two (total number of ports).
Table 5.2 presents the scaling of area and frequency only with respect to the P-Layer
ports, and therefore its entries can be considered as variations of the MoCReS baseline router.
Table 5.2: Scaling of Area and Frequency with No. of P-Layer Ports
MoClib Component   Area (Slices)   Frequency (MHz)
MC (3,0,3)         296             378
MC (4,0,4)         318             362
MC (5,0,5)         349             324
MC (6,0,6)         390             296
MC (7,0,7)         435             267
MC (8,0,8)         493             229
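To show how such a table could feed a cost function during design space exploration, the
sketch below computes the area added by each extra P-Layer port. The area and frequency
values are copied from Table 5.2; the dictionary layout and the function name
area_increments are assumptions made for this illustration, not part of MoClib.

```python
# (area in slices, frequency in MHz) keyed by P-Layer port count, from Table 5.2.
TABLE_5_2 = {
    3: (296, 378),
    4: (318, 362),
    5: (349, 324),
    6: (390, 296),
    7: (435, 267),
    8: (493, 229),
}


def area_increments(table):
    """Slices added when going from (p - 1) to p P-Layer ports, for each p."""
    ports = sorted(table)
    return {p: table[p][0] - table[q][0] for q, p in zip(ports, ports[1:])}


increments = area_increments(TABLE_5_2)
```

A topology synthesis tool could use such per-port increments to estimate the area of a
candidate router without re-synthesizing it.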
5.8 Results: Performance Improvement
Area vs Average Available Bandwidth/Port: The baseline version in this comparison
is MoCReS with 1VC+MC. The area (in slices) of the switch increases with the
number of ports it supports. We measure the area values for increasing number of ports
(packet-switched) in the baseline version. For similar area values, the hybrid router
offers higher available bandwidth per port. This section compares that bandwidth
increase with the baseline approach. For equivalent area overheads (in slices) on a
similar FPGA, Figure 5.6 presents the bandwidth capacity (in MB/s) of the NoC (per port)
for both approaches. In spite of a rapid degradation in operating frequency (with increase
in circuit-switched ports), there is a significant bandwidth gain using the hybrid two-layer
approach. For the area window utilized in our library of routers, there is an average
20.4% gain in bandwidth (maximum of 24%) offered by our NoC. This gain in performance
is due to supporting a high throughput circuit-switched layer with a marginal
area overhead.
5.8.1 Design Issues
Even though it appears intuitively that an increase in number of ports in the C-layer
gives performance benefits without any area overhead, there are certain design issues
that can potentially limit the performance due to increase in switch complexity.
Figure 5.6: Area (Slices) Vs Avg. Bandwidth / Port (series: Hybrid Two-Layer Router, Baseline Approach, % BW Gain; x-axis: Area (Slices), 250 to 500; y-axes: Avg. Bandwidth / Port, 300 to 800, and % Gain in BW, 0 to 25)
Operating frequency Vs Switch Complexity: An increase in switch complexity (number
of ports, bus width) depletes critical resources. As a result, the operating frequency of
the switch degrades, which in turn affects the bandwidth offered by the router. For the
NoC paradigm to be an efficient alternative to the bus-based architecture, the design
parameters must be chosen carefully so that the routers can operate at the highest
possible frequency.
Switch Power vs Link Power: By increasing the number of ports, we can reduce
the average hop count [36], i.e., we reduce the number of routers and links. This
translates into a reduction in power consumed by the links, but an increase in power
consumed by the switches. Beyond a cut-off, the increase in switch power can overshadow
the savings in link power and thereby increase the power/flit ratio.
Explosion of Schedule Memory: With increasing number of C-layer ports, the
schedule memory also scales linearly. The schedule memory, expressed in number of
LUTs, is a function of the number of schedule cycles and the number of C-layer ports
present. If C is the number of ports participating in the C-Layer, then
$\lceil \log_2 C \rceil$ is the number of configuration bits required per cycle.
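The configuration-bit count can be evaluated with integer arithmetic, as sketched below.
The function names are hypothetical, and the total computed by schedule_memory_bits
(schedule cycles times bits per cycle) is an assumed simplification; the text only states
that the memory depends on both quantities.

```python
def config_bits_per_cycle(c_ports):
    """ceil(log2(C)) configuration bits for C ports, computed without floats."""
    if c_ports < 1:
        raise ValueError("at least one C-layer port is required")
    # For C >= 1, (C - 1).bit_length() equals ceil(log2(C)).
    return (c_ports - 1).bit_length()


def schedule_memory_bits(c_ports, schedule_cycles):
    """Assumed total: one configuration entry per schedule cycle."""
    return schedule_cycles * config_bits_per_cycle(c_ports)
```

For example, under this assumption a 16-cycle schedule for 4 C-layer ports needs twice as
many bits as the same schedule for 2 ports.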

Clock Signal Integrity: Operation of the C-layer ports requires the participating IP
cores to be synchronous, as there is no buffering, as opposed to the packet-switched layer,
where multi-clock FIFOs separate the clock domains. Increasing the number of C-layer ports
could potentially increase the distance between the connected IP cores. In this case,
signal integrity limits the number of C-layer ports and reduces the clock
rate.
It can be seen that all of the above factors limit the amount of performance gain that
can be achieved using our hybrid approach. This trade-off between performance, area
and port count merits a balance and requires an application-suitable tuning of the NoC
topology. We present an algorithm along with a CAD flow in Chapter 8 to automate
topology synthesis for FPGA based NoCs.
5.9 Conclusions
In this chapter, we present the limitations associated with the MoCReS packet switched
NoC and then design and implement a hybrid two-layer router architecture for FPGA
based NoCs. We functionally verify the design and characterize several versions of the
novel router for area and operating frequency. We also present the bandwidth results
along with the design advantages and issues involved in the proposed architecture.
Chapter 6
Hybrid NoC: Performance and Power Analysis
In this chapter we analyze the novel router architecture presented in Chapter 5 for
performance and power. Our MoCReS router design in Chapter 3 is utilized as the
baseline router in making the performance and power comparisons. We retain the Virtex-
4, XC4VLX100 [1] device as the target for all the comparisons.
6.1 Performance Analysis
The main advantage behind the hybrid approach used in the router architecture is in
offering increased overall throughput. The C-layer connections are pre-scheduled between
IP cores that require high bandwidth. These short distance high bandwidth connections
come with a low resource penalty in FPGAs. In this section, we quantify the average
improvement in performance compared to the baseline approach.
Figure 6.1: Baseline 3 × 2 MoCReS Mesh (IP1 to IP6 attached to MoCReS baseline routers R1 to R6)
6.1.1 Packet Injection Rate Vs Average Latency
In a bus based architecture, performance is measured in terms of overall bus frequency
and number of masters/slaves. As opposed to this, the overall performance of an NoC is
measured in terms of the net traffic it can sustain. Therefore, we design our experiments
by injecting packets of finite flit size at varying rates at each node in the network.
The experimental framework used in this section has been partially adopted from [27].
We have instantiated a 3 × 2 mesh network (Figure 6.1) with the baseline MoCReS
router. Furthermore, six packet injecting modules were wrapped around the mesh
framework. The framework adopted from [27] uses a C++ module which generates an input
file that contains all the packets (input vectors) that the network needs to transport.
Input parameters to the C++ program are the mesh co-ordinates, the number of packets
to be generated and the length of each packet. In every generated packet, the first flit
contains the destination IP X and Y co-ordinates and the packet size (fraction). The VHDL
testbench wrapped around the mesh directly controls the injection rate at all IP input
ports, thereby serving as an abstraction of the cores that inject/receive packets.

The above model is simulated using Modelsim 6.1 [28] along with the input vectors
from the Perl tool. While generating the packets, it is ensured that the source and
destination of a packet are never the same. The rest of the testbench reads the generated
text files and injects each packet into its source port. While doing so, the packet
injection timestamp is also recorded. When the same packet is received at the destination
after a finite number of clock cycles, the testbench records the receive timestamp as well.
Therefore, upon successful completion of simulation, the VHDL testbench creates two files
for every IP: one with the injection timestamp of every packet and the other with the
receive timestamp of every packet arriving at the destination IP.
Figure 6.2: Modified Hybrid Router 2 × 2 Mesh (IP1 to IP6 attached to routers R1, R2, R5 and R6; legend: MoCReS Baseline Router, 2-Layer Hybrid Router)

Using the above data, the number of cycles elapsed for each packet is computed, and
the average over all injected packets is computed for each injection rate. We repeat
the same process for a 2 × 2 mesh network of hybrid routers, in which two routers have
two C-layer ports each, as shown in Figure 6.2.
The experimental flow developed in [27] determines the total execution time, wherein
the largest timestamp of the received packet is reported. We modify the flow to compute
the latencies of every packet injected into the network and finally determine the average
latency of the baseline & modified network for that particular injection rate. We increase
the injection rate in steps of 0.1 flits/node/cycle and observe the increase in average
latency of the two networks.
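The per-packet latency computation described above can be sketched as follows. The
function name average_latency and the dictionary representation of the two timestamp
files (packet id mapped to a cycle count) are assumptions made for this illustration;
the format of the files written by the VHDL testbench is not described here.

```python
def average_latency(injected, received):
    """Mean latency in cycles over packets that have both timestamps.

    injected, received: dicts mapping a packet id to a timestamp in cycles.
    """
    latencies = [received[p] - injected[p] for p in injected if p in received]
    if not latencies:
        raise ValueError("no delivered packets")
    return sum(latencies) / len(latencies)


# Hypothetical timestamps for three packets at one injection rate.
injected = {1: 0, 2: 10, 3: 20}
received = {1: 12, 2: 30, 3: 40}
avg = average_latency(injected, received)
```

Repeating this for each injection rate gives one point per rate on a latency curve.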
Figure 6.3: Packet Statistics: No. Packets Injected from each IP
For both the baseline and hybrid router approaches, we apply a combination of two
traffic scenarios: random traffic and hot spot traffic.
• Random Traffic: The source-destination pairs and the number of packets are com-
pletely random values that follow a uniform distribution. Once generated, the same
traffic is applied for both the baseline MoCReS and hybrid mesh framework.
• Hot Spot Traffic: Along with the above random approach, we deliberately choose
source-destination pairs such that a majority of the transfers occur between one or
two IPs. We perform this task manually to create hot spots in the traffic. This
could be a common scenario in an SoC, where critical components such as memories
account for a significant percentage of overall packet transfers.
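A traffic generator combining the two scenarios above might look like the sketch below.
It is not the generator used in the experiments: hot spots are selected manually in this
work, whereas the sketch biases destinations with a probability hot_fraction. The function
name, its parameters and the seeding are all assumptions for illustration.

```python
import random


def generate_traffic(num_ips, num_packets, hot_spots=(), hot_fraction=0.0, seed=0):
    """Return (source, destination) pairs with source != destination.

    With probability hot_fraction the destination is drawn from hot_spots;
    otherwise it is drawn uniformly from all other IPs.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_packets):
        src = rng.randrange(1, num_ips + 1)
        candidates = []
        if hot_spots and rng.random() < hot_fraction:
            candidates = [ip for ip in hot_spots if ip != src]
        if not candidates:
            # Uniform choice; also the fallback when the source is the hot spot.
            candidates = [ip for ip in range(1, num_ips + 1) if ip != src]
        pairs.append((src, rng.choice(candidates)))
    return pairs


# Six IPs, with IP3 (hypothetically a memory) acting as a hot spot.
pairs = generate_traffic(6, 1000, hot_spots=(3,), hot_fraction=0.5, seed=1)
```

Fixing the seed lets the same packet list be replayed on both the baseline and the hybrid
mesh, in the spirit of applying identical traffic to both frameworks.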
Figure 6.3 presents the source-destination pairs for all the packets injected into the
network. As mentioned before, a packet is never routed back to the source IP. The figure
presents the number of packets sent to each of the five possible destinations for each
source IP (IP1 to IP6).
Figure 6.4: Results: Baseline 3 × 2 MoCReS Vs Hybrid Router Mesh (series: Hybrid Router Mesh, Baseline MoCReS; x-axis: # Flits/Node/Cycle, 0.1 to 1; y-axis: Average Packet Latency, 0 to 1000)
Experimental results are presented in Figure 6.4. As the injection rate is increased from
0.1, the average latency in both cases increases linearly as well. However, it can be seen
that the network saturates for the baseline approach around the 0.65 flits/node/cycle
mark. In the hybrid router approach, the network sustains linear average latency until
0.75 flits/node/cycle, which is a significant improvement in a small mesh network that
consists of 6 cores. The above saturation point will vary considerably with the number
of IPs (dimension of mesh).
We obtain the above improvement in saturation due to two main reasons:
1. Increased bandwidth between two pairs of cores, which reduces the traffic burden
on the network.
2. The reduction in the average number of packet hops due to the reduced number of
routers compared to the baseline approach. The mapping of IPs has a significant impact
on the total number of packet hops. Compared to the baseline case, the total number of
packet hops is reduced by 42.8% in the hybrid mesh due to the routers that support
more than one IP.
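To make the hop-count argument concrete, the sketch below computes the average number of
link hops under XY routing for uniform traffic on a small mesh with one IP per router.
It does not reproduce the 42.8% reduction, which depends on the actual IP mapping and the
injected traffic; counting hops as the Manhattan distance between routers, and the
function names, are assumptions made for this illustration.

```python
from itertools import product


def xy_hops(src, dst):
    """Link hops under XY routing: travel along X first, then along Y."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def average_hops(cols, rows):
    """Average hops over all ordered router pairs with distinct endpoints."""
    nodes = list(product(range(cols), range(rows)))
    pairs = [(s, d) for s in nodes for d in nodes if s != d]
    return sum(xy_hops(s, d) for s, d in pairs) / len(pairs)


baseline = average_hops(3, 2)  # 3 x 2 mesh, one IP per router
smaller = average_hops(2, 2)   # 2 x 2 mesh of routers
```

In a hybrid mesh, IPs that share a router need no link hops at all, which would lower the
average further than the router-to-router figure computed here.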
In the above analysis we have simplified the latency analysis by neglecting the zero
load setup time for a C-Layer connection. Our hybrid router incurs a 2 clock cycle penalty
for every C-layer schedule memory look up and cross-point connection. Furthermore, in
the above hybrid router mesh, at most two ports participate in the C-Layer. In
case of routers that have several C-layer ports, with connections between C-layer ports
changing more frequently, this penalty needs to be included while comparing with the
baseline mesh.
6.2 Power Analysis
This section presents the detailed power trade-offs involved in the proposed router ar-
chitecture. We first present the amount of total dynamic power consumed in our hybrid
router along with the power breakdown across its components. Further, we also present
how the above power metric scales with the number of C- and P-Layer ports present
in a router. Furthermore, power dissipated in our hybrid router mesh is compared with
the baseline router. Before determining the above power numbers, we floorplan the NoC
in our target FPGA and then place and route the design to obtain accurate resource
utilization estimates.
6.2.1 Power Breakdown Results
For this analysis, we consider an 8 port hybrid router (4 directional + 4 local) with
3 IPs participating in the C-Layer. The C-Layer components are the Schedule FSM,
bRAM Schedule memory and the C-Layer CPM (C-CPM). We apply random inputs
to the router and determine the switching activity of the placed and routed model.
Table 6.1: Input Flits: Transition Activities
Bit Position              0      1      2      3      4      5      6      7
Transition Activity (%)   50.02  49.58  50.94  50.12  49.24  49.28  51.36  48.82
Bit Position              8      9      10     11     12     13     14     15
Transition Activity (%)   51.06  51.02  50.50  49.62  50.22  50.28  50.12  49.58
Figure 6.5: Dynamic Power Breakdown of an 8-port Hybrid Router (Total Dynamic Power: 134.87 mW @ 200 MHz)
Using XPower [1], we obtain dynamic and static power estimates for our router. To
obtain a component-based power estimate, we adopt the power profiler methodology
(with incremental synthesis) used in [33]. We apply 5000 flits with random requests
to the hybrid router and capture the applied transition activity in Table 6.1. The net
dynamic power consumed by this router is 134.87 mW at an operating frequency of 200 MHz.
Figure 6.5 presents the power breakdown of the 8 port hybrid router.
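The per-bit transition activity reported in Table 6.1 can be understood with a small
sketch. It assumes that transition activity means the percentage of consecutive flit
pairs in which a bit position toggles, and it uses Python's pseudo-random generator
rather than the actual input vectors, so the numbers it produces will differ from the
table while staying close to 50% for uniformly random 16-bit flits.

```python
import random


def transition_activity(flits, width=16):
    """Percentage of consecutive flit pairs in which each bit position toggles."""
    toggles = [0] * width
    for prev, cur in zip(flits, flits[1:]):
        diff = prev ^ cur  # a set bit marks a position that changed
        for bit in range(width):
            toggles[bit] += (diff >> bit) & 1
    pairs = len(flits) - 1
    return [100.0 * t / pairs for t in toggles]


# 5000 random 16-bit flits, matching the number of flits applied in the text.
rng = random.Random(0)
flits = [rng.getrandbits(16) for _ in range(5000)]
activity = transition_activity(flits)
```

An activity near 50% on every bit indicates that the stimulus exercises all data lines
evenly, which is what a random-input power estimate relies on.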
The power consumed/packet data for each router varies based on its complexity.
The complexity of the router translates into the amount of logic and routing resources
(switched capacitance) consumed by it. Figure 6.6 presents the scaling of dynamic power
(mW) with increasing P-Layer ports. With increasing P-Layer ports, the Cross-Point
and Central Arbiter scale linearly in terms of amount of logic utilized, thereby increasing
the dynamic power linearly.
Figure 6.6: P-Layer Ports (Switch Size) Vs Dynamic Power (mW) (x-axis: P-Layer Ports, 2 to 10; y-axis: Dynamic Power (mW @ 200 MHz), 40 to 180)
With increasing C-Layer connections, the amount of long interconnects utilized within
the router increases, thereby increasing the amount of switched capacitance. For
equivalent routers (in terms of number of ports), a higher number of C-Layer connections
can increase the dynamic power consumed by up to 18%. Figure 6.7 presents dynamic power
scaling in three cases with varying number of directional ports. In this analysis we fixed
the size of the schedule memory at 16 words with 8 bits/word. We have configured the
schedule FSM to take as input the source IP (2 bits), destination IP (2 bits) and the
number of clock cycles (4 bits) in binary representation, thereby reducing the number of
words required to store a schedule. However, as the number of ports or configurations
increases, the schedule memory needs to expand. In that case the power fraction
contributed by the schedule memory will increase.
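The 8-bit schedule word described above (2-bit source IP, 2-bit destination IP, 4-bit
cycle count) can be packed as in the following sketch. The field order within the word
and the function names are assumptions; only the field widths come from the text.

```python
def pack_schedule_word(src_ip, dst_ip, cycles):
    """Pack one schedule entry into 8 bits: [src:2][dst:2][cycles:4] (assumed order)."""
    if not (0 <= src_ip < 4 and 0 <= dst_ip < 4 and 0 <= cycles < 16):
        raise ValueError("field out of range")
    return (src_ip << 6) | (dst_ip << 4) | cycles


def unpack_schedule_word(word):
    """Recover (src_ip, dst_ip, cycles) from an 8-bit schedule word."""
    return (word >> 6) & 0x3, (word >> 4) & 0x3, word & 0xF


# A 16-word schedule memory, initially empty, with one hypothetical entry.
schedule = [0] * 16
schedule[0] = pack_schedule_word(2, 1, 9)
```

Storing the duration in the word is what keeps the schedule short: one entry can hold a
connection for up to 15 cycles instead of repeating the same configuration per cycle.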
6.2.2 Switch vs Link Power
The two components of power consumed in an NoC are the switch power and link power.
The switch power is determined by the cross point size (number of ports) and the
arbitration and routing logic used. Fewer ports per router imply reduced switch complexity
and an increased number of links. In our router designs, we have employed simple XY
routing with round robin arbitration. This leads to a marginal scaling in switch power
with router ports. Further, with increasing interconnect lengths (cores placed farther
apart), the link power begins to dominate the overall power consumed in the NoC. In our
hybrid router architecture, we use C-layer connections only for IP cores placed close to
each other. This incurs a marginal penalty in link power compared to long bus based
connections. We present the impact of floorplanning on power and performance of the
on-chip network in greater depth in Section 9.4.
Figure 6.7: C-Layer Ports (Switch Size) Vs Dynamic Power (mW) @ 200 MHz (series: 2, 3 and 4 C-Layer Ports; x-axis: 2, 3 and 4 Directional Ports + 4 Local IPs; y-axis: Dynamic Power (mW) at 200 MHz, 100 to 135)
Figure 6.8: Power (mW): Baseline MoCReS Vs Hybrid Router Mesh (series: Baseline MoCReS Mesh, Hybrid Router Mesh; categories: Dynamic Power @ 200 MHz, Static (Quiescent) Power; y-axis: Power (mW), 150 to 550)
Figure 6.8 presents a comparison between the total power consumed in the hybrid
two-layer router mesh and the baseline version. Both NoCs support 6 IPs, using the
network topologies shown in Figures 6.1 and 6.2. Due to the reduced number of
interconnects and logic resources in the hybrid router mesh, there is a reduction of
about 15.38% in dynamic power and 10.1% in static power consumed.
Chapter 7
Experimental Platform
In this chapter we present the experimental platform developed to evaluate the approach
presented in this thesis. First, a description of the multi-processor benchmarks
is presented. This is followed by an enumeration of the phases in router design and
implementation on our target FPGA.
7.1 Multi-Processor Benchmarks
To determine the impact of our approach on typical designs, we utilize a set of real and
synthetic benchmarks that are widely used in NoC studies [36] [38]. The benchmarks are
represented by task graphs. We use the following graph model in which communication
between IP cores in a design can be abstracted using a Core Communication Graph
G(V, E), where,
• Vi denotes the vertices (cores),
• E(i, j) and BW(i, j) denote the directional edges (communication pattern) and the
corresponding bandwidth requirement between IP cores i and j.
These task graphs can represent many real applications, including MPEG, DFT etc.
They can also be easily generated for a comprehensive study of the proposed approach.
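A Core Communication Graph of this form can be held in a simple dictionary, as the sketch
below shows. The edge list is hypothetical (it is not one of the benchmarks), and the
function name graph_summary and its output keys are assumptions; the summary mirrors the
kind of properties used to describe such graphs: |V|, |E|, maximum in/out-degree and
maximum/minimum bandwidth. Cores without any edge are not represented.

```python
from collections import Counter


def graph_summary(edges):
    """Summarize a graph given as {(src_core, dst_core): bandwidth in MB/s}."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    cores = set(in_deg) | set(out_deg)
    return {
        "V": len(cores),
        "E": len(edges),
        "max_in": max(in_deg.values()),
        "max_out": max(out_deg.values()),
        "bw_max": max(edges.values()),
        "bw_min": min(edges.values()),
    }


# Hypothetical four-core graph with directed, bandwidth-annotated edges.
edges = {("a", "b"): 100, ("a", "c"): 40, ("b", "c"): 250, ("c", "d"): 16}
summary = graph_summary(edges)
```

Keeping BW(i, j) on the directed edge key makes it easy to look up the requirement for
any ordered pair of cores when evaluating a candidate topology.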
There are two main kinds of benchmarks used in our research:
1. Application Benchmarks: The four application benchmarks [39] used are Fast
Fourier Transform (FFT), Moving Picture Experts Group (MPEG4) encoder,
Video Object Plane Decoder (VOPD) and Multi-Window Display (MWD). The benchmarks
are provided as task graphs where vertices in the graph represent computation
elements and directional edges represent communication patterns (precedence). This
class of applications permits traffic characterization early in the design cycle. This
information is used to fine-tune the NoC topology. Each of the edges is annotated with
the required bandwidth in megabytes/second (MB/s).
2. Synthetic Benchmarks: In addition to these real application benchmarks, we obtain
a rich set of synthetic benchmarks that were generated using Task Graphs For Free
(TGFF) [40] in [36]. These benchmark cases span a rich variety of communication
properties, namely, in-degree, out-degree and dependence width, and therefore represent a
wider class of multi-processor applications. For these synthetic benchmarks, we randomly
generate bandwidth requirements that follow a uniform distribution and use them to
annotate the edges.

Table 7.1 summarizes the properties of the application and synthetic benchmarks used
in this research. The table presents the number of cores (|V|), number of edges (|E|),
maximum in/out-degree, and maximum/minimum bandwidth required. The degree of the nodes
represents the packed nature of the benchmarks. The maximum bandwidth requirement denotes
the opportunities for clustering (violation), while the minimum bandwidth requirement
represents the potential to reduce area (by increasing the number of packet switched
ports). The clustering and area reduction approaches are discussed in Chapter 8.
Table 7.1: Application and Synthetic Benchmarks
Benchmark          |V|  |E|  In-degree  Out-degree  Max BW (MB/s)  Min BW (MB/s)
Application
MPEG4              12   26   7          7           910            1
FFT                15   22   2          2           473            26
VOPD               12   14   2          2           500            16
MWD                12   13   2          2           128            64
LU Decomposition   9    11   2          3           510            76
Laplace Solver     9    12   2          2           378            68
Synthetic
Basic-1            9    8    1          4           196            34
Parallel-1         9    14   3          4           225            47
Packed-1           9    16   3          5           334            59
Packed-2           9    15   3          4           412            106
7.2 Xilinx ISE Design Flow
For the router design and implementation, we follow the Xilinx ISE [1] flow. The various
versions of the router are modeled in structural VHDL. We incorporate the design
optimizations using the VHDL models. The optimized model is then synthesized, mapped,
placed and routed using Xilinx ISE. The design is targeted for an XC4VLX100-11 on
a Nallatech [3] BenDATA™ development board. Figure 7.1 presents a Virtex-4 [1]
platform FPGA on a BenDATA [3] development board. Xilinx XST [1], which is a part
of ISE 8.2i, is used to synthesize the VHDL models. Xilinx LogiCORE FIFO Generator
v2.3 is used to generate common clock/independent clock FIFO buffers. Functional
simulation of the router and mesh versions is performed using Modelsim 6.3c [28].
7.2.1 FPGA Resource Characterization
An NoC router implemented on an FPGA consumes a share of the existing logic and routing
resources. Placement and routing constraints and switch complexity dictate the operating
frequency of the router. This variation in frequency is due to its implementation on a
variety of available resources within the constrained area. In an attempt to characterize
this variation in router properties with the availability of resources, we develop an
experimental platform.
Figure 7.1: Nallatech [3] Bendata-V4 Platform FPGA
XDL Flow: Xilinx Design Language (XDL) is a text format into which the native circuit
description (.ncd) of the placed and routed design can be converted. The XDL utility,
which is a part of the ISE suite, is used in our research to convert the .ncd file of the
router design to .xdl (using the xdl -ncd2xdl command). Later, this .xdl file is parsed
to extract the resource information. The XDL file contains two parts:
1. Resource Instances
2. Net Description
Resource Instances: This section of the XDL file contains instances of the components
utilized in the target device, including LUTs, bRAMs, and embedded blocks if any.
Net Description: This part consists of a detailed description of every net instance in
the design, as implemented on the FPGA. The description includes the source pin,
destination pin(s), fan-out, and routing resources utilized by the net. Figure 7.2
illustrates portions of a sample XDL file for our router.
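The kind of parsing involved can be sketched in a few lines. The example below counts
nets and classifies the wire driven by each pip line of an XDL fragment. It is not the
utility developed in this research: the wire-name patterns used to recognize hex, double,
long and single (IMUX/OMUX) resources are assumptions about Virtex-4 naming, and the
function names are hypothetical.

```python
import re
from collections import Counter

# Assumed wire-name patterns for Virtex-4 routing resource classes.
WIRE_CLASSES = [
    ("hex", re.compile(r"^[NSEW]6(BEG|MID|END)")),
    ("double", re.compile(r"^[NSEW]2(BEG|MID|END)")),
    ("long", re.compile(r"^L[HV]")),
    ("single", re.compile(r"^(OMUX|IMUX)")),
]


def classify(wire):
    """Map a wire name to a resource class, or 'other' if no pattern matches."""
    for name, pattern in WIRE_CLASSES:
        if pattern.match(wire):
            return name
    return "other"


def summarize_xdl(text):
    """Return (number of nets, Counter of resource classes driven by pips)."""
    nets, usage = 0, Counter()
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "net":
            nets += 1
        elif parts[0] == "pip" and len(parts) >= 5:
            # pip <tile> <source wire> <arrow> <destination wire>
            usage[classify(parts[4].rstrip(","))] += 1
    return nets, usage


SAMPLE = """
net "E_channel_data_in_3_IBUF" ,
pip INT_X0Y134 BEST_LOGIC_OUTS0 > E6BEG8
pip INT_X10Y132 S2END8 > IMUXB19
pip INT_X10Y134 W2END6 > S2BEG8
pip INT_X12Y134 E6END6 > W2BEG6
pip INT_X6Y134 E6END8 > E6BEG6
"""
nets, usage = summarize_xdl(SAMPLE)
```

Counting pips by the wire they drive is only a proxy for resource usage; a production
parser would also need to handle the instance section and multi-line statements.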
Programmable routing resources in FPGA consume a significant portion of circuit
power & delay. An efficient NoC implementation will take into account these factors
of the underlying FPGA architecture. Routing resources in our target Virtex-4 FPGA
can be divided into: 1) Single (OMUX/IMUX) Lines, 2) Double Lines, 3) Hex lines, 4)
Long Lines, 5) Clock Tree, and 6) Programmable Interconnect Points (PIPs). Figure 7.3
illustrates the main types of routing resources in the target FPGA device.
The interconnect distance and switched capacitance of the above resources vary
significantly, thereby accounting for a range of delay and power values. We estimate the
delay (ns) and power (mW) overheads involved in utilizing these resources in the design.
Table 7.2 summarizes the performance and power results of the interconnects. The
performance estimates are produced by re-routing a specific net across various resources
using Xilinx FPGA Editor [1]. For power measurements, the operating frequency is set at
200 MHz with an operating voltage of 1.2 V corresponding to the programmable interconnects
in XC4VLX100. The XPower utility is used to determine the dynamic power contributed by
the particular net that is re-routed.
design "VCRouter_top" xc4vlx100ff1513-11 v3.1 ,
cfg "
_DESIGN_PROP::BUS_INFO:8:INPUT:E_channel_data_in<7:0>
_DESIGN_PROP::BUS_INFO:8:INPUT:L_channel_data_in<7:0>
inst "E_channel_data_in<0>" "IOB" , placed IOIS_NC_LX0Y142 H37 ,
cfg " DIFFI_INUSED::#OFF DIFF_TERM::#OFF
PULL::#OFF SLEW::#OFF INBUF:E_channel_data_in_0_IBUF:
PAD:E_channel_data_in<0>: "
;
net "E_channel_data_in_3_IBUF" ,
outpin "E_channel_data_in<3>" ,
inpin "input_E/fifoA/BU2/U0/BU7"
,
pip BRAM_X10Y132 IMUX_B19_INT0 > RAMB16_DIA3
pip INT_X0Y134 BEST_LOGIC_OUTS0 > E6BEG8
pip INT_X10Y132 S2END8 > IMUXB19
pip INT_X10Y134 W2END6 > S2BEG8
pip INT_X12Y134 E6END6 > W2BEG6
pip INT_X6Y134 E6END8 > E6BEG6
pip IOIS_NC_L_X0Y134 IOIS_IO > BEST_LOGIC_OUTS0_INT ,
pip IOIS_NC_L_X0Y134 IOIS_IBUF0 > IOIS_IBUF_PINWIRE0 ,
pip IOIS_NC_L_X0Y134 IOIS_IBUF_PINWIRE0 > IOIS_I0 ,
;
Figure 7.2: Sample XDL Description
Figure 7.3: Routing Resources: XC4VLX100 (Single Line, Double Line, Hex Line, Horizontal Long Line)
Table 7.2: Performance and Power Estimates of Routing Resources in XC4VLX100
Routing Resource   Delay (ns)   Dynamic Power (mW @ 200 MHz)
IMUX/OMUX          0.335        0.23
Double             0.406        0.27
Hex                0.863        0.33
Long               2.507        4.84
Table 7.3: MoCReS: FPGA Resource Utilization
Router Version    Nets  Hex  Double  Long  Single  Slices  Max. Freq (MHz)
MoCReS (1VC+MC)   902   439  1402    5     5633    282     303
As a part of this research, we develop a Perl utility to parse the XDL file to obtain
the number of nets and the types of routing/logic resources utilized in the design. This
information is presented in Table 7.3. This version of MoCReS consumed 282 Virtex-4
slices and operated at 303 MHz. Figure 7.4 shows the percentage of each type of routing
resource used by the design. It can be seen that the direct lines (IMUX/OMUX) contribute
75% of the total routing resources utilized by the design. The above design is tightly
constrained in terms of area, and therefore the utilization of longer lines is low.
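The utilization percentages follow directly from the routing resource counts in
Table 7.3, as the short sketch below shows. The counts are copied from the table; the
dictionary and the function name utilization_percent are introduced here for
illustration only.

```python
# Routing resource counts for MoCReS (1VC+MC), copied from Table 7.3.
ROUTING_COUNTS = {"Hex": 439, "Double": 1402, "Long": 5, "Single": 5633}


def utilization_percent(counts):
    """Share of each routing resource type as a percentage of the total."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


percent = utilization_percent(ROUTING_COUNTS)
```

Rounded to whole numbers, the shares agree with the 75% figure quoted for the direct
(IMUX/OMUX) lines.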
Based on the placement and routing constraints present in the FPGA and the switch
complexity (number of ports, etc.), there will be a varied number of configurations for
every router (based on the available FPGA resources). We consider three such
configurations for a 5-port MoCReS router and present them below. We used the Xilinx PACE
tool [1] to floorplan and set the area constraints on the router design for every
configuration.
Figure 7.4: MoCReS Routing Resource Utilization (Single 75%, Double 19%, Hex 6%, Long <1%)
Configuration A: This router is closely packed with a square shape and with a max-
imum area constraint on it. Further, the FIFO bRAMs are placed close to the switch.
The above constraints rapidly reduce the utilization of high capacitance routing resources
and achieve a router design with maximum operating frequency (303 MHz). However,
traditional CAD tools avoid such a configuration due to excessive depletion of routing
resources in the constrained area. This leads to a low performance in the user logic that
surrounds the router. Figure 7.5 presents this heavily constrained configuration.
Configuration B: In some cases, user logic (IPs) might be prioritized for CLBs present
near bRAMs. In such cases, the network component (router) must be implemented
farther from the bRAM FIFOs. As a result, there is an increase in critical path in the
router design leading to lower operating frequency (286 MHz in this case). Figure 7.5
also presents this configuration of the router.
Configuration C: Finally, we capture the effect of inter-router distances in an FPGA
based NoC by means of this configuration. Due to varied core sizes, it is possible that
the network elements (routers) get placed and routed farther apart, thereby increasing
the delay between the output port of a router and the input buffer of the downstream
router. This increase has an impact on the operating frequency of the upstream
router. In our experiments, the critical path increased by 0.802 ns when the
inter-router distance increased by 6 CLBs. This increase in critical path lowers the
operating frequency of the router.
Figure 7.5: Router Configurations (panels: Configuration A, Configuration B)
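To give a feel for the size of this effect, the following minimal sketch converts an increase in critical path delay into a clock frequency. The starting point of 303 MHz (Configuration A) is an assumption made only for illustration; the text does not state which configuration the 0.802 ns measurement was taken from, and the function name is hypothetical.

```python
def degraded_frequency_mhz(base_freq_mhz: float, extra_delay_ns: float) -> float:
    """Clock frequency (MHz) after the critical path grows by extra_delay_ns."""
    base_period_ns = 1000.0 / base_freq_mhz  # period in ns for a frequency in MHz
    return 1000.0 / (base_period_ns + extra_delay_ns)


# Assumption: the upstream router starts from the 303 MHz of Configuration A.
BASE_FREQ_MHZ = 303.0
EXTRA_DELAY_NS = 0.802  # reported increase for 6 additional CLBs of distance

new_freq_mhz = degraded_frequency_mhz(BASE_FREQ_MHZ, EXTRA_DELAY_NS)
print(f"{new_freq_mhz:.1f} MHz")
```

Under that assumption the upstream router would drop to roughly 244 MHz, which illustrates why inter-router distance matters as much as the router's internal layout.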
7.3 NoC Design Flow
The NoC topology synthesis tool presented in Chapter 8 is implemented primarily in
C++ and is supported by Perl scripts. Perl is used for benchmark processing
and mesh topology generation, while the C++ tool executes the computationally intensive
operations. Data structures implemented in C++ use the Standard Template Library
(STL). The synthesis algorithm is executed on an AMD Opteron processor
running Linux, operating at 2.4 GHz with 3 GB of RAM.
Figure 7.6 presents the experimental design flow used for NoC topology synthesis.
7.3.1 Bandwidth Requirement vs. Number of Flits
Edges in the task graph represent communication requirements between cores. There are
two ways in which this communication can be abstracted: as a bandwidth requirement in
MB/s or as an injection rate in flits/cycle/node. The choice of model has a significant
impact on the accuracy and simulation time of performance estimation.
Table 7.4 presents a qualitative comparison between the two approaches. We model
the communication traffic between the IP cores as a bandwidth requirement, expressed in
MB/s, because the model is simple and executes quickly, which allows us
to perform many experiments.
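The two abstractions are related by the flit width and the clock frequency. The sketch below shows one way to convert between them; the flit width, the clock frequency and the convention 1 MB = 10^6 bytes are illustrative assumptions, not values taken from this work.

```python
def injection_rate(bandwidth_mb_s: float, flit_width_bits: int, clock_mhz: float) -> float:
    """Convert a bandwidth requirement in MB/s to flits/cycle for one node."""
    bytes_per_flit = flit_width_bits / 8
    flits_per_second = bandwidth_mb_s * 1e6 / bytes_per_flit
    return flits_per_second / (clock_mhz * 1e6)


def bandwidth_mb_s(rate_flits_per_cycle: float, flit_width_bits: int, clock_mhz: float) -> float:
    """Inverse conversion: flits/cycle back to MB/s."""
    return rate_flits_per_cycle * (flit_width_bits / 8) * clock_mhz


# Hypothetical example: 500 MB/s over 32-bit flits at 250 MHz.
rate = injection_rate(500, flit_width_bits=32, clock_mhz=250)
print(rate)
```

With these assumed numbers the node would inject one flit every two cycles. The bandwidth form needs only this kind of arithmetic, whereas evaluating an injection rate model requires event simulation, as Table 7.4 notes.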
Figure 7.6: Experimental Flow for NoC Topology Synthesis (blocks: Hardware Description, Functional Simulation, Topology Synthesis Framework, ISE Design Flow, Platform FPGA)
Table 7.4: Comparison between Communication Abstractions

Comparison Metric                        Bandwidth (MB/s)   Injection Rate
Cost Function Simplicity                 Simple             Requires Event Simulation
Estimation for Hybrid Two-Layer Design   Simple             Moderately Complex
Accuracy                                 Moderate           Very Accurate
Power Estimation                         Less Accurate      More Accurate
Chapter 8
FPGA Based NoC : CAD Flow
8.1 Introduction
Our router architecture supports a host of design parameters, including channel link
width (b), number of ports (P_i), number of virtual channels per port (v) and number of
local cores participating in the circuit-switched layer (L_i). Certain application domains
provide static communication traces early in the design cycle, i.e., these classes of
applications permit traffic characterization at an early stage. We utilize this information to
customize the NoC topology overlaid on FPGAs. Due to the large number of parameters,
it is impractical to instantiate and hand tune the communication topology
for the needs of every application.

This chapter discusses related work and the CAD flow that we have developed to
automate NoC framework design for FPGA based multi-processor designs. Figure 8.1
presents the NoC design dependency cycle that exists between various design parameters.
In our approach, we break this cycle by manipulating the number of router instances and
number of ports in each router. This chapter also presents the various phases of NoC
topology design in detail.
Figure 8.1: NoC Design Dependency Showing Our Approach (cycle nodes: # C-Layer Ports, # P-Layer Ports, # Routers, Average Hop Count, Available BW in MB/s, Max Operating Freq. f in MHz)
8.2 Related Work
While ASIC implementations have a well-developed CAD flow for NoC design [41] [42],
there is no automated methodology for FPGAs that takes into account the features of
the underlying architecture. Moreover, the limited resources, higher power
consumption, and increasing heterogeneity of FPGA devices complicate the design flow.
The research in [5] is the first work to address automated design for FPGAs. Its
underlying NoC model enables fast performance verification but is less suitable for supporting
high performance NoCs in FPGAs than the characterized MoClib NoC library
that we have developed in this research.
To the best of our knowledge, this is the first work to propose an FPGA-suitable
hybrid router architecture integrated with an automatic topology synthesis framework
that satisfies the bandwidth requirements of an application while optimizing its area
overhead.
8.3 Problem Formulation
Given a task graph $G(V, E)$, where each $v_i \in V$ represents an IP core and each directed
edge $e_{ij} = \{v_i, v_j\} \in E$ denotes a communication edge with a bandwidth weight function
$b_{ij} : e_{ij} \to \mathbb{R}$, find a mapping $G(V, E) \to C(V, E)$, where $C$ represents a mesh
topology graph, such that, $\forall i, j \in V$, the available topology bandwidth meets the required
bandwidth $\sum_{\forall e_{ij} \in E} b_{ij}$ with minimum NoC area, $\sum_{\forall i \in C} \Re_i$.
8.4 Topology Synthesis Flow
In this section, we present the four phases of our flow, followed by a detailed description
of the synthesis algorithm, its functioning and its complexity.
The input to the algorithm consists of a core communication graph (G), annotated
with the bandwidth requirements between modules. Further, the design space parameters,
namely, the maximum bandwidth supported in a packet-switched link (critical
bandwidth b_c) and the available FPGA area (A_av), along with area and performance models of our
router architecture, are also provided as input. Figure 8.2 shows our topology synthesis
framework. Our algorithm supports four main operations in the phases listed below:
1. Clustering
2. Mesh Generation
3. Candidate Topology Selection
4. Area Optimization
The above four phases are presented in detail with graphical examples.
Figure 8.2: Topology Synthesis Framework (flowchart: core task graph G(V,E) and MoClib area/performance models as inputs; Clustering; Mesh Topology Generation; Exhaustive IP Mapping with XY Routing; Link Capacity and Required BW Estimation; bandwidth check; Optimize NoC Area; output NoC topology)

Figure 8.3: Clustering Phase of Topology Synthesis (example task graph G and clustered graph G’, BW_critical = 500 MB/s)
8.4.1 Clustering
During the clustering phase, the edges in the input task graph (G) whose required
bandwidth exceeds the critical bandwidth (b_c) are identified. The packet-switched NoC
framework does not have sufficient link capacity to support these communications. Therefore,
we utilize the hybrid router architecture, which provides enough bandwidth between
specific cores. The cores whose bandwidth requirements exceed the available inter-router
capacity are grouped to form clusters of multiple IPs connected via the C-layer of
a single router. Upon completion, this phase outputs the clustered core graph (G’), the
upper bound (U’) on the number of routers and the information about the types of
network components chosen from the MoCReS Library. In the new graph G’, the
upper bound on the number of routers is U’ = |V|, the number of vertices in G’.
Figure 8.3 demonstrates clustering,
where, by incrementing the C-layer ports for two routers, we eliminate the bandwidth
violations. However, this approach comes with a penalty, as the high bus width
circuit-switched layer depletes the available FPGA resources. This resource depletion also
reduces the operating frequency of the router. Therefore, cores must be clustered
judiciously to avoid degrading overall performance. Table 8.1 presents the clustering results
from this phase for the chosen benchmarks.
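The clustering step can be sketched as follows. This is a minimal illustration using a union-find structure, not the implementation used in this work; the function name, the example graph and its bandwidth values are hypothetical, and the sketch ignores both the limit on C-layer ports per router and any new violation created when parallel inter-cluster edges are summed.

```python
def cluster_cores(vertices, edges, b_crit):
    """Group cores joined by an edge whose bandwidth exceeds b_crit.

    edges maps a directed pair (u, v) to its bandwidth in MB/s.
    Returns (clusters, clustered_edges).
    """
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for (u, v), bw in edges.items():
        if bw > b_crit:  # bandwidth violation: serve this edge on the C-layer
            parent[find(u)] = find(v)

    members = {}
    for v in vertices:
        members.setdefault(find(v), []).append(v)
    label = {v: tuple(members[find(v)]) for v in vertices}

    # Edges inside a cluster disappear; edges between clusters are accumulated.
    clustered_edges = {}
    for (u, v), bw in edges.items():
        if label[u] != label[v]:
            key = (label[u], label[v])
            clustered_edges[key] = clustered_edges.get(key, 0) + bw
    return sorted(members.values()), clustered_edges


# Hypothetical six-core task graph with a critical bandwidth of 500 MB/s.
VERTICES = [1, 2, 3, 4, 5, 6]
EDGES = {(1, 2): 250, (1, 3): 650, (2, 5): 600, (2, 4): 100,
         (3, 6): 200, (4, 5): 150, (5, 6): 150}

clusters, clustered_edges = cluster_cores(VERTICES, EDGES, b_crit=500)
upper_bound = len(clusters)  # U' in the text
print(clusters, upper_bound)
```

In this example the two violating edges merge cores 1 and 3 and cores 2 and 5, so the router upper bound drops from 6 to 4.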
Table 8.1: Clustering Results for Benchmarks

Benchmark     |V|   U’
MPEG4         12    9
FFT           15    12
VOPD          12    10
MWD           12    12
LU Dec        9     8
Laplace       9     7
Basic-1       9     8
Parallel-1    9     9
Packed-1      9     6
Packed-2      9     7
8.4.2 Mesh Generation
During the Mesh Generation phase, we generate all mesh topologies with U’ routers. Due
to its suitability for FPGAs, we consider only mesh based topologies in this research.
From U’, we determine all its factors and identify all possible mesh topologies. During
this topology generation step, we preserve the clustered nature of the cores output by
the previous phase. Of all these possible meshes, an appropriate topology that satisfies
the bandwidth requirements is later determined in the Candidate Topology Selection
phase.
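Enumerating the meshes from the factors of U’ can be sketched as below. The function name is hypothetical, and counting a shape and its transpose only once (rows <= columns) is an assumption of this sketch.

```python
def mesh_shapes(num_routers: int) -> list[tuple[int, int]]:
    """All (rows, cols) with rows * cols == num_routers and rows <= cols."""
    shapes = []
    rows = 1
    while rows * rows <= num_routers:
        if num_routers % rows == 0:
            shapes.append((rows, num_routers // rows))
        rows += 1
    return shapes


print(mesh_shapes(12))
print(mesh_shapes(7))
```

For U’ = 12 this yields the 1x12, 2x6 and 3x4 meshes, while a prime value such as 7 admits only the 1x7 chain.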
8.4.3 Candidate Topology Selection
This phase performs two operations: exhaustive IP mapping and link
bandwidth estimation.
Exhaustive IP Mapping: Mapping IPs to a mesh built from an unconventional
hybrid router architecture presents a great challenge. During this phase, we iterate through
a large search space by permuting the possible IP mappings exhaustively. We select a
candidate topology if a mapping satisfies the required bandwidth, and
optimize it for area during the last phase. Because the search is exhaustive, it is
guaranteed to find a valid mapping for a particular topology whenever one exists.
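A minimal sketch of such an exhaustive search with early exit is shown below. The function names, the 2x2 example and the adjacency-based validity check are hypothetical stand-ins; in the actual flow the validity check would be the link bandwidth estimation described next.

```python
from itertools import permutations


def first_valid_mapping(clusters, slots, is_valid):
    """Try every assignment of clusters to router slots; return the first valid one."""
    for order in permutations(slots, len(clusters)):
        mapping = dict(zip(clusters, order))
        if is_valid(mapping):
            return mapping  # exit as soon as one mapping satisfies the constraint
    return None  # the whole space was searched without success


def adjacent(a, b):
    """True if two mesh coordinates are one hop apart."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1


# Hypothetical 2x2 mesh and four clusters; the stand-in constraint asks that
# the heavily communicating pairs (A, D) and (B, C) sit on neighbouring routers.
SLOTS = [(x, y) for x in range(2) for y in range(2)]
CLUSTERS = ["A", "B", "C", "D"]


def heavy_pairs_adjacent(m):
    return adjacent(m["A"], m["D"]) and adjacent(m["B"], m["C"])


mapping = first_valid_mapping(CLUSTERS, SLOTS, heavy_pairs_adjacent)
print(mapping)
```

Here the second permutation already satisfies the constraint, so only 2 of the 24 possible mappings are examined, which is the reason typical run times stay well below the factorial worst case.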

Link Bandwidth Estimation: Selecting a candidate topology requires verifying that the
available topology bandwidth meets the bandwidth required for all (source, destination)
pairs in the clustered core graph. Our choice of XY routing simplifies this phase and also
reduces the switch complexity because of its simple logic. The cumulative bandwidth
requirement on each edge (contributed by each source-to-destination route) establishes
the required link capacity constraint. For the MPEG4 application task graph,
Figure 8.4 presents a candidate topology and its cumulative link bandwidth requirement
(cost) in MB/s. Because Task 1 is mapped to router (0,0), the link connecting that
router to (1,0) requires a bandwidth of 1912 MB/s (equal to the sum of the bandwidths between
Task 1 and the rest of the tasks). To select a candidate topology, we conservatively estimate
the available link bandwidth that the communication architecture supports. This is
determined from the link width and the operating frequency of the router. During the
Link Bandwidth Estimation operation, the router models from the MoClib library ($\Re$)
are used to estimate the available link capacities for the router configurations chosen in
the NoC topology by the above phases. This process involves estimating the bandwidth
available between routers operating at different frequencies. For instance, a bulky
router placed on a high-traffic communication path severely degrades the total performance of
the NoC. The candidate topology selection phase incorporates these trade-offs using the
router models.
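The estimation can be illustrated with the sketch below, which accumulates the required bandwidth on every directed link under XY routing and compares it with a link capacity computed as channel width times operating frequency. The placement, the flows, the 32-bit channel and the 303 MHz clock are hypothetical values chosen for illustration, and all function names are assumptions.

```python
def xy_route(src, dst):
    """Directed links (router, next_router) visited by routing along X first, then Y."""
    links = []
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def link_loads(flows, placement):
    """Sum the bandwidth (MB/s) of every flow over each link on its XY route."""
    loads = {}
    for (s, d), bw in flows.items():
        for link in xy_route(placement[s], placement[d]):
            loads[link] = loads.get(link, 0) + bw
    return loads


def available_bw_mb_s(width_bits, freq_mhz):
    """Link BW = channel width x router operating frequency (bytes/cycle x MHz)."""
    return width_bits / 8 * freq_mhz


# Hypothetical placement of three clusters and their bandwidth requirements.
PLACEMENT = {"A": (0, 0), "B": (1, 0), "C": (1, 1)}
FLOWS = {("A", "B"): 400, ("A", "C"): 300, ("B", "C"): 500}

loads = link_loads(FLOWS, PLACEMENT)
capacity = available_bw_mb_s(width_bits=32, freq_mhz=303)
feasible = all(load <= capacity for load in loads.values())
print(loads, capacity, feasible)
```

A full check would use a per-router frequency from the library models rather than a single capacity, since neighbouring routers may run at different clock rates.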
8.4.4 Area Optimization
The primary objective of this algorithm is to synthesize NoC topologies for FPGA
based designs that satisfy the required bandwidth with a minimum area overhead. We
estimate this area overhead in terms of Virtex-4 [1] FPGA slices. The routers contribute
to the area utilization of an NoC. The area utilized by a router varies with its
configuration (number of C-Layer and P-Layer ports, buffering and channel width). For a chosen
topology that has information on the configurations of the routers used, we determine the
total area by looking up the slice count of each router in the MoClib library of
NoC components and summing the results. To summarize, upon determining the candidate
topology that satisfies the bandwidth, we conservatively estimate the area required by
the chosen topology.
Figure 8.4: IP Mapping and Link Bandwidth Estimation (candidate 3x3 NoC topology and the MPEG4 core communication graph G’, |V| = 12, |E| = 13; router (x,y) to IP core(s): (0,0): 1, 2; (1,0): 5; (2,0): 10; (0,1): 3; (1,1): 6, 9; (2,1): 11; (0,2): 4, 7; (1,2): 8; (2,2): 12)
During the Area Optimization phase, the required number of router components (U’)
is decreased iteratively by one. In each iteration, we prune the NoC topology by removing
a router with a single IP connected to it. In order to balance the total number of IPs
with the local ports of the routers, we perform the following in order:
• Increment the # P-layer ports by one for a router configuration.
• Increment the # C-layer ports by one for a router configuration.
Substituting a router configuration with an alternate one that supports an increased
P-Layer count has the following effects:
• Very marginal area increase compared to the gain obtained by removing one router.
• Reduction in router read frequency, leading to a degradation in available bandwidth
across its neighboring edges (possibly introducing bandwidth violations in
the resulting topology).
In the area optimization phase, we first perform the above operation and determine
if the bandwidth requirements are still met. However, if the above step introduces
violations for all combinations of mesh and IP mappings, we substitute the chosen
router with an alternate configuration that supports increased C-Layer ports instead.
Substituting a router configuration with an alternate one that supports an increased
C-Layer count has the following effects:
• Marginal reduction in router read frequency, while introducing high bandwidth IP
connections (which could possibly eliminate the violations seen above).
• Significant increase in routing resource consumption and, more importantly, increase
in area due to supporting additional schedule memory.
Reducing the number of routers in this fashion also lowers the average hop count
of the network, leading to improved execution time. However, our primary objective is
only to ensure that the bandwidth requirements are met with a minimum area NoC. The
new NoC topology is then input back to the Mesh Generation and Candidate Topology
Selection phases to determine (exhaustively) if the bandwidth requirements are met.
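The iteration described above can be sketched as a loop that removes one router at a time, tries an extra P-layer port first and falls back to an extra C-layer port. This is a simplified illustration rather than the actual tool: `try_topology` is a hypothetical callback that stands in for re-running mesh generation and candidate topology selection, the check against the available FPGA area is omitted, and the port limits are invented. The area values in the stub are borrowed from Table 8.3 purely as stand-in return values.

```python
def optimize_area(u_start, p_max_start, p_critical, try_topology):
    """Shrink the router count step by step and keep the smallest feasible area.

    try_topology(u, layer) is a hypothetical callback: it returns the NoC area
    in slices if some mesh and IP mapping with u routers meets the bandwidth
    after adding one port on the given layer ("P" or "C"), and None otherwise.
    """
    best = None
    if u_start <= 1:
        return best
    u, p_max = u_start, p_max_start
    while True:
        u, p_max = u - 1, p_max + 1
        area = try_topology(u, "P")      # first try one more P-layer port
        if area is None:
            area = try_topology(u, "C")  # otherwise one more C-layer port
        if area is not None and (best is None or area < best[1]):
            best = (u, area)
        if u == 1 or p_max > p_critical:  # terminating conditions
            break
    return best


# Stub feasibility data: router count -> area in slices (stand-in values).
STUB_AREAS = {8: 3610, 7: 3370, 6: 3120}


def stub_try_topology(u, layer):
    return STUB_AREAS.get(u) if layer == "P" else None


best = optimize_area(u_start=9, p_max_start=5, p_critical=8,
                     try_topology=stub_try_topology)
print(best)
```

With the stub data the loop stops once the assumed port limit is exceeded and reports the six-router topology as the smallest feasible one.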
8.5 Description of Algorithm
The four phases described in Section 8.4 are presented as pseudocode in
Algorithm 8.1. Input to the algorithm consists of the core task graph, G(V,E),
with |V| cores and |E| edges, annotated with the bandwidth values. The MoClib
component library values and the critical bandwidth (b_c) are also input to the algorithm.
The Clustering phase (lines 1-5) iterates over all the edges to determine the
bandwidth violations. The output of this operation is the clustered core graph, G’(V,E),
and the upper bound on the number of routers (U’). Based on the factors of this upper
bound, all possible mesh topologies are generated (lines 6-7) and passed on to
candidate topology selection (lines 8-13). As mentioned in Section 8.4.3, this operation
can be partitioned into two sub-operations: IP Mapping (lines 8-9) and Link Bandwidth
Estimation (lines 11-13). Finally, optimizing area (lines 15-21) involves decrementing U’
and determining if the new topology satisfies the required bandwidth. The iterations
terminate when U’ = 1 or when P_max, the maximum number of
ports in the routers of the suggested NoC topology, exceeds P_critical (the maximum
number of ports supported by the library).
Algorithm 8.1: Topology Synthesis Design Flow

Require: Core Task Graph G(V,E), with |V| cores, |E| edges/links and each edge e_ij associated with a bandwidth weight
function b_ij : e_ij → R.
Require: A_p/c-ovrhd(i,j): area overhead in replacing a packet/circuit-switched router with i → j ports (MoClib Library),
critical bandwidth b_c, A_av: available FPGA area in slices for the NoC, maximum C-layer ports C_critical.
Ensure: Low area FPGA based NoC topology satisfying the bandwidth
1: for all edges e_ij ∈ E, G(V,E) do
2:   if required bandwidth for edge e_ij : b_ij > b_c then
3:     Clustering: group vertices i, j, update edges, and increase the no. of circuit-switched ports by one, thereby
       removing the bandwidth violation
4:   end if
5: end for
6: Output modified core task graph G’(V,E) such that |V| = U’, the upper bound on the no. of router instances
7: γ: generate all mesh topologies possible with U’ network components
8: for all mesh topologies in γ do
9:   δ: permute all IP mapping combinations of G’(V,E) → C(V,E), where C(V,E) is one instance of the mesh
     topology
10:  for all topology mappings in δ do
11:    Candidate Topology Selection: for each (source, destination) pair (i, j) ∈ V of G’(V,E), estimate the required
       link bandwidth in C(V,E) by applying XY routing
12:    MoClib Library: estimate the bandwidth available in the edge e_ij ∈ C(V,E), Link BW = Channel
       Width (b) × Router Operating Frequency (f)
13:    Determine one valid topology instance such that the available BW meets the required BW b_ij ∀ e_ij ∈ E
14:  end for
15:  repeat
16:    U’ = U’ − 1 and P_max = P_max + 1
17:    Area gain: sum the area of the routers for every instance
18:    if Σ_{i=0}^{N} A_i < A_av, where N is the number of routers then
19:      best.topology ← current.topology
20:    end if
21:  until {U’ = 1 .or. all P_max > P_critical}
22: end for
23: Output best.topology to the ISE design flow
8.6 Complexity Analysis
We analyze the time complexity of our algorithm in this section and present the execution
time results for a set of chosen benchmarks. With respect to the type of computation
performed, the algorithm presented in Section 8.5 can be divided into the following
phases:
1. Clustering
2. Mesh Generation
3. Candidate Topology Selection
4. Area Optimization
The Clustering phase has a time complexity of O(|E|), where |E| is the number
of edges in the task graph. For the design sizes considered in this research, this
phase contributes only a negligible portion of the total execution time. Based on
the determined router upper bound U’, the next phase, Mesh Generation, outputs all
possible mesh configurations. This step is linear in the number
of vertices, |V|, and therefore has a time complexity of O(|V|). During the
Candidate Topology Selection phase, the operations performed are the exhaustive IP
mapping and the link bandwidth estimation. The worst-case time complexity of these two
operations together can be expressed as (O(U’!) + O(|E|)) × (# mesh configurations). Finally,
the Area Optimization phase also has a time complexity of O(|V|). Of the
four phases, Candidate Topology Selection dominates the computational
complexity of the algorithm due to its exhaustive nature. Even though the IP mapping
design space is factorial, we exit the exhaustive search as soon as the first valid
mapping is found, i.e., we do not optimize the CAD flow for performance. As a result,
as shown in Table 8.2 and discussed in Section 8.7.1, the typical execution times are
much lower than the worst-case complexity suggests.
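As a worked instance of the worst-case expression, take U’ = 12. If a mesh shape and its transpose are counted once (an assumption here, although the resulting count coincides with the first row of Table 8.3), there are three shapes: 1x12, 2x6 and 3x4. The mapping term then dominates:

```latex
\[
\#\text{meshes} \times U'! \;=\; 3 \times 12! \;=\; 3 \times 479{,}001{,}600 \;=\; 1{,}437{,}004{,}800
\]
\[
U' = 9:\qquad 2 \times 9! \;=\; 2 \times 362{,}880 \;=\; 725{,}760
\]
```

Removing three routers through clustering therefore shrinks the worst-case search space by more than three orders of magnitude.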
Table 8.2: Algorithm Execution Time

Benchmark     |V|   |E|   Execution Time (minutes)
MPEG4         12    26    12.67
FFT           15    22    35.12
VOPD          12    14    11.26
MWD           12    13    10.75
LU Dec        9     11    4.84
Laplace       9     12    5.42
Basic-1       9     8     4.26
Parallel-1    9     14    5.98
Packed-1      9     16    6.94
Packed-2      9     15    6.76
8.7 Experimental Results and Analysis
Experimental Platform: As mentioned in Section 7.2, our target FPGA device is the
XC4VLX100 [1] on a Nallatech BenDATA™ [3] development board, from which the
performance and area results for the MoClib NoC library are extracted. To characterize
our library of routers accurately for area and performance, we model them in structural
VHDL and use Xilinx ISE 8.2i [1] to follow the FPGA design flow for the router models.
8.7.1 Execution Time Results
Algorithm 8.1 is implemented in C++ using Standard Template Library (STL) data
structures and is supported by Perl scripts for benchmark processing. We execute the
algorithm on our chosen benchmarks on an AMD Opteron processor running Linux, operating
at 2.4 GHz with 3 GB of RAM, and report the results in Table 8.2.
The execution time of the algorithm for a benchmark is directly related to the time
complexity of the mesh generation and IP mapping phases. With the exception of one
benchmark (FFT, with 15 cores), the average execution time was around 8 minutes.
Figure 8.5: Application Benchmarks (task graphs: MPEG4, VOPD, MWD, FFT)
Table 8.3: MPEG4 Area Improvement

Iteration #   U’    # Meshes × Mappings   NoC Area (slices)
0             12    1437004800            4350
1             9     725760                3860
2             8     80640                 3610
3             7     5040                  3370
4             6     1440                  3120
8.7.2 Area Results
In order to determine the impact of the proposed algorithm on area, we compare the
results in this chapter with the solution provided by the baseline NoC described in
Chapter 3. This traditional multi-clock NoC has one IP attached to every router and
does not support the hybrid architecture presented in Chapter 5.

Benchmarks: As described in Section 7.1, we utilize multi-processor SoC
applications modeled as directed Communication Task Graphs (CTGs). Vertices in the graph
represent IP cores, and edges represent precedence and bandwidth requirements. We apply
our technique to four widely used application benchmarks (FFT, MPEG4, VOPD and
MWD) [39] and six synthetic benchmarks [36] that represent a variety of communication
patterns frequently encountered in multi-processor designs.
Using our hybrid architecture and integrated design flow, results were obtained for
various benchmarks. For similar bandwidth constraints applied through task graph edges,
Figure 8.6 compares the synthesized topology area between the proposed and baseline
approaches. With the number of cores in the benchmarks varying between 6 and 15, it
can be seen that there is an average reduction of 21.6% (maximum of 26%) in the NoC
area, which frees resources for the implementation of application logic. The bandwidth
constraints were translated into the original design, and the area was estimated
in slices. Note that the CAD tool does not optimize the design for execution
time; its primary goal is to ensure that the required bandwidth is satisfied. For
all of the application benchmarks, our approach was able to obtain alternate topologies
utilizing our hybrid router library with fewer FPGA resources.
8.8 Conclusions
In Chapters 5 to 8 of this dissertation, we present a multi-clock hybrid two-layer router
architecture suitable for FPGAs. We analyze the merits and issues involved with the
architecture and characterize a library of network components for area and performance.
For equivalent area overhead, our proposed architecture achieves 20.4% higher
bandwidth than a baseline approach. We automate the NoC design
cycle by integrating the router with an algorithm that optimizes for FPGA area while
satisfying the required bandwidth. Experimental results for a set of real applications and
synthetic benchmarks show an average reduction of 21.6% in FPGA area (maximum of
26%) for equivalent bandwidth constraints when compared with a baseline approach.
Figure 8.6: NoC Benchmark Results (per benchmark: NoC topology area in slices for the hybrid router and the baseline (MoCReS), and % area savings for equal bandwidth)

Chapter 9
NoC based System-on-Chip Development
In this chapter we present the design methodology that we have developed for
implementing a complete SoC application using our on-chip network backbone. The methodology
comprises the following three parts:
• IP Core Characterization
• Network Interface Implementation
• NoC Floorplanning
After presenting the motivation behind our methodology, we address our
contribution to each of the parts listed above in greater depth. In this chapter, we
also present a case study of a multi-processor Image Compression
Application, wherein we obtain real VHDL cores and compare alternate
implementations on our target FPGA.
Figure 9.1: ITRS 2007 Showing IP Design Reuse Trends
9.1 Motivation
With increasing device capacities and design sizes, ITRS 2007 [43] advocates high design
re-use as the solution paradigm. As shown in Figure 9.1, the percentage of the whole
design that is constructed from re-used IPs is expected to increase steadily in the next
several years. Figure 9.1 also presents the increasing future trend for the percentage
of reconfigurable components in future designs. The above phenomenon along with
increasing time-to-market constraints motivates the standardization of IP cores in FPGA
based designs.
Multimedia applications are prevalent in the automotive industry (Global
Positioning Systems), medical imaging, HDTVs, and military/space applications. Due to their
high computation requirements and available parallelism, FPGAs are increasingly
becoming a target for these applications. We implement a complete framework for NoC
based multi-processor implementation in FPGAs, which is presented in the subsequent sections.
9.2 IP Core Library
9.2.1 Core Abstraction
IP Abstraction: Towards implementing a fully customizable Network Interface and
NoC framework, we make certain important assumptions about the properties of IPs. Based
on the degree of hardware commitment, the IPs implemented in an FPGA can be classified
as soft, firm or hard IPs. Even though our framework can be adapted to all of these
kinds of IPs, we restrict ourselves to soft IP cores in this discussion. We assume that
soft IPs have the following properties:
• Supports an RTL implementation under one main hierarchy.
• All the inputs/outputs to/from the IP block are registered.
• Data transfers to/from the IP block take place through a finite number of buses
with large data widths (32 bits for example).
9.2.2 Xilinx [1] IP Support
Xilinx MicroBlaze [1] offers soft configurable IP cores for software implementation of
the design. This feature adds tremendous flexibility to applications implemented in
FPGAs. The alternative hard processors available in recent FPGA devices are the
PowerPC hard processors. These processors offer very high performance but are
limited in number. The soft MicroBlaze IPs, as opposed to the embedded processors, must
be implemented in the configurable logic of the FPGA, thereby competing for resources with
the other parts of the design. However, unlike the hard embedded processors, tens
of these MicroBlaze cores can be implemented using present FPGA devices. Figure 9.2
presents the architecture of MicroBlaze IPs.
Figure 9.2: Xilinx [1] MicroBlaze System Design and Architecture
The main features of soft IPs in Xilinx FPGAs can be classified into computation based
features and communication based features:
Computation Based:
• Variable Size Cache implementations
• High Throughput Floating Point Support
• Abundant Debug and Memory Management Capabilities
Communication Based:
• Flexible and Efficient Processor Local Bus (PLB) or On-chip Peripheral Bus (OPB)
• Up to 16 Fast Simplex Links (FSLs), each 32 bits wide, for interfacing external modules
With the above logic and interface support, MicroBlaze IPs can be efficiently
implemented in a multi-processor SoC. The standardized interfaces of the IP readily support
the NoC paradigm for communication. In particular, the 16 FSL links available to
interconnect external co-processors or other computation modules can serve as the interface to
the NoC.
We have designed the on-chip network and the network interface with
these communication requirements in mind. For example, multiple FSL links (each 32 bits wide)
emerging from the MicroBlaze cores could be interfaced to the network backbone using our
multi-module customizable NI. Certain IP communications might be less time
critical. In those cases, the data could be packetized and transmitted over the P-Layer
of the router, thereby achieving tremendous parallelism in the applications. On the
other hand, time-critical data communication requiring predictable latencies could be
pre-scheduled through the C-Layer of the NoC.
In addition to the above soft cores, Xilinx CORE Generator [1] provides a rich set
of IP cores optimized for Xilinx FPGAs. These IPs range from audio,
video and image processing to automotive industry applications to FPGA-specific
storage elements. However, Xilinx does not automatically synthesize a suitable
communication architecture for the IPs. We advocate our NoC based framework for this IP
based design environment. In the next section we obtain a set of freely available soft IP
cores [44], customize them for our NoC-NI framework, and present the overheads
involved in the customization process.
9.2.3 IP Library Characterization
We obtain a set of publicly available cores [44] and develop an IP library that will be
compatible with our Network Interface and NoC. These application IPs can serve as
individual cores of a wide variety of multi-processor applications.
Later in this chapter we present the area characterization of the library of IPs we
consider in this study. As mentioned before, each of the IPs will be wrapped by a
suitable instance of the Network Interface that will be described in the next section.
Further, in subsequent sections we study the area and power overheads incurred by our
Network Interface when used with this library of IPs.

Figure 9.3: IP Core Abstraction and NI Wrapper (IP core with computation unit and data/control buses inside the NI wrapper; up to 4 IP module connections)
9.3 Network Interface Implementation
The modified two-layer router architecture presented in Chapter 5 supports high-throughput
intra-router connections (C-Layer) in addition to the packet-switched online routing
layer (P-Layer). This hybrid router sustains a high average bandwidth per port, thereby
increasing the overall performance of the communication architecture. To support
varied application needs, a library of such routers (MoClib) was developed and
integrated with the automatic topology synthesis framework presented in Chapter 8.
As a part of this research, we make use of a generic two-layer network interface
compatible with our library of routers. The primary objective of this network interface
is to standardize the external communication of the IP core, thereby hiding the
implementation details of the interconnect. In this section, we first present the design goals
behind this network interface and then describe its compatibility with our IP abstraction.
This work was carried out in collaboration with another student. See his thesis [45] for
detailed information on the design goals, architecture and implementation of the Network
Interface.
As mentioned above, data transfers can take place to/from the core through a finite
number of buses. In designing a customizable Network Interface for this generic
core abstraction, we set an upper bound of 4 on the number of entry/exit points of the
IP (called IP Modules).
9.3.1 Primary Design Goals
Some of the key design objectives behind the NI are,
• Hybrid router compatibility
• Customizablility
• Low Area
Hybrid router compatibility: Being compatible with our library of hybrid two-
layer routers was our most important design goal. The NI must be able to support data
transfers between variable number of IP cores through the Circuit Switched Layer as well
as the Packet Switched Layer. If required, the NI needs to resolve operating frequency
differences between the communicating IPs.
Customizability: The RTL description of the NI needs to support certain important
design parameters. These parameters allow seamless integration of the NI with a library
of IP cores. The main parameters of the NI are,
• Number of Modules in the IP
• Bus Widths in C- and P- Layers
• FIFO Depths
• Configuration Modes
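The parameter list above can be pictured as a small configuration record. The following is
only an illustrative sketch: the class name NIConfig, its field names, the FIFO depth and
the mode label are hypothetical and are not taken from the RTL of the NI. The upper bound
of 4 modules is the one stated earlier in this section.

```python
from dataclasses import dataclass

MAX_IP_MODULES = 4  # upper bound on IP entry/exit points stated in the text


@dataclass(frozen=True)
class NIConfig:
    """Hypothetical record of the NI design parameters (all names are assumed)."""
    num_modules: int    # number of IP Modules (entry/exit points) in the IP
    c_layer_width: int  # bus width of the circuit-switched layer, in bits
    p_layer_width: int  # bus width of the packet-switched layer, in bits
    fifo_depth: int     # depth of each NI FIFO, in entries
    config_mode: str    # configuration mode label (assumed naming)

    def __post_init__(self):
        # Enforce the module bound and reject non-positive widths or depths.
        if not 1 <= self.num_modules <= MAX_IP_MODULES:
            raise ValueError("num_modules must be between 1 and 4")
        for name in ("c_layer_width", "p_layer_width", "fifo_depth"):
            if getattr(self, name) <= 0:
                raise ValueError(name + " must be positive")


# Example values only: a core with four 64-bit entry points.
example_ni = NIConfig(num_modules=4, c_layer_width=64, p_layer_width=16,
                      fifo_depth=8, config_mode="packet")
```

A record of this kind would be validated once, before the NI is instantiated for a given
core.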
Figure 9.4: Triple DES IP Interface (Triple DES IP core inside the NI wrapper; control and
status signals CLK, RST, Func. Sel, LD_Data, LD_Key and OUT_RDY; 64-bit buses Key 1 (0 : 63),
Key 2 (0 : 63), Key 3 (0 : 63), Data (0 : 63) and Data_Out (0 : 63))
9.3.2 Customized IP Library
The IP cores presented in the previous sections need to be interfaced with our NoC. To do
so, we customize the NI to suit the requirements of each IP. After preparing
the IPs for our NoC framework, we characterize them for their area and power overheads.
The results are presented in Figure 9.5. It can be seen that the average area increase
due to the NI overhead is 13.82% (the mean of the per-core increases listed in Figure 9.5),
while there is an average 10.37% increase in dynamic power incurred to interface the IPs
with our NoC.
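As a rough cross-check of the averages, the per-core percentage increases can be recomputed
from the slice and power values of Figure 9.5. The sketch below assumes that the smaller
value in each pair is the IP without the NI and the larger one is the IP with the NI, and
that the average is an unweighted mean over the six cores. The variable and function names
are illustrative.

```python
# Per core: (slices without NI, slices with NI, power without NI in mW, power with NI in mW).
# Values are copied from Figure 9.5; which column is "without NI" is an assumption.
CORES = {
    "DCT": (1246, 1324, 74.22, 82.35),
    "Color Converter": (658, 765, 56.17, 62.69),
    "Triple-DES": (1615, 1926, 119.26, 127.51),
    "SIMD CPU": (5911, 6211, 87.67, 92.56),
    "LCD Ctrl": (218, 261, 36.45, 41.13),
    "Quantizer": (1175, 1367, 33.08, 37.81),
}


def pct_increase(before, after):
    """Percentage increase of `after` relative to `before`."""
    return 100.0 * (after - before) / before


area_pct = {name: pct_increase(v[0], v[1]) for name, v in CORES.items()}
power_pct = {name: pct_increase(v[2], v[3]) for name, v in CORES.items()}

# Unweighted means over the six cores.
mean_area_pct = sum(area_pct.values()) / len(area_pct)
mean_power_pct = sum(power_pct.values()) / len(power_pct)

for name in CORES:
    print(f"{name:16s} area +{area_pct[name]:.2f}%  power +{power_pct[name]:.2f}%")
print(f"mean area +{mean_area_pct:.2f}%  mean power +{mean_power_pct:.2f}%")
```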
9.4 NoC Floorplanning
In this section we first present an overview of floorplanning multi-processor applications
in FPGAs. Xilinx Planahead [1] is a tool that is available as part of Xilinx ISE and is
used for hierarchical place and route of a design. A main advantage of the NoC approach is
that it has predictable power-performance metrics. Compared with ASICs, on-chip networks
implemented in FPGAs have higher unpredictability in design metrics due to the limited
resources available. In this section we also quantify the variation in power-performance
of the NoC due to this unpredictability. However, addressing a complete CAD flow for
floorplanning that includes these issues is beyond the scope of this study.

                 I/O Bus Width (bits)      Area (Slices)           Dynamic Power (mW)
Core             Input           Output    w/o NI  IP+NI  %Incr.   w/o NI   IP+NI  %Incr.
DCT              8               13          1246   1324    6.26    74.22   82.35   10.95
Color Converter  8, 8, 8         8, 8, 8      658    765   16.26    56.17   62.69   11.61
Triple-DES       64, 64, 64, 64  64          1615   1926   19.25   119.26  127.51    6.92
SIMD CPU         32              32, 32      5911   6211    5.08    87.67   92.56    5.58
LCD Ctrl         -               8            218    261   19.73    36.45   41.13   12.84
Quantizer        12, 8           8           1175   1367   16.34    33.08   37.81   14.30
Figure 9.5: IP Properties and Customization Overhead

Figure 9.6: IP Customization Overhead: Area and Power

Xilinx Planahead [1]: The Xilinx Planahead [1] utility has recently become available
as a part of the existing Xilinx design flow for FPGA implementation. The objective
behind this tool is to support hierarchical place and route (from the synthesized design),
thereby improving the overall performance of the design.
Advantages: 1) this CAD enhancement can be used to identify and move critical
blocks in the design, 2) it directs place and route based on pin and memory constraints, and
3) it supports implementing the clock tree with logic.

An increase in design size and complexity is traditionally accompanied by an increase
in the amount of design re-use. This paradigm leads to a significant increase in
multi-processor applications that support standardized IPs. As the number of IPs continues
to scale, floorplanning them along with the communication architecture is a great
challenge. In this research we attempt to alleviate the problems in implementing an
efficient communication architecture. Presently, IPs are independently synthesized, placed
and routed based on their performance, power and resource requirements. Also, until now
power has hardly featured as a priority in the traditional FPGA design flow. However,
current design complexities and the increasing need for portable hand-held applications are
motivating power-aware CAD enhancements to the design flow.
9.4.1 Synthesis of a Predictable NoC
As we mentioned before, the limited set of resources supported by FPGAs makes it a
great challenge to achieve predictable performance and power goals.
Some of the factors that contribute to this complexity are:
1. Unknown IP core aspect ratio and resource needs
2. Variable router complexity and topology (based on application)
3. Variable inter-port, inter-router and inter-core distances within the implementation
Figure 9.7: NoC Mesh Floorplan
Until now, floorplanning in the design flow (Planahead) has relied on manual area
constraints placed on multiple cores, and the design takes a long time to converge. We
believe this would become almost infeasible as the number of cores increases. Figure 9.7
presents the area constraints applied for a 3 × 2 mesh NoC. Upon completion of place
and route, the resource congestion of the mesh is shown in Figure 9.8.
In this research we have pre-characterized the network components and obtained
delay and power models for them. When this knowledge of the communication architecture
is used to floorplan the multi-processor application and the NoC, it could lead to
rapid removal of timing violations and faster design convergence. Furthermore, while
proposing a CAD enhancement, it needs to be ensured that the existing industry-level design
flow for FPGA implementation is only marginally changed.
Figure 9.8: NoC Mesh Routing
NoC links implemented in FPGAs are time-multiplexed over various communications.
The link length has a significant impact on the performance of the NoC. In our routers,
buffering is performed only at the input ports. Therefore, the inter-router links appear in
the critical path of the design. Also, with increasing link lengths and distances between
routers (increased switched capacitance), the link power tends to dominate the overall
power consumption. While implementing an NoC with predictable power-performance
metrics, the above factors need to be considered. We present the routing resource
characterization results in Figure 9.9. Xilinx FPGA Editor [1] is used to vary the routing
between various points in the NoC; the delay variations are measured in ns, and
dynamic power is measured at a 200 MHz clock frequency.
Figure 9.9: Routing Resources: Delay and Power
It can be seen that with every inter-router connection (NoC link) that spans 4 CLBs, the
delay could increase by up to 0.5 ns (2 × 0.25). As the link delay appears in the critical
path, a router operating at 250 MHz could suffer a performance degradation of up to 12.5%,
which is significant. Through efficient floorplanning and the choice of appropriate
frequencies in our multi-clock NoC framework, these variations in delay can be optimized.
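The 12.5% figure follows from comparing the added link delay with the clock period. The
short sketch below reproduces that arithmetic. It assumes that the full 0.5 ns is added to
a critical path that exactly fills the 4 ns period of a 250 MHz clock. The function names
are illustrative.

```python
def added_delay_fraction(clock_mhz, extra_delay_ns):
    """Extra link delay expressed as a fraction of the nominal clock period."""
    period_ns = 1000.0 / clock_mhz
    return extra_delay_ns / period_ns


def degraded_frequency_mhz(clock_mhz, extra_delay_ns):
    """Clock frequency if the extra delay simply lengthens the critical path."""
    period_ns = 1000.0 / clock_mhz
    return 1000.0 / (period_ns + extra_delay_ns)


extra_ns = 2 * 0.25                               # link spanning 4 CLBs, as in the text
fraction = added_delay_fraction(250.0, extra_ns)  # 0.5 ns / 4 ns period
new_fmax = degraded_frequency_mhz(250.0, extra_ns)

print(f"added delay: {100 * fraction:.1f}% of the clock period")
print(f"resulting frequency: {new_fmax:.1f} MHz")
```

Under these assumptions the added delay is 12.5% of the period, and the router would have
to be clocked at roughly 222 MHz instead of 250 MHz.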

We believe the above analysis will limit the number of IPs that can be added to
a router. When the number of ports is increased, the performance of the NoC could
theoretically increase while keeping the area overhead low. However, the floorplan
perspective presented above needs to be considered while estimating the actual benefits
of routers that support an increased number of IPs.
9.5 Image Compression Implementation
Image processing applications are in general computation intensive due to the need for
processing several million pixels when operating at reasonable resolutions. Further, these
applications have portability and constrained time-to-market requirements and therefore
merit FPGAs as the target architecture. NoC based multi-media communication
architectures are expected to be highly competitive alternatives to their existing bus based
counterparts. The main reason for this is the natural division of these applications into
multiple cores that require highly parallel communications.

With FPGAs increasingly used in portable devices and space applications, image
compression is an important application implemented on them. Image processing hardware
can be synthesized and implemented on FPGAs. Alternatively, the same application
can be implemented as software on the MicroBlaze or other processors. In
this discussion we restrict ourselves to the hardware implementation (and not a MicroBlaze
implementation) of the image compression application.
Using these experiments, we demonstrate the feasibility of our NoC framework to
implement multimedia applications on FPGAs. Furthermore, as opposed to custom
designs, the freely available cores that we utilize to construct our experiments serve to
significantly reduce the design and test time. As the multi-processor domain of digital
design continues to evolve, more sophisticated applications could be implemented using
our NoC framework for FPGAs.
Application Description: The name JPEG stands for Joint Photographic Experts
Group. It specifies the way in which an input image is transformed/compressed and
stored. The JPEG conversion is performed through a set of IPs that convert the raw
input data format into the JPEG standard. It proceeds through multiple stages, namely
RGB to YCrCb conversion, DCT, Huffman encoding and Run Length Encoding
(RLE), and the result is finally stored in memory. Traditionally, a color conversion from
RGB space to YCrCb space is performed before storage and transmission. This leads to a
drastic bandwidth reduction while sustaining a marginal quality trade-off. The application
is designed to accept 352 × 288 pixel images in bitmap format as input and to produce
output in JPEG format. Figure 9.10 presents the JPEG application that we have implemented
in this case study.
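For readers unfamiliar with the first stage, the color conversion can be sketched in
software as follows. This is only an illustration: it uses the full-range weights commonly
used with JPEG/JFIF files, which is an assumption, since the coefficients and fixed-point
format of the hardware color converter core are not stated here. The function name is
hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to 8-bit (Y, Cb, Cr).

    Uses the full-range JFIF weights (an assumption; the hardware core used in
    the case study may use different coefficients or fixed-point arithmetic).
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b

    def clamp8(x):
        # Round to the nearest integer and clamp to the 8-bit range.
        return max(0, min(255, int(x + 0.5)))

    return clamp8(y), clamp8(cb), clamp8(cr)


print(rgb_to_ycbcr(255, 255, 255))  # white
print(rgb_to_ycbcr(0, 0, 0))        # black
print(rgb_to_ycbcr(255, 0, 0))      # pure red
```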
9.5.1 NoC Implementation
The Network Interface described in the previous section is customized for implementing
the IPs of the image compression application. The objective behind this experiment is to
implement a typical SoC application on an FPGA along with the network components
and characterize the NoC design for area, performance and power. Present industry
applications operating at very high frequencies and having various highly complex pro-
cessing units certainly merit these sophisticated communication architectures. However,
as we could not obtain these industry level benchmarks for performing our experiments,
we restrict ourselves to freely available cores from Opencores [44].
The network channel width is set at 16 bits per flit. We assume each packet to
consist of a block of 8 × 8 pixels, each represented using 16 bits. Therefore, an
entire packet consists of 64 flits of 16 bits each, which together carry the information
of one block independently. To study the area and power overheads of the NoC in a
typical application scenario, we implement various NoC topology versions. Figure 9.11
presents the various configurations that we have synthesized in order to study the NoC
overheads. We have retained a mesh topology for all the experiments as we already have
a library of router components that supports its flow control. Furthermore, all the
multi-port routers utilized in implementing this application use only the P-Layer. We make
this decision keeping in mind the modest bandwidth requirements of this application.
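The packet sizing described above can be checked with a few lines of arithmetic. The sketch
below assumes that the 352 × 288 input is tiled into non-overlapping 8 × 8 blocks with one
packet per block, and it counts blocks for a single color component. The constant names are
illustrative.

```python
FLIT_BITS = 16          # network channel width, bits per flit
BLOCK_SIDE = 8          # each packet carries one 8 x 8 block
IMAGE_W, IMAGE_H = 352, 288

# One flit per pixel value, since each value is represented using 16 bits.
flits_per_packet = BLOCK_SIDE * BLOCK_SIDE
packet_bits = flits_per_packet * FLIT_BITS

# The image dimensions are exact multiples of the block size.
assert IMAGE_W % BLOCK_SIDE == 0 and IMAGE_H % BLOCK_SIDE == 0
blocks_per_component = (IMAGE_W // BLOCK_SIDE) * (IMAGE_H // BLOCK_SIDE)
flits_per_component = blocks_per_component * flits_per_packet

print(f"{flits_per_packet} flits per packet, {packet_bits} bits per packet")
print(f"{blocks_per_component} blocks (packets) per color component")
print(f"{flits_per_component} flits per color component")
```

Each packet therefore carries 1024 bits of payload, and one color component of a frame
corresponds to 44 × 36 = 1584 packets.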
Figure 9.10: JPEG Application (Raw BMP Input -> RGB-to-YCrCb conversion -> three parallel
DCT and Quantizer paths -> Run Length Encoder -> Compressed JPEG Output)
Figure 9.11: NoC Implementation Alternatives (four configurations: (a) one router, (b) two
routers, (c) four routers, (d) eight routers. Legend: RGB = RGB2YCrCb, DCT = Discrete
Cosine Transform, QNT = Quantizer, RLE = Run Length Encoder)
9.6 NoC Analysis
In this section we present the detailed area and power overheads of the NoC design.
Figure 9.12 presents the comparison between these alternate implementations. The four
configurations shown in this figure are the same as those shown in Figure 9.11. The
last column presents results from a flat synthesis implementation of the whole
application. Upon enforcing tight timing constraints, the maximum operating frequency of
the application was around 86 MHz in our target device.

We refrain from making a detailed performance comparison of our NoC architecture
with the flat synthesis approach, as the amount of parallelism inherently present in the
application is very marginal. Our experimental results mainly serve to demonstrate the
area and power trade-offs of the NoC that must be weighed while choosing an appropriate
implementation methodology.
Area and Power Overhead: Within the limited available logic and routing resources,
the FPGAs need to contain the user logic along with the communication architecture.
During all phases of our design we have accomplished the design goals with as
few resources as possible. Routers, being the central component of the network, are
replicated multiple times to interface all the IPs to the network. In this research
we determine the area overhead of the NoC as a percentage of the total slices utilized by
the application and the NoC. To accurately estimate the logic, we 1) floorplan the design
manually, 2) place large constraints on the area and 3) obtain the place and route results.
Figure 9.12 shows that the overall area utilization of the image compression application
gradually increases with the number of routers. Also, the percentage overhead of the NoC
increases and reaches 11.23% for the 8-router topology (configuration d). As the number of
cores and the amount of parallelism scale, the performance and power benefits obtained
through this approach are expected to outweigh the area overhead seen above. Figure 9.12
also presents power overheads for the various configurations. For the purposes of comparing
power consumption between alternate implementations, we consider only the dynamic
component. It can be seen that, as the number of routers varies, dynamic power follows a
different trend compared to the area overhead. Compared with configuration (a), the
two-router version (configuration b) sustains lower dynamic power due to reduced switch
power. Even though there is a marginal increase in area, the reduced switching activity in
the routers leads to lower dynamic power. The average power overhead of the on-chip network
across all configurations was around 18% for the JPEG application.
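The overhead figures quoted above can be related back to the values of Figure 9.12 with a
short calculation. The sketch below converts the percentage NoC area overhead of each
configuration into an approximate slice count and averages the percentage NoC dynamic power
over the four configurations. The dictionary layout and names are illustrative; the numbers
are those of Figure 9.12.

```python
# Values from Figure 9.12 for configurations (a) to (d).
CONFIGS = {
    "a": {"total_slices": 9947, "noc_area_pct": 4.28, "noc_power_pct": 17.74},
    "b": {"total_slices": 10158, "noc_area_pct": 5.93, "noc_power_pct": 16.86},
    "c": {"total_slices": 10336, "noc_area_pct": 7.76, "noc_power_pct": 17.28},
    "d": {"total_slices": 10743, "noc_area_pct": 11.23, "noc_power_pct": 19.11},
}


def noc_slices(cfg):
    """Approximate slices occupied by the NoC, from the total and the overhead."""
    return cfg["total_slices"] * cfg["noc_area_pct"] / 100.0


# Unweighted mean of the NoC share of dynamic power across the configurations.
mean_noc_power_pct = sum(c["noc_power_pct"] for c in CONFIGS.values()) / len(CONFIGS)

for name, cfg in CONFIGS.items():
    print(f"({name}) NoC area is about {noc_slices(cfg):.0f} slices")
print(f"mean NoC dynamic power overhead: {mean_noc_power_pct:.2f}%")
```

The mean of the four power percentages is about 17.75%, consistent with the figure of
around 18% quoted above.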
Configuration           (a)      (b)      (c)      (d)      Flat Synthesis
Total Area (Slices)     9947     10158    10336    10743    9546
% NoC Area Overhead     4.28     5.93     7.76     11.23    -
Total Dyn. Power (mW)   337.19   322.42   335.64   367.85   274.77
% NoC Dyn. Power        17.74    16.86    17.28    19.11    -
NoC/Design FMax         108 MHz  186 MHz  242 MHz  273 MHz  86 MHz
Figure 9.12: JPEG Configuration: Area and Power Overhead Analysis

Chapter 10
Summary of Contributions and Future Work
In the following subsections, we briefly summarize the contributions made in this
dissertation and suggest directions for future work that builds on them.
10.1 Contributions
This dissertation primarily focuses on the development of a high performance communication
architecture for current Field Programmable Gate Arrays using Networks-on-Chip.
Present SoCs implemented in FPGAs merit these sophisticated on-chip network based
communication architectures to offset the performance bottleneck inherent in bus based
architectures.
10.1.1 NoC Framework: MoCReS
As a first step towards designing a novel communication architecture, we developed
MoCReS, an area-efficient Multi-Clock On-Chip Network for Reconfigurable Systems, as
an NoC framework for FPGAs. The design addresses area, performance and multi-clock
capability, which are the primary design goals in NoC design for FPGAs. Our 5-port
virtual cut-through router has an area overhead of only 282 Virtex-4 slices (a marginal
0.57% of XC4VLX100) and operates at 357 MHz, supporting a competitive data rate of
2.85 Gbit/s.
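The two headline numbers can be sanity-checked with simple arithmetic. The sketch below
rests on two assumptions that are not stated in this summary: that the XC4VLX100 provides
49,152 slices (taken from the Virtex-4 family data sheet), and that the quoted data rate is
the clock frequency multiplied by an 8-bit data path. The constant names are illustrative.

```python
ROUTER_SLICES = 282        # stated area of the 5-port router
DEVICE_SLICES = 49152      # assumed slice count of the XC4VLX100
CLOCK_MHZ = 357.0          # stated operating frequency
DATA_WIDTH_BITS = 8        # assumed data path width (not stated in this section)

# Router area as a percentage of the device.
area_pct = 100.0 * ROUTER_SLICES / DEVICE_SLICES

# Data rate if one data word is transferred per clock cycle.
data_rate_gbps = CLOCK_MHZ * 1e6 * DATA_WIDTH_BITS / 1e9

print(f"router area: {area_pct:.2f}% of the device")
print(f"data rate: {data_rate_gbps:.3f} Gbit/s")
```

Under these assumptions the area comes to about 0.57% of the device and the data rate to
about 2.856 Gbit/s, in line with the values quoted above.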
10.1.2 Power Characterization
We determine the power consumption of the NoC framework on FPGA and present the various
components of the power consumed in detail. Further, we analyze the power
trade-offs associated with our design novelties by comparing the design with a baseline
approach implemented on the same target device. Results show a marginal dynamic power
overhead (11.5%) for the performance advantage observed in our multi-clock NoC design.
Further, we associate the power consumed by various components in the NoC architecture
to the underlying FPGA resources utilized by them.
10.1.3 Hybrid 2-Layer Architecture
To address the bandwidth limitations of MoCReS, we extend the design by developing
a hybrid two-layer router architecture. The novel design of the network component
supports high throughput time-multiplexed circuit-switched connections between IPs
interfaced to the same router, in addition to the packet-switched communication layer.
Various instances of the NoC components are characterized for area and performance in
the form of the MoClib NoC component library. The advanced router architecture achieves
an average improvement of 20.4% in NoC bandwidth (maximum of 24%) compared to a
traditional NoC.
10.1.4 Performance and Power Analysis
The proposed alternate router architecture inherently supports higher throughput. We
quantify the performance benefits over our baseline approach using a previously developed
experimental platform. Utilizing this modified router architecture to instantiate
mesh topologies involves analyzing power trade-offs during topology synthesis. We
thoroughly characterize the novel router architecture and its library of network components
for power consumed.
10.1.5 CAD Flow: Topology Synthesis
A CAD tool has been developed for implementing the design flow for FPGA NoC Topology
Synthesis. It implements an exhaustive search algorithm with multiple phases that
optimizes the area of the NoC while meeting the bandwidth requirements of the application.
The MoClib NoC component library is used to perform the NoC topology design
space exploration. For any given application specified as a task graph, our integrated
synthesis framework determines a suitable NoC topology that satisfies the bandwidth
requirements while optimizing for the area overhead. We report the results for a wide
set of application and synthetic benchmarks, represented as task graphs. Results show
an average reduction of 21.6% in FPGA NoC area (maximum of 26%) for equivalent
bandwidth constraints when compared with a baseline approach.
10.1.6 NoC Based SoC Development
A framework for characterizing IP cores has been developed. To integrate variable IP
cores along with our NoC, a two-layer network interface was designed with
customizability, induced latency and area utilization as its primary concerns. A customized
library of IP cores was characterized for area and power overheads. This library of
NoC compatible IPs was utilized to implement an image compression application using
FPGA based NoCs. We evaluate the area and power overhead involved in our alternate
implementation methodology.
10.2 Future Work
In this section we outline the future research directions that could serve as an extension
of our work.
FPGA CAD Enhancement: As the number of processing elements within a SoC
increases, the design time tends to increase due to the manual effort required. Therefore,
it is important to automate floorplanning of the on-chip network and IPs, taking into
consideration the constraints of the application. Real IP cores have their independent
pin, BRAM and logic/routing resource requirements. We have modeled the resource,
performance and power overheads of our NoC and IP library in our research. An effi-
cient CAD flow that takes into consideration these resource requirements needs to be
developed.
Multi-Processor Benchmarking: Standardization of multi-processor benchmarks
would greatly help future research in enhancing FPGA based NoCs. It is certain that the
semiconductor industry is driving towards more and more processing elements. However,
it is important to benchmark SoC applications that have high inherent parallelism.
Such benchmarks would model the realistic traffic scenarios and congestion that are typical
of FPGAs. Furthermore, similar to bus-based architectures, IP cores developed in the future
could be standardized for NoC centric communication.
FPGA Device Support: In this research we have considered NoCs suitable for
implementation on FPGAs. Overlaying on-chip networks on an existing FPGA device
offers tremendous flexibility at the cost of performance. Future heterogeneous FPGAs
could incorporate hardware support for on-chip networks and thereby increase the overall
performance of the SoC. As pointed out in [46], it is an interesting design challenge to
partition the NoC components over hard embedded and soft configurable FPGA blocks.
Bibliography
[1] Xilinx Inc. http://www.xilinx.com.
[2] International Sematech. International Technology Roadmap for Semiconductors, 2005
edition.
[3] Nallatech Inc. http://www.nallatech.com.
[4] Theodore Marescaux et al. Interconnection Networks Enable Fine-Grain Multi-Tasking
on FPGAs. In FPL’2002, pages 795–805, 2002.
[5] A. Kumar et al. An FPGA Design Flow for Reconfigurable Network-Based Multi-
Processor Systems on Chip. In DATE’07, 2007.
[6] N. Kapre. Packet-Switched On-Chip FPGA Overlay Networks. MS thesis, California
Institute of Technology, 2006.
[7] Clint Hilton and Brent Nelson. PNoC: a flexible circuit-switched NoC for FPGA
based systems. In IEEE Proc. Computers and Digital Techniques, 2006.
[8] N. K. Kavaldjiev and G. J. M. Smit. A survey of efficient on chip communications
for SoC. In 4th PROGRESS Symp. on Embedded Systems, Nieuwegein, Netherlands,
2003.
[9] OCP-IP. http://www.ocpip.org.
[10] Manuel Saldaña, Lesley Shannon, and Paul Chow. The Routability of Multiprocessor
Network Topologies in FPGAs. In SLIP’06, pages 49–56, 2006.
[11] Balasubramanian Sethuraman, Prasun Bhattacharya, Jawad Khan, and Ranga Vemuri.
LiPaR: A Light-Weight Parallel Router for FPGA-based Networks-on-Chip.
In Great Lakes Symposium on VLSI, 2005.
[12] William J. Dally and Brian Towles. Route Packets, Not Wires: On-Chip Interconnection
Networks. In Design Automation Conference, pages 684–689, 2001.
[13] Luca Benini and Giovanni De Micheli. Networks on Chips: A New SoC Paradigm.
In IEEE Computer, 2002.
[14] S. Kumar, A. Jantsch, J.-P. Soininen, M. Forsell, M. Millberg, J. Öberg, K. Tiensyrjä,
and A. Hemani. A Network on Chip Architecture and Design Methodology. In IEEE
International Symposium on VLSI, 2002.
[15] J. Dielissen et al. Concepts and implementation of the Philips network-on-chip. In
Proceedings of the IPSOC’03, 2003.
[16] J. A. Kahle et al. Introduction to the Cell multiprocessor. In IBM Journal of Research
and Development, 2005.
[17] Sonics - smart interconnects. http://www.sonicsinc.com.
[18] Arteris. http://www.arteris.com.
[19] Silistix. http://www.silistix.com.
[20] F. Moraes, N. Calazans, A. Mello, L. Moller, and L. Ost. HERMES: an Infrastructure
for Low Area Overhead Packet-Switching Networks on Chip. In INTEGRATION,
The VLSI Journal, 2002.
[21] T. A. Bartic et al. Topology Adaptive Network-on-Chip Design and Implementation.
In Computer and Digital Techniques, IEE Proceedings, pages 467–472, 2005.
[22] N. Kapre, N. Mehta, M. deLorimier, R. Rubin, H. Barnor, M. J. Wilson,
M. Wrighton, and A. DeHon. Packet-Switched vs. Time-Multiplexed FPGA Overlay
Networks. In IEEE Symposium on Field-Programmable Custom Computing Machines,
2006.
[23] N. Kavaldjiev and G. J. M. Smit. An Energy-efficient Network-on-Chip for a
Heterogeneous Tiled Reconfigurable Systems-on-Chip. In EUROMICRO Symposium on DSD,
pages 492–498, 2004.
[24] Fernando Moraes et al. A Low Area Overhead Packet-switched Network on Chip:
Architecture and Prototyping. In IFIP VLSI-SOC, 2003.
[25] T. Bartic et al. Topology Adaptive Network-on-Chip Design and Implementation.
In IEEE Proc Computer Digit. Tech, 2005.
[26] Daewook Kim, Manho Kim, and Gerald E. T. Sobelman. Asynchronous FIFO Interfaces
for GALS On-Chip Switched Networks. In Intl. SoC Design Conference’2005,
pages 186–189, 2005.
[27] Vijay Swaminathan. Performance Analysis of Multi-Clock NoC for FPGAs. Master’s
thesis, University of Cincinnati, 2007.
[28] MentorGraphics Inc. http://www.mentorgraphics.com.
[29] Prasun Bhattacharya. Comparison of Single-Port and Multi-Port NoCs with
Contemporary Buses on FPGAs. Master’s thesis, University of Cincinnati, 2006.
[30] N. Banerjee, P. Vellanki, and K. S. Chatha. A Power and Performance Model for
Network-on-Chip Architectures. In DATE 04: Proceedings of the conference on
Design, automation and test in Europe, 2004.
[31] J. Hu and R. Marculescu. Exploiting the Routing Flexibility for Energy/Performance
Aware Mapping of Regular NoC Architectures. In DATE’03, 2003.
[32] Mário P. Véstias and Horácio C. Neto. Co-Synthesis of a Configurable SoC Platform
based on a Network on Chip Architecture. In ASPDAC, 2006.
[33] Alukayode Arole. Power Profiling: An Incremental Power Analysis Technique for
FPGA-based Designs. Master’s thesis, University of Cincinnati, 2006.
[34] T. S. T. Mak et al. On-FPGA Communication Architectures and Design Factors. In
FPL’06, 2006.
[35] D. Wiklund and D. Liu. SoCBUS: switched network on chip for hard real time
embedded systems. In Parallel and Distributed Processing Symposium, 2003.
[36] Balasubramanian Sethuraman and Ranga Vemuri. optiMap: a tool for automated
generation of NoC architectures using multi-port routers for FPGAs. In Design,
Automation and Test in Europe, DATE ’06, 2006.
[37] A. Janarthanan et al. MoCReS: an Area-Efficient Multi Clock On-Chip Network for
Reconfigurable Systems. In IEEE Computer Society ISVLSI’07, 2007.
[38] T. Lei and S. Kumar. A two-step Genetic Algorithm for Mapping Task Graphs to a
Network on Chip Architecture. In Euromicro Symposium on Digital System Design,
2003.
[39] A. Jalabert et al. xpipes: a Latency Insensitive Parameterized Network-on-chip
Architecture For Multi-Processor SoCs. In ICCD, 2003.
[40] R. P. Dick et al. TGFF: Task Graphs for Free. In 6th International Workshop on
Hardware/Software Codesign, 1998.
[41] Davide Bertozzi et al. NoC Synthesis Flow for Customized Domain Specific
Multiprocessor Systems-on-Chip. In IEEE Transactions on Parallel and Distributed Systems,
2005.
[42] K. Srinivasan and K. Chatha. A low complexity heuristic for design of custom
network-on-chip architectures. In DATE 2006, 2006.
[43] International Sematech. International Technology Roadmap for Semiconductors, 2007
edition.
[44] http://www.opencores.org.
[45] Daniel Williams. Implementation of a Generic Two-Layer Network Interface for
FPGA based NoCs. Master’s thesis, University of Cincinnati, 2008.
[46] R. Gindin, I. Cidon, and I. Keidar. NoC-Based FPGA: Architecture and Routing. In
NOCS 2007, 2007.
[47] Balasubramanian Sethuraman and Ranga Vemuri. A Force-directed Approach for
Fast Generation of Efficient Multi-Port NoC Architectures. In VLSI Design, 2007.
[48] N. Kavaldjiev, G. J. M. Smit, and P. G. Jansen. A Virtual Channel Router for On-Chip
Networks. In SOC Conference, 2004.
[49] N. Kavaldjiev, G. J. M. Smit, and P. G. Jansen. Two Architectures for On-Chip Virtual
Channel Router. In PROGRESS Symposium on Embedded Systems, 2004.
[50] Theodore Marescaux et al. Networks on Chip as Hardware Components of an OS for
Reconfigurable Systems. In FPL’2003, 2003.
[51] C. A. Zeferino, M. E. Kreutz, and A. A. Susin. RASoC: a router soft-core for
networks-on-chip. In DATE’2004-Designer’s Forum, IEEE CS Press, pages 198–203,
2004.
[52] William J. Dally and Brian Towles. Principles and Practices of Interconnection
Networks. Morgan Kaufmann, 2003.
[53] K. Srinivasan and K. Chatha. A technique for low energy mapping and routing in
network-on-chip architectures. In ISLPED, 2005.
[54] P. T. Wolkotte, G. J. M. Smit, G. K. Rauwerda, and L. T. Smit. An Energy-Efficient
Reconfigurable Circuit Switched Network-on-Chip. In 19th IEEE International Parallel
and Distributed Processing Symposium (IPDPS05), 2005.
[55] C. R. Hilton. A Flexible Circuit-Switched Communication Network for FPGA-Based
SoC Design. MS thesis, Brigham Young University, 2005.
[56] S. Murali and G. D. Micheli. SUNMAP: A Tool for Automatic Topology Selection
and Generation for NoCs. In Proceedings of the ACM/IEEE Design Automation
Conference, 2004.
[57] J. Liu, L.-R. Zheng, and H. Tenhunen. A Circuit-Switched Network Architecture
for Network-on-chip. In Proceedings of the International Symposium on System-on-Chip,
2004.
[58] T. Bjerregaard and S. Mahadevan. A Survey of Research and Practices of
Network-on-Chip. In ACM Computing Surveys, 2006.
[59] J. Duato, A. Robles, F. Silla, and R. Beivide. A comparison of router architectures
for virtual cut-through and wormhole switching in a NOW environment. In International
Symposium on Parallel and Distributed Processing, 1999.
[60] R. Gindin, I. Cidon, and I. Keidar. NoC-Based FPGA: Architecture and Routing. In
International Symposium on Networks-on-Chip, 2007.
[61] A. Mello et al. Virtual channels in networks on chip: Implementation and evaluation
on HERMES NoC. In SBCCI’05, 2005.
[62] I. Kuon and J. Rose. Measuring the Gap Between FPGAs and ASICs. In FPGA’06,
2006.
[63] S. E. Lee and N. Bagherzadeh. Increasing the Throughput of an Adaptive Router in
Network-on-Chip (NoC). In CODES + ISSS’06, 2006.
[64] A. Reimer, A. Schulz, and W. Nebel. Modelling Macromodules for High-Level Dynamic
Power Estimation of FPGA-based Digital Designs. In ISLPED’06, 2006.
[65] R. Soares, I. S. Silva, and A. Azevedo. When Reconfigurable Architecture Meets
Network-on-Chip. In SBCCI’04, 2004.
[66] J. Xu, W. Wolf, J. Henkel, and S. Chakradhar. A design methodology for
application-specific networks-on-chip. In ACM Transactions on Embedded Computing Systems,
2006.
[67] T. Bartic et al. Network-on-Chip for Reconfigurable Systems: From High-Level Design
Down to Implementation. In FPL’04, 2004.
[68] Y. Zhang, J. Roivainen, and A. Mämmelä. A design methodology for application-specific
networks-on-chip. In DSD’06, 2006.
[69] P. Véstias and H. Neto. Area and Performance Optimization of a Generic
Network-on-Chip Architecture. In SBCCI’06, 2006.
[70] U. Y. Ogras et al. Communication Architecture Optimization: Making the Shortest
Path Shorter in Regular Networks-on-Chip. In DATE 06, 2006.
[71] C. A. Zeferino, F. G. M. E. Santo, and A. A. Susin. ParIS: A Parameterizable
Interconnect Switch for Networks-on-Chip. In SBCCI’04, 2004.
[72] U. Y. Ogras and R. Marculescu. It’s a Small World After All: NoC Performance
Optimization Via Long-Range Link Insertion. In IEEE Transactions on VLSI, 2006.
[73] S. L. Liu, P. Yiannacouras, and T. Suh. An FPGA Based Pentium in a Complete
Desktop System. In FPGA’07, 2007.
[74] A. Kumar et al. Express Virtual Channels: Towards the Ideal Interconnection
Fabric. In ISCA’07, 2007.
[75] E. Rijpkema et al. Trade offs in the design of a router with both guaranteed and
best-effort services for networks on chip. In DATE’03, 2003.
[76] S. Stergiou et al. A synthesis oriented design library for networks on chips. In
DATE’05, 2005.
[77] S. Murali et al. Bandwidth Constrained Mapping of Cores onto NoC Architectures.
In DATE’04, 2004.
[78] H. S. Wang et al. Orion: A Power-Performance Simulator for Interconnection Networks.
In Microarchitecture’02, 2002.
[79] U. Y. Ogras, J. Hu, and R. Marculescu. Key research problems in NoC design: a
holistic perspective. In International Workshop on Hardware/Software Codesign,
2005.
[80] H. G. Lee, U. Y. Ogras, R. Marculescu, and N. Chang. Design Space Exploration
and Prototyping for on-chip Multimedia Applications. In DAC’06, 2006.
[81] Y. Feng and D. P. Mehta. Heterogeneous Floorplanning for FPGAs. In VLSID’06,
2006.
[82] M. Wang, A. Ranjan, and S. Raje. Multi-million gate FPGA physical design challenges.
In ICCAD’03, 2003.
[83] J. Liu, L. Zheng, and H. Tenhunen. Global Routing for Multicast-Supporting TDM
Network-on-Chip. In SOC’04, 2004.
[84] S.Murali et. al. Designing Application Specific Networks on Chips with Floorplan
Information. In ICCAD’06, 2006.
[85] K.Poon, A.Yan, and J.E.Wilton. A Flexible Power Model for FPGAs. In FPL’02,
2002.
[86] J.Kim, C.Nicopoulos, and D.Park. A Gracefully Degrading and Energy-Efficient

Modular Router Architecture for On-Chip Networks. In ISCA’06, 2006.
[87] D.Wu, B.M.Hashimi, and M.T.Schmitz. Improving Routing Efficiency for Network-
on-Chip through Content-Aware Input Selection. In ASPDAC’06, 2006.
[88] Tong Li. Estimation of Power Consumption in Wormhole Routed Networks on Chip.
Master’s thesis, IMIT/LECS Stockholm, Sweden, 2005.
[89] K.Paulsson, M.Hubner, and J.Becker. Online Optimization of FPGA Power
Dissipation by Exploiting Runtime Adaptation of Communication Primitives. In
SBCCI’06, 2006.
[90] L.Shang, A.S.Kaviani, and K.Bathala. Dynamic Power Consumption in Virtex-II
FPGA Family. In FPGA’02, 2002.
[91] V.Deghalahal and T.Tuan. Methodology of High-Level Estimation of FPGA Power
Consumption. In FPGA’05, 2005.
[92] R.A.Shafik, P.Rosinger, and B.M.Al-Hashimi. MPEG-based Performance
Comparison between Network-on-Chip and AMBA MPSoC. In IEEE Workshop on Design
and Diagnostics of Electronic Systems, 2008.
[93] A.Laffely, J.Liang, P.Jain, W.Burleson, and R. Tessier. Adaptive systems on a chip
(aSoC) for low-power signal processing. In IEEE Signals, Systems and Computers,
2001.
[94] A. Kumar, S. Fernando, Yajun Ha, B. Mesman, and H. Corporaal. Multi-Processor
System-Level Synthesis for Multiple Applications on Platform FPGA. In IEEE FPL,
2007.
[95] Z.Lu, M.Liu, and A.Jantsch. Layered Switching for Networks on Chip. In
DAC’07, 2007.
