
Linköping Studies in Science and Technology

Dissertation No. 1130

Performance and Energy Efficient


Network-on-Chip Architectures
Sriram R. Vangal

Electronic Devices
Department of Electrical Engineering
Linköping University, SE-581 83 Linköping, Sweden
Linköping 2007

ISBN 978-91-85895-91-5
ISSN 0345-7524

Performance and Energy Efficient Network-on-Chip Architectures


Sriram R. Vangal
ISBN 978-91-85895-91-5

Copyright © Sriram R. Vangal, 2007

Linköping Studies in Science and Technology


Dissertation No. 1130
ISSN 0345-7524

Electronic Devices
Department of Electrical Engineering
Linköping University, SE-581 83 Linköping, Sweden
Linköping 2007
Author email: sriram.r.vangal@gmail.com

Cover Image
A chip microphotograph of the industry’s first programmable 80-tile teraFLOPS
processor, which is implemented in a 65-nm eight-metal CMOS technology.

Printed by LiU-Tryck, Linköping University


Linköping, Sweden, 2007
Abstract

The scaling of MOS transistors into the nanometer regime opens the
possibility for creating large Network-on-Chip (NoC) architectures containing
hundreds of integrated processing elements with on-chip communication. NoC
architectures, with structured on-chip networks, are emerging as a scalable and
modular solution to global communications within large systems-on-chip. NoCs
mitigate the emerging wire-delay problem and address the need for substantial
interconnect bandwidth by replacing today’s shared buses with packet-switched
router networks. With on-chip communication consuming a significant portion
of the chip power and area budgets, there is a compelling need for compact, low
power routers. While applications dictate the choice of the compute core, the
advent of multimedia applications, such as three-dimensional (3D) graphics and
signal processing, places stronger demands for self-contained, low-latency
floating-point processors with increased throughput. This work demonstrates
that a computational fabric built using optimized building blocks can provide
high levels of performance in an energy efficient manner. The thesis details an
integrated 80-Tile NoC architecture implemented in a 65-nm process
technology. The prototype is designed to deliver over 1.0TFLOPS of
performance while dissipating less than 100W.
This thesis first presents a six-port four-lane 57 GB/s non-blocking router
core based on wormhole switching. The router features double-pumped crossbar
channels and destination-aware channel drivers that dynamically configure
based on the current packet destination. This enables a 45% reduction in crossbar
channel area, a 23% reduction in overall router area, up to a 3.8X reduction in peak
channel power, and a 7.2% improvement in average channel power. In a 150-nm six-metal
CMOS process, the 12.2 mm2 router contains 1.9-million transistors and
operates at 1 GHz at 1.2-V supply.
We next describe a new pipelined single-precision floating-point multiply
accumulator core (FPMAC) featuring a single-cycle accumulation loop using
base 32 and internal carry-save arithmetic, with delayed addition techniques. A
combination of algorithmic, logic and circuit techniques enable multiply-
accumulate operations at speeds exceeding 3GHz, with single-cycle throughput.
This approach reduces the latency of dependent FPMAC instructions and
enables a sustained multiply-add result (2FLOPS) every cycle. The
optimizations allow the costly normalization step to be removed from the critical
accumulation loop and to be conditionally powered down, using dynamic sleep
transistors, during long accumulate operations, saving active and leakage power. In a
90-nm seven-metal dual-VT CMOS process, the 2 mm2 custom design contains
230-K transistors. Silicon achieves 6.2-GFLOPS of performance while
dissipating 1.2 W at 3.1 GHz, 1.3-V supply.
We finally present the industry's first single-chip programmable teraFLOPS
processor. The NoC architecture contains 80 tiles arranged as an 8×10 2D array
of floating-point cores and packet-switched routers, both designed to operate at
4 GHz. Each tile has two pipelined single-precision FPMAC units which feature
a single-cycle accumulation loop for high throughput. The five-port router
combines 100 GB/s of raw bandwidth with low fall-through latency under 1ns.
The on-chip 2D mesh network provides a bisection bandwidth of 2 Tera-bits/s.
The 15-FO4 design employs mesochronous clocking, fine-grained clock gating,
dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal
CMOS process, the 275 mm2 custom design contains 100-M transistors. The
fully functional first silicon achieves over 1.0TFLOPS of performance on a
range of benchmarks while dissipating 97 W at 4.27 GHz and 1.07-V supply.
It is clear that the realization of successful NoC designs requires well-balanced
decisions at all levels: architecture, logic, circuit and physical design. Our results
demonstrate that the NoC architecture successfully delivers on its promise of
greater integration, high performance, good scalability and high energy
efficiency.
Preface

This PhD thesis presents my research as of September 2007 at the Electronic
Devices group, Department of Electrical Engineering, Linköping University,
Sweden. I started working on building blocks critical to the success of NoC
designs: first on crossbar routers, followed by research into floating-point MAC
cores. I finally integrated the blocks into a large monolithic 80-tile teraFLOPS NoC
multiprocessor. The following five publications are included in the thesis:

• Paper 1 - S. Vangal, N. Borkar and A. Alvandpour, “A Six-Port 57GB/s
Double-Pumped Non-blocking Router Core”, 2005 Symposium on VLSI
Circuits, Digest of Technical Papers, June 16-18, 2005, pp. 268–269.

• Paper 2 - S. Vangal, Y. Hoskote, D. Somasekhar, V. Erraguntla, J.
Howard, G. Ruhl, V. Veeramachaneni, D. Finan, S. Mathew and N.
Borkar, “A 5 GHz floating point multiply-accumulator in 90 nm dual VT
CMOS”, ISSCC Digest of Technical Papers, Feb. 2003, pp. 334–335.

• Paper 3 - S. Vangal, Y. Hoskote, N. Borkar and A. Alvandpour, “A 6.2
GFLOPS Floating Point Multiply-Accumulator with Conditional
Normalization”, IEEE Journal of Solid-State Circuits, Volume 41, Issue
10, Oct. 2006, pp. 2314–2323.

• Paper 4 - S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J.
Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman,
Y. Hoskote and N. Borkar, “An 80-Tile 1.28TFLOPS Network-on-Chip in
65nm CMOS”, ISSCC Digest of Technical Papers, Feb. 2007, pp. 98–99.

• Paper 5 - S. Vangal, A. Singh, J. Howard, S. Dighe, N. Borkar and A.
Alvandpour, “A 5.1GHz 0.34mm2 Router for Network-on-Chip
Applications”, 2007 Symposium on VLSI Circuits, Digest of Technical
Papers, June 14-16, 2007, pp. 42–43.

As a staff member of the Microprocessor Technology Laboratory at Intel
Corporation, Hillsboro, OR, USA, I am also involved in research work, with
several publications not discussed as part of this thesis:

• J. Tschanz, N. Kim, S. Dighe, J. Howard, G. Ruhl, S. Vangal, S.
Narendra, Y. Hoskote, H. Wilson, C. Lam, M. Shuman, C. Tokunaga, D.
Somasekhar, D. Finan, T. Karnik, N. Borkar, N. Kurd, V. De, “Adaptive
Frequency and Biasing Techniques for Tolerance to Dynamic
Temperature-Voltage Variations and Aging”, ISSCC Digest of Technical
Papers, Feb. 2007, pp. 292–293.

• S. Narendra, J. Tschanz, J. Hofsheier, B. Bloechel, S. Vangal, Y.
Hoskote, S. Tang, D. Somasekhar, A. Keshavarzi, V. Erraguntla, G.
Dermer, N. Borkar, S. Borkar, and V. De, “Ultra-low voltage circuits and
processor in 180nm to 90nm technologies with a swapped-body biasing
technique”, ISSCC Digest of Technical Papers, Feb. 2004, pp. 156–157.

• Y. Hoskote, B. Bloechel, G. Dermer, V. Erraguntla, D. Finan, J. Howard,
D. Klowden, S. Narendra, G. Ruhl, J. Tschanz, S. Vangal, V.
Veeramachaneni, H. Wilson, J. Xu, and N. Borkar, ”A TCP offload
accelerator for 10Gb/s Ethernet in 90-nm CMOS”, IEEE Journal of Solid-
State Circuits, Volume 38, Issue 11, Nov. 2003, pp. 1866–1875.

• Y. Hoskote, B. Bloechel, G. Dermer, V. Erraguntla, D. Finan, J. Howard,
D. Klowden, S. Narendra, G. Ruhl, J. Tschanz, S. Vangal, V.
Veeramachaneni, H. Wilson, J. Xu, and N. Borkar, “A 10GHz TCP
offload accelerator for 10Gb/s Ethernet in 90nm dual-VT CMOS”, ISSCC
Digest of Technical Papers, Feb 2003, pp. 258–259.

• S. Vangal, M. Anders, N. Borkar, E. Seligman, V. Govindarajulu, V.
Erraguntla, H. Wilson, A. Pangal, V. Veeramachaneni, J. Tschanz, Y. Ye,
D. Somasekhar, B. Bloechel, G. Dermer, R. Krishnamurthy, K.
Soumyanath, S. Mathew, S. Narendra, M. Stan, S. Thompson, V. De and
S. Borkar, “5-GHz 32-bit integer execution core in 130-nm dual-VT
CMOS”, IEEE Journal of Solid-State Circuits, Volume 37, Issue 11,
Nov. 2002, pp. 1421–1432.

• T. Karnik, S. Vangal, V. Veeramachaneni, P. Hazucha, V. Erraguntla and
S. Borkar, “Selective node engineering for chip-level soft error rate
improvement [in CMOS]”, 2002 Symposium on VLSI Circuits, Digest of
Technical Papers, June 2002, pp. 204-205.

• S. Vangal, M. Anders, N. Borkar, E. Seligman, V. Govindarajulu, V.
Erraguntla, H. Wilson, A. Pangal, V. Veeramachaneni, J. Tschanz, Y. Ye,
D. Somasekhar, B. Bloechel, G. Dermer, R. Krishnamurthy, K.
Soumyanath, S. Mathew, S. Narendra, M. Stan, S. Thompson, V. De and
S. Borkar, “A 5GHz 32b integer-execution core in 130nm dual-VT
CMOS”, ISSCC Digest of Technical Papers, Feb. 2002, pp. 412–413.

• S. Narendra, M. Haycock, V. Govindarajulu, V. Erraguntla, H. Wilson, S.
Vangal, A. Pangal, E. Seligman, R. Nair, A. Keshavarzi, B. Bloechel, G.
Dermer, R. Mooney, N. Borkar, S. Borkar, and V. De, “1.1 V 1 GHz
communications router with on-chip body bias in 150 nm CMOS”, ISSCC
Digest of Technical Papers, Feb. 2002, Volume 1, pp. 270–466.

• R. Nair, N. Borkar, C. Browning, G. Dermer, V. Erraguntla, V.
Govindarajulu, A. Pangal, J. Prijic, L. Rankin, E. Seligman, S. Vangal
and H. Wilson, “A 28.5 GB/s CMOS non-blocking router for terabit/s
connectivity between multiple processors and peripheral I/O nodes”,
ISSCC Digest of Technical Papers, Feb. 2001, pp. 224 – 225.

I have also co-authored one book chapter:

• Yatin Hoskote, Sriram Vangal, Vasantha Erraguntla and Nitin Borkar,
“A High-Speed, Multithreaded TCP Offload Engine for 10 Gb/s
Ethernet”, chapter 5 in “Network Processor Design, Issues and Practices,
Volume 3”, Elsevier, February 2005, ISBN: 978-0-12-088476-6.

I have the following patents issued with several patent applications pending:

• Sriram Vangal, “Integrated circuit interconnect routing using double
pumped circuitry”, patent #6791376.

• Sriram Vangal and Howard A. Wilson, “Method and apparatus for
driving data packets”, patent #6853644.

• Sriram Vangal and Dinesh Somasekhar, “Pipelined 4-2 compressor
circuit”, patent #6701339.

• Sriram Vangal and Dinesh Somasekhar, “Flip-flop circuit”, patent #
6459316.

• Yatin Hoskote, Sriram Vangal and Jason Howard, “Fast single precision
floating point accumulator using base 32 system”, patent # 6988119.

• Amaresh Pangal, Dinesh Somasekhar, Shekhar Borkar and Sriram
Vangal, “Floating point multiply accumulator”, patent # 7080111.

• Amaresh Pangal, Dinesh Somasekhar, Sriram Vangal and Yatin
Hoskote, “Floating point adder”, patent #6889241.

• Bhaskar Chatterjee, Steven Hsu, Sriram Vangal, Ram Krishnamurthy,
“Leakage tolerant register file”, patent #7016239.

• Sriram Vangal, Matthew B. Haycock, Stephen R. Mooney, “Biased
control loop circuit for setting impedance of output driver”, patent
#6424175.

• Sriram Vangal, “Weak current generation”, patent #6791376.

• Sriram Vangal and Gregory Ruhl, “A scalable skew tolerant tileable
clock distribution scheme with low duty cycle variation”, pending patent.

• Sriram Vangal and Arvind Singh, “A compact crossbar router with
distributed arbitration scheme”, pending patent.
Contributions

The main contributions of this dissertation are:

• A single-chip 80-tile sub-100W NoC architecture with TeraFLOPS (one
trillion floating point operations per second) of performance.
• A 2D on-die mesh network with a bisection bandwidth > 2 Tera-bits/s.
• A tile-able high-speed low-power differential mesochronous clocking
scheme with low duty-cycle variation, applicable to large NoCs.
• A 5+GHz 5-port 100GB/s router architecture with phase-tolerant
mesochronous links and sub-ns fall-through latency.
• Implementation of a shared double-pumped crossbar switch enabling up
to 50% smaller crossbar and a compact 0.34mm2 router layout (65nm).
• A destination-aware CMOS driver circuit that dynamically configures
based on the current packet destination for peak router power reduction.
• A new single-cycle floating-point MAC algorithm with just 15 FO4
stages in the critical path, capable of sustained multiply-add result (2
FLOPS) every clock cycle (200ps).
• An FPMAC implementation with a conditional normalization technique
that opportunistically saves active and leakage power during long
accumulate operations.
• A combination of fine-grained power management techniques for active
and standby leakage power reduction.
• Extensive measured silicon data from the TeraFLOPS NoC processor and
key building blocks spanning three CMOS technology generations (150-
nm, 90-nm and 65-nm).

Abbreviations

AC Alternating Current
ASIC Application Specific Integrated Circuit
CAD Computer Aided Design
CMOS Complementary Metal-Oxide-Semiconductor
CMP Chip Multi-Processor
DC Direct Current
DRAM Dynamic Random Access Memory
DSM Deep SubMicron
FF Flip-Flop
FLOPS Floating-Point Operations per second
FP Floating-Point
FPADD Floating-Point Addition
FPMAC Floating-Point Multiply Accumulator
FPU Floating-Point Unit
FO4 Fan-Out-of-4


GB/s Gigabyte per second


IEEE The Institute of Electrical and Electronics Engineers
ILD Inter-Layer Dielectric
I/O Input Output
ITRS International Technology Roadmap for Semiconductors
LZA Leading Zero Anticipator
MAC Multiply-Accumulate
MOS Metal-Oxide-Semiconductor
MOSFET Metal-Oxide-Semiconductor Field Effect Transistor
MSFF Master-Slave Flip-Flop
MUX Multiplexer
NMOS N-channel Metal-Oxide-Semiconductor
NoC Network-on-Chip
PCB Printed Circuit Board
PLL Phase Locked Loop
PMOS P-channel Metal-Oxide-Semiconductor
RC Resistance-Capacitance
SDFF Semi-Dynamic Flip-Flop
SoC System-on-a-Chip
TFLOPS TeraFLOPS (one Trillion Floating Point Operations per Second)
VLSI Very-Large Scale Integration
VR Voltage Regulator
Acknowledgments

I would like to express my sincere gratitude to my principal advisor and
supervisor Prof. Atila Alvandpour, for motivating me and providing the
opportunity to embark on research at Linköping University, Sweden. I also
thank Lic. Eng. Martin Hansson for generously assisting me whenever I needed
help. I thank Anna Folkeson and Arta Alvandpour at LiU for their assistance.

I am thankful for support from all of my colleagues at Microprocessor
Technology Labs, Intel Corporation. Special thanks to my manager, Nitin
Borkar, and lab director Matthew Haycock, for their support; Shekhar Borkar,
Joe Schutz and Justin Rattner for their technical leadership and for funding this
research. I also thank Dr. Yatin Hoskote, Dr. Vivek De, Dr. Dinesh
Somasekhar, Dr. Tanay Karnik, Dr. Siva Narendra, Dr. Ram Krishnamurthy, Dr.
Keith Bowman (for his exceptional reviewing skills), Dr. Sanu Mathew, Dr.
Muhammad Khellah and Jim Tschanz for numerous technical discussions. I am
also grateful to the silicon design and prototyping team in both Oregon, USA
and Bangalore, India: Howard Wilson, Jason Howard, Greg Ruhl, Saurabh
Dighe, Arvind Singh, Venkat Veeramachaneni, Amaresh Pangal, Venki
Govindarajulu, David Finan, Priya Iyer, Tiju Jacob, Shailendra Jain, Sriram
Venkataraman and all mask designers for providing an ideal research and
prototyping environment for making silicon realization of the research ideas
possible. Clark Roberts and Paolo Aseron deserve credit for their invaluable
support in the lab with evaluation board design and silicon testing.

All of this was made possible by the love and encouragement from my
family. I am indebted to my inspirational grandfather Sri. V. S. Srinivasan (who
passed away recently), and my parents Sri. Ranganatha and Smt. Chandra, for
their constant motivation from the other side of the world. I sincerely thank
Prof. K. S. Srinath, for being my mentor since high school. Without his
visionary guidance and support, my career dreams would have remained
unfulfilled. This thesis is dedicated to all of them.

Finally, I am exceedingly thankful for the patience, love and support from
my soul mate: Reshma, and remarkable cooperation from my son, Pratik ☺.

Sriram Vangal
Linköping, September 2007
Contents

Abstract iii

Preface v

Contributions ix

Abbreviations xi

Acknowledgments xiii

List of Figures xix

Part I Introduction 1
Chapter 1 Introduction 3
1.1 The Microelectronics Era and Moore’s Law.................................... 4
1.2 Low Power CMOS Technology ....................................................... 5
1.2.1 CMOS Power Components........................................................ 5
1.3 Technology scaling trends and challenges ....................................... 7
1.3.1 Technology scaling: Impact on Power ...................................... 8
1.3.2 Technology scaling: Impact on Interconnects ......................... 11
1.3.3 Technology scaling: Summary ................................................ 14


1.4 Motivation and Scope of this Thesis .............................................. 15


1.5 Organization of this Thesis............................................................. 16
1.6 References....................................................................................... 17

Part II NoC Building Blocks 21


Chapter 2 On-Chip Interconnection Networks 23
2.1 Interconnection Network Fundamentals......................................... 24
2.1.1 Network Topology................................................................... 25
2.1.2 Message Switching Protocols .................................................. 26
2.1.3 Virtual Lanes............................................................................ 27
2.1.4 A Generic Router Architecture................................................ 28
2.2 A Six-Port 57GB/s Crossbar Router Core...................................... 29
2.2.1 Introduction.............................................................................. 30
2.2.2 Router Micro-architecture ....................................................... 30
2.2.3 Packet structure and routing protocol...................................... 32
2.2.4 FIFO buffer and flow control .................................................. 33
2.2.5 Router Design Challenges ....................................................... 34
2.2.6 Double-pumped Crossbar Channel.......................................... 35
2.2.7 Channel operation and timing.................................................. 36
2.2.8 Location-based Channel Driver............................................... 37
2.2.9 Chip Results............................................................................. 39
2.3 Summary and future work .............................................................. 42
2.4 References....................................................................................... 42
Chapter 3 Floating-point Units 45
3.1 Introduction to floating-point arithmetic ........................................ 45
3.1.1 Challenges with floating-point addition and accumulation..... 46
3.1.2 Scheduling issues with pipelined FPUs................................... 48
3.2 Single-cycle Accumulation Algorithm........................................... 49
3.3 CMOS prototype............................................................................. 51
3.3.1 High-performance Flip-Flop Circuits...................................... 51
3.3.2 Test Circuits............................................................................. 52
3.4 References....................................................................................... 54

Part III An 80-Tile TeraFLOPS NoC 57


Chapter 4 An 80-Tile Sub-100W TeraFLOPS NoC Processor in 65-nm
CMOS 59
4.1 Introduction..................................................................................... 59

4.2 Goals of the TeraFLOPS NoC Processor ....................................... 61


4.3 NoC Architecture............................................................................ 62
4.4 FPMAC Architecture...................................................................... 63
4.4.1 Instruction Set .......................................................................... 64
4.4.2 NoC Packet Format.................................................................. 65
4.4.3 Router Architecture ................................................................. 66
4.4.4 Mesochronous Communication ............................................... 69
4.4.5 Router Interface Block (RIB) .................................................. 69
4.5 Design details.................................................................................. 70
4.6 Experimental Results...................................................................... 74
4.7 Applications for the TeraFLOP NoC Processor............................. 84
4.8 Conclusion ...................................................................................... 85
Acknowledgement ................................................................................... 85
4.9 References....................................................................................... 85
Chapter 5 Conclusions and Future Work 89
5.1 Conclusions..................................................................................... 89
5.2 Future Work.................................................................................... 91
5.3 References....................................................................................... 92

Part IV Papers 95
Chapter 6 Paper 1 97
6.1 Introduction..................................................................................... 98
6.2 Double-pumped Crossbar Channel with LBD ............................. 100
6.3 Comparison Results ...................................................................... 102
6.4 Acknowledgments ........................................................................ 103
6.5 References..................................................................................... 104
Chapter 7 Paper 2 105
7.1 FPMAC Architecture.................................................................... 106
7.2 Proposed Accumulator Algorithm................................................ 107
7.3 Overflow detection and Leading Zero Anticipation..................... 109
7.4 Simulation results ......................................................................... 111
7.5 Acknowledgments ........................................................................ 112
7.6 References..................................................................................... 113
Chapter 8 Paper 3 115
8.1 Introduction................................................................................... 116

8.2 FPMAC Architecture.................................................................... 118


8.2.1 Accumulator algorithm.......................................................... 119
8.2.2 Overflow prediction in carry-save ......................................... 123
8.2.3 Leading zero anticipation in carry-save................................. 125
8.3 Conditional normalization pipeline .............................................. 127
8.4 Design details................................................................................ 129
8.5 Experimental Results.................................................................... 132
8.6 Conclusions................................................................................... 136
8.7 Acknowledgments ........................................................................ 137
8.8 References..................................................................................... 138
Chapter 9 Paper 4 141
9.1 NoC Architecture.......................................................................... 142
9.2 FPMAC Architecture.................................................................... 143
9.3 Router Architecture ...................................................................... 144
9.4 NoC Power Management.............................................................. 146
9.5 Experimental Results.................................................................... 148
9.6 Acknowledgments ........................................................................ 149
9.7 References..................................................................................... 150
Chapter 10 Paper 5 151
10.1 Introduction................................................................................... 152
10.2 NoC Architecture and Design....................................................... 153
10.3 Experimental Results.................................................................... 157
10.4 Acknowledgments ........................................................................ 158
10.5 References..................................................................................... 158
List of Figures

Figure 1-1: Transistor count per Integrated Circuit. .................................................................. 4


Figure 1-2: Basic CMOS gate showing dynamic and short-circuit currents.............................. 6
Figure 1-3: Main leakage components for a MOS transistor. .................................................... 7
Figure 1-4: Power trends for Intel microprocessors................................................................... 8
Figure 1-5: Dynamic and Static power trends for Intel microprocessors .................................. 9
Figure 1-6: Sub-threshold leakage current as a function of temperature. .................................. 9
Figure 1-7: Three circuit level leakage reduction techniques. ................................................. 10
Figure 1-8: Intel 65 nm, 8-metal copper CMOS technology in 2004. ..................................... 11
Figure 1-9: Projected fraction of chip reachable in one cycle with an 8 FO4 clock period [20].
Figure 1-10: Current and projected relative delays for local and global wires and for logic
gates in nanometer technologies [25]....................................................................................... 12
Figure 1-11: (a) Traditional bus-based communication, (b) An on-chip point-to-point network.
Figure 1-12: Homogenous NoC Architecture. ......................................................................... 14
Figure 2-1: Direct and indirect networks ................................................................................. 24
Figure 2-2: Popular on-chip fabric topologies ......................................................................... 25
Figure 2-3: A fully connected non-blocking crossbar switch. ................................................. 26
Figure 2-4: Reduction in header blocking delay using virtual channels. ................................. 28
Figure 2-5: Canonical router architecture. ............................................................................... 29
Figure 2-6: Six-port four-lane router block diagram................................................................ 31
Figure 2-7: Router data and control pipeline diagram. ............................................................ 32
Figure 2-8: Router packet and FLIT format............................................................................. 33
Figure 2-9: Flow control between routers. ............................................................................... 34
Figure 2-10: Crossbar router in [19] (a) Internal lane organization. (b) Corresponding layout.
(c) Channel interconnect structure. .......................................................................................... 34
Figure 2-11: Double-pumped crossbar channel. ...................................................................... 36


Figure 2-12: Crossbar channel timing diagram........................................................................ 37


Figure 2-13: Location-based channel driver and control logic. ............................................... 38
Figure 2-14: (a) LBD Schematic. (b) LBD encoding summary............................................... 38
Figure 2-15: Double-pumped crossbar channel results compared to work reported in [19].... 39
Figure 2-16: Simulated crossbar channel propagation delay as function of port distance....... 40
Figure 2-17: LBD peak current as function of port distance.................................................... 41
Figure 2-18: Router layout and characteristics. ....................................................................... 41
Figure 3-1: IEEE single precision format. The exponent is biased by 127.............................. 46
Figure 3-2: Single-cycle FP adder with critical blocks high-lighted. ...................................... 47
Figure 3-3: FPUs optimized for (a) FPMADD (b) FPMAC instruction. ................................. 48
Figure 3-4: (a) Five-stage FP adder [11] (b) Six-stage CELL FPU [10]. ................................ 49
Figure 3-5: Single-cycle FPMAC algorithm. ........................................................................... 50
Figure 3-6: Semi-dynamic resetable flip flop with selectable pulse width. ............................. 52
Figure 3-7: Block diagram of FPMAC core and test circuits. ................................................. 52
Figure 3-8: 32-entry x 32b dual-VT optimized register file. .................................................... 53
Figure 4-1: NoC architecture.................................................................................................... 60
Figure 4-2: NoC block diagram and tile architecture............................................................... 62
Figure 4-3: FPMAC 9-stage pipeline with single-cycle accumulate loop. .............................. 63
Figure 4-4: NoC protocol: packet format and FLIT description.............................................. 65
Figure 4-5: Five-port two-lane shared crossbar router architecture. ........................................ 67
Figure 4-6: (a) Double-pumped crossbar switch schematic. (b) Area benefit over work in [11].
Figure 4-7: Phase-tolerant mesochronous interface and timing diagram................................. 69
Figure 4-8: Semi-dynamic flip-flop (SDFF) schematic. .......................................................... 70
Figure 4-9: (a) Global mesochronous clocking and (b) simulated clock arrival times. ........... 71
Figure 4-10: Router and on-die network power management.................................................. 72
Figure 4-11: (a) FPMAC pipelined wakeup diagram and simulated peak current reduction and
(b) state-retentive memory clamp circuit. ................................................................................ 73
Figure 4-12: Full-Chip and tile micrograph and characteristics. ............................................. 74
Figure 4-13: (a) Package die-side. (b) Land-side. (c) Evaluation board. ................................. 75
Figure 4-14: Measured chip FMAX and peak performance. ...................................................... 76
Figure 4-15: Measured chip power for stencil application. ..................................................... 78
Figure 4-16: Measured chip energy efficiency for stencil application..................................... 79
Figure 4-17: Estimated (a) Tile power profile (b) Communication power breakdown. .......... 80
Figure 4-18: Measured global clock distribution waveforms. ................................................. 81
Figure 4-19: Measured global clock distribution power. ......................................................... 81
Figure 4-20: Measured chip leakage power as percentage of total power vs. Vcc. A 2X
reduction is obtained by turning off sleep transistors. ............................................................. 82
Figure 4-21: On-die network power reduction benefit............................................................. 82
Figure 4-22: Measured IMEM virtual ground waveform slowing transition to and from sleep.
Figure 4-23: Measurement setup.............................................................................................. 84
Figure 6-1: Six-port four-lane router block diagram................................................................ 99
Figure 6-2: Router data and control pipeline diagram. ............................................................ 99
Figure 6-3: Double-pumped crossbar channel with location based driver (LBD). ................ 101
Figure 6-4: Crossbar channel timing diagram........................................................................ 101
Figure 6-5: (a) LBD schematic (b) Peak current per LBD vs. port distance.......................... 102

Figure 6-6: Router layout and characteristics. ....................................................................... 103


Figure 7-1: FPMAC pipe stages and organization. ................................................................ 107
Figure 7-2: Floating point accumulator mantissa datapath. ................................................... 108
Figure 7-3: Floating point accumulator exponent logic. ........................................................ 109
Figure 7-4: Toggle detection circuit....................................................................................... 110
Figure 7-5: Post-normalization pipeline diagram................................................................... 111
Figure 7-6: Chip layout and process characteristics............................................................... 112
Figure 7-7: Simulated FPMAC frequency vs. supply voltage. .............................................. 112
Figure 8-1: Conventional single-cycle floating-point accumulator with critical blocks high-
lighted..................................................................................................................................... 118
Figure 8-2: FPMAC pipe stages and organization. ................................................................ 119
Figure 8-3: Conversion to base 32. ........................................................................................ 120
Figure 8-4: Accumulator mantissa datapath........................................................................... 121
Figure 8-5: Accumulator algorithm behavior with examples. ............................................... 122
Figure 8-6: Accumulator exponent logic. .............................................................................. 122
Figure 8-7: Example of overflow prediction in carry-save. ................................................... 124
Figure 8-8: Toggle detection circuit and overflow prediction. .............................................. 124
Figure 8-9: Leading-zero anticipation in carry-save. ............................................................. 126
Figure 8-10: Sleep enabled conditional normalization pipeline diagram. ............................. 127
Figure 8-11: Sparse-tree adder. Highlighted path shows generation of every 16th carry...... 128
Figure 8-12: Accumulator register flip-flop circuit and layout bit-slice. ............................... 129
Figure 8-13: (a) Simulated worst-case bounce on virtual ground. (b) Normalization pipeline
layout showing sleep devices. (c) Sleep device insertion into power grid............................ 130
Figure 8-14: Block diagram of FPMAC core and test circuits. ............................................. 131
Figure 8-15: Chip clock generation and distribution. ............................................................ 133
Figure 8-16: Die photograph and chip characteristics. .......................................................... 133
Figure 8-17: Measurement setup............................................................................................ 134
Figure 8-18: Measured FPMAC frequency and power Vs. Vcc. ............................................ 134
Figure 8-19: Measured maximum frequency Vs. Vcc. ........................................................... 135
Figure 8-20: Measured switching and leakage power at 85°C............................................... 136
Figure 8-21: Conditional normalization and total FPMAC power for various activation rates
(N). ......................................................................................................................................... 137
Figure 9-1: NoC block diagram and tile architecture............................................................. 143
Figure 9-2: FPMAC 9-stage pipeline with single-cycle accumulate loop. ............................ 144
Figure 9-3: Shared crossbar router with double-pumped crossbar switch. ............................ 145
Figure 9-4: Global mesochronous clocking and simulated clock arrival times. .................... 146
Figure 9-5: FPMAC pipelined wakeup diagram and state-retentive memory clamp circuit. 147
Figure 9-6: Estimated frequency and power versus Vcc, and power efficiency with 80 tiles
(N) active................................................................................................................................ 148
Figure 9-7: Full-Chip and tile micrograph and characteristics. ............................................. 149
Figure 10-1: Communication fabric for 80-tile NoC. ............................................................ 153
Figure 10-2: Five-port two-lane shared crossbar router architecture. .................................... 154
Figure 10-3: NoC protocol: packet format and FLIT description.......................................... 154
Figure 10-4: (a) Double-pumped crossbar switch schematic. (b) Area benefit over work in [2].
Figure 10-5: Phase-tolerant mesochronous interface and timing diagram............................ 156
Figure 10-6: (a) Semi-dynamic flip-flop (SDFF) schematic. (b) Power management. ......... 156

Figure 10-7: (a) Measured router FMAX and power Vs. Vcc. (b) Power reduction benefit. .... 157
Figure 10-8: (a) Communication power breakdown. (b) Die micrograph of individual tile. 158
Part I

Introduction

Chapter 1

Introduction

For four decades, the semiconductor industry has surpassed itself by the
unparalleled pace of improvement in its products and has transformed the world
that we live in. The remarkable characteristic of transistors that fuels this rapid
growth is that their speed increases and their cost decreases as their size is
reduced. Modern day high-performance Integrated Circuits (ICs) have more than
one billion transistors. Today, the 65 nanometer (nm) Complementary Metal
Oxide Semiconductor (CMOS) technology is in high volume manufacturing and
industry has already demonstrated fully functional static random access memory
(SRAM) chips using a 45nm process. This scaling of CMOS Very Large Scale
Integration (VLSI) technology is driven by Moore’s law. CMOS has been the
driving force behind high-performance ICs. The attractive scaling properties of
CMOS, coupled with low power, high speed and noise margins, reliability,
wider temperature and voltage operation range, overall circuit and layout
implementation and manufacturing ease, have made it the technology of choice
for digital ICs. CMOS also enables monolithic integration of both analog and
digital circuits on the same die, and shows great promise for future System-on-a-
Chip (SoC) implementations.
This chapter reviews trends in Silicon CMOS technology and highlights the
main challenges involved in keeping up with Moore’s law. CMOS device

scaling increases transistor sub-threshold leakage. Interconnect scaling, coupled
with higher operating frequencies, requires careful parasitic extraction and
modeling. Power delivery and dissipation are fast becoming limiting factors in
product design.

1.1 The Microelectronics Era and Moore’s Law


There is hardly any other area of industry which has developed as fast as the
semiconductor industry in the last 40 years. Within this relatively short period,
microelectronics has become the key technology enabler for several industries
like information technology, telecommunication, medical equipment and
consumer electronics [1].
In 1965, Intel co-founder Gordon Moore observed that the total number of
devices on a chip doubled every 12 months [2]. He predicted that the trend
would continue in the 1970s but would slow in the 1980s, when the total number
of devices would double every 24 months. Known widely as “Moore’s Law,”
these observations made the case for continued wafer and die size growth, defect
density reduction, and increased transistor density as manufacturing matured and
technology scaled. Figure 1-1 plots the growth in the number of transistors per IC and
shows that the transistor count has indeed doubled every 24 months [3].

Figure 1-1: Transistor count per Integrated Circuit.


During these years, the die size increased at 7% per year, while the operating
frequency of leading microprocessors has doubled every 24 months [4], and is
well into the GHz range. Figure 1-1 also shows that the integration density of
memory ICs is consistently higher than logic chips. Memory circuits are highly
regular, allowing better integration with much less interconnect overhead.

1.2 Low Power CMOS Technology


The idea of the CMOS Field Effect Transistor (FET) was first introduced by
Wanlass and Sah [5]. By the 1980s, CMOS was widely acknowledged as the
dominant technology for microprocessors, memories and application-specific
integrated circuits (ASICs), owing to its favorable properties over other IC
technologies. The biggest advantage of CMOS over NMOS and bipolar
technology is its significantly reduced power dissipation, since a CMOS circuit
has almost no static (DC) power dissipation.
1.2.1 CMOS Power Components
In order to understand the evolution of CMOS as one of the most popular
low-power design approaches, we first examine the sources of power dissipation
in a digital CMOS circuit. The total power consumed by a static CMOS circuit
consists of three components, and is given by the following expression [6]:
$P_{total} = P_{dynamic} + P_{short\text{-}circuit} + P_{static}$  (1.1)
Pdynamic represents the dynamic or switching power, i.e., the power dissipated
in charging and discharging the physical load capacitance contributed by fan-out
gate loading, interconnect loading, and parasitic capacitances at the CMOS gate
outputs. CL represents this capacitance, lumped together as shown in Figure 1-2.
The dynamic power is given by Eq. 1.2, where fclk is the clock frequency with
which the gate switches, Vdd is the power supply, and α is the switching activity
factor, which determines how frequently the output switches per clock-cycle.
$P_{dynamic} = \alpha \cdot f_{clk} \cdot C_L \cdot V_{dd}^2$  (1.2)
Pshort-circuit represents the short-circuit power, i.e., the power dissipated in the
momentary conducting path between Vdd and ground during switching. This
short-circuit current (Isc) arises when
both PMOS and NMOS transistors are simultaneously active, conducting current
directly from Vdd to ground for a short period of time. Equation 1.3 describes the
short-circuit power dissipation for a simple CMOS inverter [7], where β is the
gain factor of the transistors, τ is the input rise/fall time, and VT is the transistor
threshold voltage (assumed to be the same for both PMOS and NMOS).
$P_{short\text{-}circuit} = \dfrac{\beta}{12}\,(V_{dd} - 2V_T)^3 \cdot \tau \cdot f_{clk}$  (1.3)
Figure 1-2: Basic CMOS gate showing dynamic and short-circuit currents.
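As a small numeric illustration of Eqs. 1.1–1.3, the sketch below plugs in arbitrary illustrative values for a single gate; the parameter choices are assumptions for this example, not measurements from this thesis:

```python
# Numeric sketch of the CMOS power components in Eqs. 1.1-1.3.
# All parameter values are illustrative assumptions, not thesis data.
alpha    = 0.1      # switching activity factor
f_clk    = 1e9      # clock frequency (Hz)
C_L      = 100e-15  # lumped load capacitance (F)
V_dd     = 1.2      # supply voltage (V)
beta     = 1e-3     # transistor gain factor (A/V^2), assumed
tau      = 50e-12   # input rise/fall time (s)
V_T      = 0.4      # threshold voltage (V)
P_static = 5e-6     # leakage power (W), assumed

P_dynamic = alpha * f_clk * C_L * V_dd ** 2                    # Eq. 1.2
P_short   = (beta / 12) * (V_dd - 2 * V_T) ** 3 * tau * f_clk  # Eq. 1.3
P_total   = P_dynamic + P_short + P_static                     # Eq. 1.1

print(f"P_dynamic = {P_dynamic * 1e6:.2f} uW")  # ~14.40 uW
print(f"P_short   = {P_short * 1e6:.2f} uW")    # ~0.27 uW
print(f"P_total   = {P_total * 1e6:.2f} uW")    # ~19.67 uW
```

With these numbers the dynamic term dominates, while the short-circuit term stays small as long as the input rise/fall time τ is short relative to the clock period.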
It is important to note that the switching component of power is independent of
the rise/fall times at the input of logic gates, but short-circuit power depends on
the input signal slope. Short-circuit currents can be significant when the rise/fall
times at the input of the gate are much longer than the output rise/fall times.
The third component of power, Pstatic, is due to leakage currents and is
determined by fabrication technology considerations. It consists of (1)
source/drain junction leakage current; (2) gate direct tunneling leakage; and (3)
sub-threshold leakage through the channel of an OFF transistor; these
components are summarized in Figure 1-3.
(1) The junction leakage (I1) occurs from the source or drain to the substrate
through the reverse-biased diodes when a transistor is OFF. The magnitude of
the diode’s leakage current depends on the area of the drain diffusion and the
leakage current density, which, is in turn, determined by the process technology.
(2) The gate direct tunneling leakage (I2) flows from the gate thru the “leaky”
oxide insulation to the substrate. Its magnitude increases exponentially with the
gate oxide thickness and supply voltage. According to the 2005 International
Technology Roadmap for Semiconductors [8], high-K gate dielectric is required
to control this direct tunneling current component of the leakage current.
(3) The sub-threshold current is the drain-source current (I3) of an OFF
transistor. This is due to the diffusion current of the minority carriers in the
channel for a MOS device operating in the weak inversion mode (i.e., the sub-threshold
region). For instance, in the case of an inverter with a low input
voltage, the NMOS is turned OFF and the output voltage is high. Even with VGS
at 0V, there is still a current flowing in the channel of the OFF NMOS transistor
since VDS = Vdd.
temperature, supply voltage, device size, and the process parameters, out of
which the threshold voltage (VT) plays a dominant role. In today’s CMOS
technologies, sub-threshold current is the largest of all the leakage currents, and
can be computed using the following expression [9]:
$I_{SUB} = K \cdot \left(1 - e^{-V_{DS}/\nu_T}\right) \cdot e^{\left(V_{GS} - V_T + \eta V_{DS}\right)/ n\nu_T}$  (1.4)

where K and n are functions of the technology, and η is the drain-induced barrier
lowering (DIBL) coefficient, an effect that manifests itself in short-channel MOSFET
devices. The transistor sub-threshold swing coefficient is modeled with the
parameter n, while νT represents the thermal voltage, given by kT/q (~33mV at
110°C).
Figure 1-3: Main leakage components for a MOS transistor.


In the sub-threshold region, the MOSFET behaves primarily as a bipolar
transistor, and the sub-threshold current is exponentially dependent on VGS (Eq. 1.4).
Another important figure of merit for low-power CMOS is the sub-threshold
slope, which is the amount of voltage required to drop the sub-threshold current
by one decade. The lower the sub-threshold slope, the better, since the transistor
can be turned OFF more effectively when VGS is reduced below VT.
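To make this exponential dependence concrete, here is a minimal numeric sketch of Eq. 1.4; the values of K, n and η are assumptions for illustration, not process data from this thesis:

```python
import math

# Illustrative sub-threshold leakage sketch based on Eq. 1.4.
# All parameter values are assumptions for illustration, not thesis data.
K   = 1e-6    # technology-dependent prefactor (A), assumed
n   = 1.5     # sub-threshold swing coefficient, assumed
v_T = 0.033   # thermal voltage kT/q at ~110 C (V)
eta = 0.1     # DIBL coefficient, assumed
V_dd = 1.2    # supply voltage (V)

def i_sub(v_gs: float, v_t: float, v_ds: float = V_dd) -> float:
    """OFF/weak-inversion drain current per the first-order model of Eq. 1.4."""
    return (K * (1.0 - math.exp(-v_ds / v_T))
              * math.exp((v_gs - v_t + eta * v_ds) / (n * v_T)))

# OFF-state leakage (VGS = 0) for two threshold voltages:
for vt in (0.40, 0.30):
    print(f"VT = {vt:.2f} V -> Ioff = {i_sub(0.0, vt):.2e} A")

# Sub-threshold slope S = n * vT * ln(10): voltage for one decade of current.
S = n * v_T * math.log(10)
print(f"S = {1000*S:.0f} mV/decade; a 100 mV drop in VT raises Ioff ~{10**(0.1/S):.1f}x")
```

With these assumed parameters the swing is roughly 114 mV/decade, so a 100 mV reduction in VT raises the OFF-state current by roughly 7.5X, illustrating why even modest VT scaling produces the steep leakage growth discussed in the next section.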

1.3 Technology scaling trends and challenges


The scaling of CMOS VLSI technology is driven by Moore’s law, and is the
primary factor driving speed and performance improvement of both
microprocessors and memories. The term “scaling” refers to the reduction in
transistor width, length and oxide dimensions by 30%. Historically, CMOS
technology, when scaled to the next generation, (1) reduces gate delay by 30%
allowing a 43% increase in clock frequency, (2) doubles the device density (as
shown in Figure 1-1), (3) reduces the parasitic capacitance by 30% and (4)
reduces the energy per transition by 65% and the active power by 50%,
respectively. The key barriers to continued scaling of supply voltage and
technology for microprocessors to achieve low-power and high-performance are
well documented in [10]-[13].
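For reference, these often-quoted percentages follow directly from an ideal constant-field shrink by a linear factor of 0.7 per generation; the derivation below is a first-order sketch under idealized scaling assumptions, not measured data:

```latex
% Ideal constant-field scaling with linear shrink factor 0.7 per generation
\begin{align*}
  t_{delay} &\rightarrow 0.7\, t_{delay}
     &&\Rightarrow\; f_{clk} \times \tfrac{1}{0.7} \approx 1.43 \;\;(+43\%)\\
  A_{device} &\rightarrow 0.7^{2}\, A_{device} \approx 0.5\, A_{device}
     &&\Rightarrow\; \text{device density} \times 2\\
  E_{transition} &= C\,V_{dd}^{2} \rightarrow 0.7 \times 0.7^{2} \approx 0.34
     &&\Rightarrow\; \approx 65\%\ \text{lower energy per transition}\\
  P_{dynamic} &= \alpha f_{clk} C V_{dd}^{2} \rightarrow 1.43 \times 0.34 \approx 0.5
     &&\Rightarrow\; \approx 50\%\ \text{lower active power per gate}
\end{align*}
```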
1.3.1 Technology scaling: Impact on Power
The rapid increase in the number of transistors on chips has enabled a
dramatic increase in the performance of computing systems. However, the
performance improvement has been accompanied by an increase in power
dissipation, thus requiring more expensive packaging and cooling technology.
Figure 1-4 shows the power trends of Intel microprocessors, with the dotted line
showing the power trend with classic scaling. Historically, the primary
contributor to power dissipation in CMOS circuits has been the charging and
discharging of load capacitances, often referred to as the dynamic power
dissipation. As shown by Eq. 1.2, this component of power varies as the square
of the supply voltage (Vdd). Therefore, in the past, chip designers have relied on
scaling down Vdd to reduce the dynamic power dissipation. Despite scaling Vdd,
the total power dissipation of microprocessors is expected to increase
exponentially, from about 100 W in 2005 to over 2 kW by the end of the decade,
with a substantial increase in sub-threshold leakage power [10].

Figure 1-4: Power trends for Intel microprocessors.


Maintaining transistor switching speeds (constant electric field scaling)
requires a proportionate downscaling of the transistor threshold voltage (VT) in
lock step with the Vdd reduction. However, scaling VT results in a significant
increase of leakage power due to an exponential increase in the sub-threshold
leakage current (Eq. 1.4). Figure 1-5 illustrates the increasing leakage power
trend for Intel microprocessors in different CMOS technologies, with leakage
power exceeding 50% of the total power budget at the 70nm technology node [13].
Figure 1-6 is a semi-log plot that shows estimated sub-threshold leakage
currents for future technologies as a function of temperature. Borkar in [11]
predicts a 7.5X increase in the leakage current and a 5X increase in total energy
dissipation for every new microprocessor chip generation!
Figure 1-5: Dynamic and Static power trends for Intel microprocessors.

Figure 1-6: Sub-threshold leakage current as a function of temperature.


Note that it is possible to substantially reduce the leakage power, and hence
the overall power, by reducing the die temperature. Therefore better cooling
techniques would be more critical in the advanced deep submicron technologies
to control both active leakage and total power. Leakage current reduction can
also be achieved by process and/or circuit techniques.
At the process level, channel engineering is used to optimize the device doping
profile for reduced leakage. The dual-VT technology [14] is used to reduce total
sub-threshold leakage power, where NMOS and PMOS devices with both high
and low threshold voltages are made by selectively adjusting well doses. The
technique assigns low-VT transistors to high-performance critical paths, while
non-critical paths are implemented with high-VT transistors, at the cost of
additional process complexity. Over 80% reduction in leakage power has been
reported in [15], while meeting performance goals.
In the last decade, a number of circuit solutions for leakage control have been
proposed. One solution is to force a non-stack device to a stack of two devices
without affecting the input load [16], as shown in Figure 1-7(a). This
significantly reduces sub-threshold leakage, but will incur a delay penalty,
similar to replacing a low-VT device with a high-VT device in a dual-VT design.
Combined dynamic body bias and sleep transistor techniques for active leakage
power control have been described in [17]. Sub-threshold leakage can be
reduced by dynamically changing the body bias applied to the block, as shown
in Figure 1-7(b). During active mode, forward body bias (FBB) is applied to
increase the operating frequency. When the block enters idle mode, the forward
bias is withdrawn, reducing the leakage. Alternately, reverse body bias (RBB)
can be applied during idle mode for further leakage savings.
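The mechanism behind body bias is the body effect on the threshold voltage; the standard first-order relation below is a textbook expression (NMOS convention), not derived in this thesis:

```latex
% Body effect on the threshold voltage
% gamma: body-effect coefficient, phi_F: Fermi potential, V_SB: source-to-body bias
V_T = V_{T0} + \gamma\left(\sqrt{2\phi_F + V_{SB}} - \sqrt{2\phi_F}\right)
```

Reverse body bias (VSB > 0) raises VT and, through Eq. 1.4, reduces sub-threshold leakage exponentially; forward body bias (VSB < 0) lowers VT, increasing drive current and operating frequency at the cost of higher leakage.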
(a) Stack Effect (b) Body Bias (c) Sleep Transistor
Figure 1-7: Three circuit level leakage reduction techniques.


The supply gating or sleep transistor technique uses a high-threshold transistor, as
shown in Figure 1-7(c), to cut off the supply to a functional block, when the
design is in an “idle” or “standby” state. A 37X reduction in leakage power, with
a block reactivation time of less than two clock cycles has been reported in [17].
While this technique can reduce leakage by orders of magnitude, it causes
performance degradation and complicates power grid routing.
1.3.2 Technology scaling: Impact on Interconnects
In deep-submicron designs, chip performance is increasingly limited by the
interconnect delay. With scaling, the width and thickness of the interconnections
are reduced. As a result, the resistance increases; and as the interconnections get
closer, the capacitance increases, increasing the RC delay. In addition, the cross-
coupling capacitance between adjacent wires is also increasing with each
technology generation [18]. New designs add more transistors on chip and the
average die size of a chip has been increasing over time. To account for
increased RC parasitics, more interconnect layers are being added. The thinner,
tighter interconnect layers are used for local interconnections, and the new
thicker and sparser layers are used for global interconnections and power
distribution. Copper metallization has been used to reduce the resistance of
interconnects, and fluorinated SiO2 is used as the inter-level dielectric (ILD) to reduce the
dielectric constant (k=3.6). Figure 1-8 is a cross-section SEM image showing
the interconnect structure in the 65nm technology node [19].
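A first-order sketch (idealized geometric scaling assumptions, not extracted data) shows why wire delay worsens even as gates get faster: if wire width, thickness and spacing all shrink by 0.7x while the dielectric is unchanged, then per unit length

```latex
% First-order interconnect scaling, linear shrink 0.7, dielectric unchanged
\begin{align*}
  r' &= \frac{\rho}{(0.7w)(0.7t)} \approx 2\,r
     && \text{(resistance per unit length doubles)}\\
  c' &\approx c
     && \text{(capacitance per unit length roughly constant)}\\
  \tau'_{wire}(L) &\propto r'c'L^{2} \approx 2\,\tau_{wire}(L)
     && \text{(a fixed-length global wire gets } \sim 2\times \text{ slower)}\\
  \tau'_{gate} &\approx 0.7\,\tau_{gate}
     && \text{(while gates get } \sim 30\%\ \text{faster)}
\end{align*}
```

Repeaters and the thicker upper metal layers mitigate this, but the widening gate-versus-wire gap shown in Figure 1-10 is the direct consequence.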

Figure 1-8: Intel 65 nm, 8-metal copper CMOS technology in 2004.


Technology trends show that global on-chip wire delays are growing
significantly, increasing cross-chip communication latencies to several clock
cycles and rendering the expected chip area reachable in a single cycle to be less
than 1% in a 35nm technology, as shown in Figure 1-9 [20]. Figure 1-10 shows
the projected relative delay taken from the ITRS roadmap for local wires, global
wires (with and without repeaters), and logic gates in the near future. With gate
delays reducing by 30% and on-chip interconnect delays increasing by 30% every
technology generation, the cost gap between computation and communication is
widening. Inductive noise and the skin effect become more pronounced as
frequencies reach multi-GHz levels. Both circuit and layout solutions will
be required to contain the inductive and capacitive coupling effects. In addition,
interconnects now dissipate an increasingly larger portion of total chip power
[22]. All of these trends indicate that interconnect delays and power in sub-
65nm technologies will continue to dominate the overall chip performance.

Figure 1-9: Projected fraction of chip reachable in one cycle with an 8 FO4
clock period [20].

Figure 1-10: Current and projected relative delays for local and global
wires and for logic gates in nanometer technologies [25].

The challenge for chip designers is to come up with new architectures that
achieve both a fast clock rate and high concurrency, despite slow global wires.
Shared bus networks (Figure 1-11a) are well understood and widely used in
SoCs, but have serious scalability issues as more bus masters (>10) are added.
To mitigate the global interconnect problem, new structured wire-delay scalable
on-chip communication fabrics, called network-on-chip (NoCs), have emerged
for use in SoC designs. The basic concept is to replace today’s shared buses with
on-chip packet-switched interconnection networks (Figure 1-11b) [22]. The
point-to-point network overcomes the scalability issues of shared-medium
networks.

Figure 1-11: (a) Traditional bus-based communication; (b) an on-chip point-to-point network of IP/compute cores connected through routers, network interfaces, and links.

NoC architectures may have a homogeneous or heterogeneous structure. The


NoC architecture shown in Figure 1-12 is homogeneous and consists of the basic
building block, the “network tile”. These tiles are connected to an on-chip
network that routes packets between them. Each tile may consist of one or more
compute cores (microprocessor) or memory cores and would also have routing

logic responsible for routing and forwarding the packets, based on the routing
policy of the network. The dedicated point-to-point links used in the network are
optimal in terms of bandwidth availability, latency, and power consumption.
The structured network wiring of such a NoC design gives well-controlled
electrical parameters that simplify timing and allow the use of high-
performance circuits to reduce latency and increase bandwidth. An excellent
introduction to NoCs, including a survey of research and practices, is found in
[24]–[25].

Network Tile
Links

Router

IP Core

Figure 1-12: Homogenous NoC Architecture.

1.3.3 Technology scaling: Summary


We discussed trends in CMOS VLSI technology. The data indicates that
silicon performance, integration density, and power have followed the scaling
theory. The main challenges to VLSI scaling include lithography, transistor
scaling, interconnect scaling, and increasing power delivery and dissipation.
Modular on-chip networks are required to resolve interconnect scaling issues.
As the MOSFET channel length is reduced to 45nm and below, suppression of
the ever increasing off-state leakage currents becomes crucial, requiring
leakage-aware circuit techniques and newer bulk CMOS compatible device
structures. While these challenges remain to be overcome, no fundamental
barrier exists to scaling CMOS devices well into the nano-CMOS era, with a
physical gate length of under 10nm. Predictions from the 2005 ITRS roadmap
[8] indicate that “Moore’s law” should continue well into the next decade before
the ultimate device limits for CMOS are reached.

1.4 Motivation and Scope of this Thesis


Very recently, chip designers have moved from a computation-centric view of
chip design to a communication-centric view, owing to the widening cost gap
between interconnect delay and gate delay in deep-submicron technologies. As
this discrepancy between device delay and on-chip communication latency
becomes macroscopic, the practical solution is to use scalable NoC
architectures. NoC architectures, with structured on-chip networks are emerging
as a scalable and modular solution to global communications within large
systems-on-chip. NoCs mitigate the emerging wire-delay problem and address
the need for substantial interconnect bandwidth by replacing today’s shared
buses with packet-switched router networks.
This work focuses on building blocks critical to the success of NoC designs.
Research into high performance, area and energy efficient router architectures
allows for larger router networks to be integrated on a single die with reduced
power consumption. We next turn our attention to identifying and resolving
computation throughput issues in conventional floating-point units by proposing
a new single-cycle FPMAC algorithm capable of a sustained multiply-add result
every cycle at GHz frequencies. The building blocks are integrated into a large
monolithic 80-tile teraFLOP NoC multiprocessor. Such a NoC prototype will
demonstrate that a computational fabric built using optimized building blocks
can provide high levels of performance in an energy efficient manner. In
addition, silicon results from such a prototype would provide in-depth
understanding of performance and energy tradeoffs in designing successful NoC
architectures well into the nanometer regime.
With on-chip communication consuming a significant portion of the chip
power (about 40% [23]) and area budgets, there is a compelling need for
compact, low power routers. While applications dictate the choice of the
compute core, the advent of multimedia applications, such as three-dimensional
(3D) graphics and signal processing, places stronger demands for self-contained,
low-latency floating-point processors with increased throughput. The thesis
details an integrated 80-Tile NoC architecture implemented in a 65-nm process
technology. The prototype is designed to deliver over 1.0 TFLOPS of
performance while dissipating less than 100W.
This thesis first presents a six-port four-lane 57 GB/s non-blocking router
core based on wormhole switching. The router features double-pumped crossbar
channels and destination-aware channel drivers that dynamically configure
based on the current packet destination. This enables 45% reduction in crossbar
channel area, 23% overall router area, up to 3.8X reduction in peak channel
power, and 7.2% improvement in average channel power. In a 150nm six-metal

CMOS process, the 12.2mm2 router contains 1.9 million transistors and operates
at 1GHz at 1.2V.
We next present a new pipelined single-precision floating-point multiply
accumulator core (FPMAC) featuring a single-cycle accumulation loop using
base 32 and internal carry-save arithmetic, with delayed addition techniques. A
combination of algorithmic, logic and circuit techniques enable multiply-
accumulate operations at speeds exceeding 3GHz, with single-cycle throughput.
This approach reduces the latency of dependent FPMAC instructions and
enables a sustained multiply-add result (2 FLOPS) every cycle. The
optimizations allow removal of the costly normalization step from the critical
accumulation loop; the normalization logic is conditionally powered down using dynamic sleep
transistors during long accumulate operations, saving active and leakage power. In a
90nm seven-metal dual-VT CMOS process, the 2mm2 custom design contains
230K transistors. Silicon achieves 6.2 GFLOPS of performance while
dissipating 1.2W at 3.1GHz, 1.3V supply.
We finally describe an integrated NoC architecture containing 80 tiles
arranged as an 8×10 2D array of floating-point cores and packet-switched
routers, both designed to operate at 4 GHz. Each tile has two pipelined single-
precision FPMAC units which feature a single-cycle accumulation loop for high
throughput. The five-port router combines 100GB/s of raw bandwidth with fall-
through latency under 1ns. The on-chip 2D mesh network provides a bisection
bandwidth of 2 Tera-bits/s. The 15-FO4 design employs mesochronous
clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias
techniques. In a 65-nm eight-metal CMOS process, the 275 mm2 custom design
contains 100-M transistors. The fully functional first silicon achieves over
1.0TFLOPS of performance on a range of benchmarks while dissipating 97 W at
4.27 GHz and 1.07-V supply.

1.5 Organization of this Thesis


This thesis is organized into four parts:
• Part I – Introduction
• Part II – NoC Building Blocks
• Part III – An 80-Tile TeraFLOPS NoC
• Part IV– Papers
Part I provides the necessary background for the concepts used in the papers.
This chapter reviews trends in Silicon CMOS technology and highlights the
main challenges involved in keeping up with Moore’s law. Properties of CMOS
devices, sources of leakage power and the impact of scaling on power and

interconnect are discussed. This is followed by a brief discussion of NoC


architectures.
In Part II, we describe the two NoC building blocks in detail. Introductory
concepts specific to on-die interconnection networks, including a generic router
architecture is presented in Chapter 2. A more detailed description of the six-
port four-lane 57 GB/s non-blocking router core design (Paper 1) is also given.
Chapter 3 presents basic concepts involved in floating-point arithmetic, reviews
conventional floating-point units (FPU), and describes the challenges in
accomplishing single-cycle accumulation on today’s FPUs. A new pipelined
single-precision FPMAC design, capable of single-cycle accumulation, is
described in Paper 2 and Paper 3. Paper 2 first presents the FPMAC algorithm,
logic optimizations and preliminary chip simulation results. Paper 3 introduces
the concept of “Conditional Normalization”, applicable to floating point
units, for the first time. It also describes a 6.2 GFLOPS Floating Point Multiply-
Accumulator, enhanced with conditional normalization pipeline and with
detailed results from silicon measurement.
In Part III (Chapter 4), we bring together all of our work to realize the 80-Tile
TeraFLOPS NoC processor. The motivation and application space for tera-
scale computers are also discussed. Paper 4 first introduces the 80-tile NoC
architecture details with initial chip simulation results. In Paper 5, we focus
on the on-chip mesh network, router and mesochronous links used in the
TeraFLOPS NoC. Chapter 4 builds on both papers and describes the NoC
prototype in significant detail and includes results from chip measurement. We
conclude in Chapter 5 with final remarks and directions for future research.
Finally in Part IV the papers, included in this thesis, are presented in full.

1.6 References
[1]. R. Smolan and J. Erwitt, One Digital Day – How the Microchip is
Changing Our World, Random House, 1998.

[2]. G.E. Moore, “Cramming more components onto integrated circuits”, in


Electronics, vol. 38, no. 8, 1965.

[3]. http://www.icknowledge.com, April. 2006.

[4]. http://www.intel.com/technology/mooreslaw/index.htm, April. 2006.

[5]. F. Wanlass and C. Sah, “Nanowatt logic using field-effect metal-oxide


semiconductor triodes,” ISSCC Digest of Technical Papers, Volume VI,
pp. 32 – 33, Feb 1963.

[6]. A. Chandrakasan and R. Brodersen, “Low Power Digital CMOS Design”,


Kluwer Academic Publishers, 1995.

[7]. H.J.M. Veendrick, “Short-Circuit Dissipation of Static CMOS Circuitry


and its Impact on the Design of Buffer Circuits”, in IEEE Journal of Solid-
State Circuits, vol. 19, no. 4, pp. 468-473, August 1984.

[8]. “International Technology Roadmap for Semiconductors.”


http://public.itrs.net, April 2006.

[9]. A. Ferre and J. Figueras, “Characterization of leakage power in CMOS


technologies,” in Proc. IEEE Int. Conf. on Electronics, Circuits and
Systems, vol. 2, 1998, pp. 85–188.

[10]. S. Borkar, “Design challenges of technology scaling”, IEEE Micro,


Volume 19, Issue 4, pp. 23 – 29, Jul-Aug 1999

[11]. V. De and S. Borkar, “Technology and Design Challenges for Low Power
and High-Performance”, in Proceedings of 1999 International Symposium
on Low Power Electronics and Design, pp. 163-168, 1999.

[12]. S. Rusu, “Trends and challenges in VLSI technology scaling towards


100nm”, Proceedings of the 27th European Solid-State Circuits
Conference, pp. 194 – 196, Sept. 2001.

[13]. R. Krishnamurthy, A. Alvandpour, V. De, and S. Borkar, “High


performance and low-power challenges for sub-70-nm microprocessor
circuits,” in Proc. Custom Integrated Circuits Conf., 2002, pp. 125–128.

[14]. J. Kao and A. Chandrakasan, “Dual-threshold voltage techniques for low-


power digital circuits,” IEEE J. Solid-State Circuits, vol. 35, pp. 1009–
1018, July 2000.

[15]. L. Wei, Z. Chen, K. Roy, M. Johnson, Y. Ye, and V. De, “Design and
optimization of dual-threshold circuits for low-voltage low-power
applications”, IEEE Transactions on VLSI Systems, Volume 7, pp. 16–
24, March 1999.

[16]. S. Narendra, S. Borkar, V. De, D. Antoniadis and A. Chandrakasan,


“Scaling of stack effect and its application for leakage reduction”,
Proceedings of ISLPED '01, pp. 195 – 200, Aug. 2001.

[17]. J.W. Tschanz, S.G. Narendra, Y. Ye, B.A. Bloechel, S. Borkar and V. De,
“Dynamic sleep transistor and body bias for active leakage power control
of microprocessors” IEEE Journal of Solid-State Circuits, Nov. 2003, pp.
1838–1845.

[18]. J.D. Meindl, “Beyond Moore’s Law: The Interconnect Era”, in


Computing in Science & Engineering, vol. 5, no. 1, pp. 20-24, January
2003.

[19]. P. Bai, et al., "A 65nm Logic Technology Featuring 35nm Gate Lengths,
Enhanced Channel Strain, 8 Cu Interconnect Layers, Low-k ILD and 0.57
um2 SRAM Cell," International Electron Devices Meeting, 2004, pp. 657
– 660.

[20]. S. Keckler, D. Burger, C. Moore, R. Nagarajan, K. Sankaralingam, V.


Agarwal, M. Hrishikesh, N. Ranganathan, and P. Shivakumar, “A wire-delay
scalable microprocessor architecture for high performance systems,” ISSCC
Digest of Technical Papers, Feb. 2003, pp.168–169.

[21]. R. Ho, K. W. Mai, and M. A. Horowitz, “The Future of Wires,” in


Proceedings of the IEEE, pp. 490–504, 2001.

[22]. W. J. Dally and B. Towles, “Route Packets, Not Wires: On-Chip


Interconnection Networks,” in Proceedings of the 38th Design Automation
Conference, pp. 681-689, June 2001.

[23]. H. Wang, L. Peh, S. Malik, “Power-driven design of router


microarchitectures in on-chip networks,” MICRO-36, Proceedings of 36th
Annual IEEE/ACM International Symposium on Microarchitecture, pp.
105-116, 2003.

[24]. L. Benini and G. D. Micheli, “Networks on Chips: A New SoC


Paradigm,” IEEE Computer, vol. 35, no. 1, pp. 70–78, 2002.

[25]. T. Bjerregaard and S. Mahadevan, "A survey of research and practices of


Network-on-chip", ACM Computing Surveys (CSUR), Vol. 38, Issue 1,
Article No. 1, 2006.
Part II

NoC Building Blocks

Chapter 2

On-Chip Interconnection Networks

As VLSI technology scales, and processing power continues to improve,


inter-processor communication becomes a performance bottleneck. On-chip
networks have been widely proposed as the interconnect fabric for high-
performance SoCs [1], and their benefits have been demonstrated in several chip
multiprocessors (CMPs) [2]–[3]. Recently, NoC architectures have emerged as
a candidate for a highly scalable, reliable, and modular on-chip communication
infrastructure platform. The NoC architecture uses layered protocols and packet-
switched networks which consist of on-chip routers, links, and well defined
network interfaces. With the increasing demand for interconnect bandwidth, on-
chip networks are taking up a substantial portion of system power budget. A
case in point: the MIT Raw [2] on-chip network, which connects 16 tiles of
processing elements, consumes 36% of total chip power, with each router
dissipating 40% of individual tile power. The routers and the links of the Alpha
21364 microprocessor consume about 20% of the total chip power [4]. These
numbers indicate the significance of managing the interconnect power
consumption. In addition, any NoC architecture should fit into the limited
silicon budget with an optimal choice of NoC fabric topology providing high
bisection bandwidth, efficient routing algorithms and compact low-power router
implementations.


This chapter first presents introductory concepts specific to on-chip networks.


Topics include network topologies, crossbar switches, and message switching
techniques. The internal functional blocks for a canonical router architecture are
also described. A more detailed description of a six-port four-lane 57 GB/s non-
blocking router core design (Paper 1), with area and peak-energy benefits, is
also presented.

2.1 Interconnection Network Fundamentals


An interconnection network consists of nodes (or routers) and links (or
fabric), and can be broadly classified into direct or indirect networks [5]. A
direct network consists of a set of nodes, each one being directly connected to a
(usually small) subset of other nodes in the network. A common component of
each node in such a network is a dedicated router, which handles all message
communication between nodes, as shown in Figure 2-1. Usually two
neighboring nodes are connected by a pair of uni-directional links in opposite
directions. As the number of nodes in the system increases, the total
communication bandwidth also increases. This excellent scaling property has
made direct networks very popular in constructing large-scale NoC designs. The
MIT Raw [2] design is an example of direct network. In an indirect network,
communication between any two nodes must go through a set of switches.
Figure 2-1: Direct and indirect networks (processing nodes, network interfaces/switches, and switching elements).


All modern day on-chip networks are buffered, i.e., the routers contain
storage for buffering messages when they are unable to obtain an outgoing link.

An interconnection network can be defined by four parameters: (1) its topology,


(2) the routing algorithm governing it, (3) the message switching protocol, and
(4) the router micro-architecture.
2.1.1 Network Topology
The topology of a network concerns how the nodes and links are connected. A
topology dictates the number of alternate paths between nodes, and thus how
well the network can handle contention and different traffic patterns. Scalability
is an important factor in the selection of the right topology for on-chip networks.
Figure 2-2 shows various popular on-chip topologies which have been commercially
adopted [5].
Shared bus networks (Figure 2-2a) are the simplest, consisting of a shared link
common to all nodes. All nodes have to compete for exclusive access to the bus.
While simple, bus-based systems scale very poorly as more nodes are added. In
a ring (or 1D torus) every node has exactly two neighbors and allows more
concurrent transfers. Since all nodes are not directly connected, messages will
have to hop along intermediate nodes until they arrive at the final destination.
This causes the ring to saturate at a lower network throughput for most traffic
patterns. The 2D mesh/torus, crossbar and hypercube topologies are examples of
direct networks, and thus provide tremendous improvement in performance, but
at a cost, typically increasing as the square of the number of nodes.
Figure 2-2: Popular on-chip fabric topologies: (a) shared bus, (b) 1D torus or ring, (c) 2D mesh, (d) 2D torus, (e) crossbar, (f) hypercube.

A crossbar (Figure 2-2e) is a fully connected topology, i.e., the


interconnection allows any node to directly communicate with any other node.

A crossbar network example connecting p processors to b memory banks is


shown in Figure 2-3. Note that it is a non-blocking network, i.e., a connection of
one processor to a given memory bank does not block a connection of another
processor to a different memory bank.
Figure 2-3: A fully connected non-blocking crossbar switch connecting p processing elements to b memory banks through a grid of switching elements.

The above crossbar uses p × b switches. The hardware cost of a crossbar is
high, at least O(p²) if b = p. An important observation is that the crossbar is not
very scalable: its cost increases quadratically as more nodes are added.
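
A small illustrative calculation (not part of the thesis) makes this quadratic growth concrete by counting the p × b crosspoints as more nodes are attached.

```python
# Crosspoint count of a fully connected crossbar.  With b = p the switch
# cost grows as O(p^2), which is why large single-chip crossbars quickly
# become area-limited.

def crossbar_crosspoints(p: int, b: int) -> int:
    """Number of switching elements in a p-by-b crossbar."""
    return p * b

for p in (4, 8, 16, 32, 64):
    print(f"p = {p:2d}: crosspoints = {crossbar_crosspoints(p, p)}")
```

Doubling the number of nodes quadruples the switch cost, whereas the link count of a 2D mesh grows only linearly with the number of nodes.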
2.1.2 Message Switching Protocols
The message switching protocol determines when a message gets buffered,
and when it gets to continue. The goal is to effectively share the network
resources among the many messages traversing the network, so as to approach
the best latency-throughput performance.
(1) Circuit vs. packet switching: In circuit switching a physical path from
source to destination is reserved prior to the transmission of data, as in telephone
networks. The complete message is transmitted along the pre-reserved path that
is established. While circuit switching reduces network latency, it does so at the
expense of network throughput. Alternatively, a message can be decomposed
into packets which share channels with other packets. The first few bytes of a
packet, called the packet header, contain routing and control information. Packet

switching improves channel utilization and extends network throughput. Packet-


based flow control methods dominate interconnection networks today because
buffers and channels are limited resources that need
to be allocated efficiently.
(2) Virtual Cut-Through (VCT) Switching: To reduce packet delay at each
hop, virtual cut-through switching allows transmission of a packet to begin
before the entire packet is received. Latency experienced by a packet is thus
drastically reduced. However, bandwidth and storage are still allocated in
packet-sized units. Packets can move forward only if there is enough storage to
hold the entire packet. Examples of routers adopting VCT switching are the
IBM SP/2 switch [6] and Mercury router described in [7].
(3) Wormhole Switching: In a wormhole routing scheme, a packet is divided
into “FLITs”, or “FLow control unITs”, for transmission. A FLIT is the
minimum divisible amount of data transmitted within a packet that the control
logic in the router can process. As the header flit containing routing information
advances along the network, the message flits follow in a pipelined manner. The
lower packet latency and greater throughput of wormhole routers increases the
efficiency of inter-processor communication. In addition, wormhole routers have
substantially reduced buffering requirements [8], thus enabling small, compact
and fast router designs. The simplicity, low cost, and distance-insensitivity of
wormhole switching are the main reasons behind its wide acceptance by
manufacturers of commercial parallel machines. Several high-bandwidth low-
latency routers based on wormhole switching have been presented in [8]–[10].
2.1.3 Virtual Lanes
Virtual lanes (or logical channels) [11] improve upon the channel utilization
of wormhole flow control, allowing blocked packets to be passed by other
packets. This is accomplished by associating several virtual channels, each with
a separate flit queue, with each physical channel. Virtual channels arbitrate for
physical channel bandwidth on a flit-by-flit basis. When a packet holding a
virtual channel gets blocked, other packets can still traverse the physical channel
through other virtual channels. Virtual lanes were originally introduced to solve
the deadlock avoidance problem, but they can be also used to improve network
latency and throughput, as illustrated in Figure 2-4. Assume P0 arrived earlier
and acquired the channel between the two routers first. In the absence of virtual
channels, packet P1, arriving later, would be blocked until the transmission of P0
has been completed. Now assume that the physical channel implements two virtual
channels. Upon arrival of P1, the physical channel is multiplexed between them
on a flit-by-flit basis; both packets proceed at half speed, and both
messages continue to make progress. In effect, virtual lanes decouple the

physical channels from message buffers, allowing multiple messages to share the
same physical channel in the same manner that multiple programs may share a
central processing unit (CPU).

Figure 2-4: Reduction in header blocking delay using virtual channels.


On the downside, the use of virtual channels, while reducing header blocking
delays in the network, increases design complexity of the link controller and
flow control mechanisms.
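
The flit-by-flit sharing illustrated in Figure 2-4 can be mimicked with a toy model (illustrative only; the packet and flit names are invented for the sketch) in which two virtual channels take turns placing flits on one physical channel.

```python
from collections import deque

# Toy model of two virtual channels sharing one physical channel: the link
# is multiplexed between non-empty virtual channels on a flit-by-flit
# basis, so neither packet is blocked behind the other.

p0 = deque(f"P0.flit{i}" for i in range(4))   # packet P0, four flits
p1 = deque(f"P1.flit{i}" for i in range(4))   # packet P1, four flits
virtual_channels = [p0, p1]

cycle = 0
while any(virtual_channels):
    vc = virtual_channels[cycle % len(virtual_channels)]
    if vc:
        print(f"cycle {cycle}: link carries {vc.popleft()}")
    cycle += 1
```

Both packets complete in roughly twice their unshared transmission time, instead of P1 stalling behind the whole of P0.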
2.1.4 A Generic Router Architecture
Router architecture is largely determined by the choice of switching
technique. Most modern routers used in high-performance multiprocessor
architectures utilize packet switching or cut-through switching techniques
(wormhole and virtual cut-through) with either a deterministic or an adaptive
routing algorithm. Several high-bandwidth low-latency routers have been
designed [12]–[14]. A generic wormhole switched router (Figure 2-5) has the
following key components [5]:
1. Link Controller (LC). This block implements flow control across the physical
channel between adjacent routers. The link controllers on either side of the link
coordinate to transfer flow control units. In the presence of virtual channels, this
unit also decodes the destination channel of the received flit.
2. Buffers. These are first-in-first out (FIFO) buffers for storing packets in
transit and are associated with input and output physical links. Routers may have
buffers only on inputs (input buffered) or outputs (output buffered). Buffer size
is critical and must account for link delays in propagation of data and flow
control signals.
3. Crossbar switch. This component is responsible for connecting router input
buffers to the output buffers. High-speed routers utilize crossbar networks with
full connectivity, enabling multiple and simultaneous message transfers. In this
thesis, we also use the term crossbar channel to represent the crossbar switch.

4. Routing and arbitration unit. This unit implements the routing function and
the task of selecting an output link for an incoming message. Conflicts for the
same output link must be arbitrated. Fast arbitration policies are crucial to
maintaining a low latency through the switch. This unit may also be replicated to
reduce arbitration delay.
Figure 2-5: Canonical router architecture: link controllers (LC) on the physical input and output links, input and output queues with virtual channels (VC), a crossbar switch, routing and arbitration logic, and an interface to/from the local processor.


5. Virtual channel controller (VC). This unit is responsible for multiplexing the
contents of the virtual channels onto the physical links. Several efficient
arbitration schemes have been proposed [15].
6. Processor Interface. This block implements a physical link interface to the
processor and contains one or more injection or ejection channels to the
processor.
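
To tie the components above together, here is a minimal behavioral sketch of an input-buffered router (an abstraction for illustration only: it uses fixed-priority arbitration and omits virtual channels, link-level flow control, and the processor interface).

```python
from collections import deque

# Minimal behavioral sketch of an input-buffered router: per-port input
# FIFOs, a routing function mapping a head flit to an output port, and a
# crossbar step that moves at most one flit per output port per cycle.
# Arbitration is fixed-priority here for brevity; the router in Section
# 2.2 uses round-robin arbitration.

class Router:
    def __init__(self, num_ports, route_fn):
        self.inputs = [deque() for _ in range(num_ports)]   # input buffers
        self.route_fn = route_fn                            # routing unit

    def accept(self, port, flit):
        """Link controller: write an incoming flit into the port's FIFO."""
        self.inputs[port].append(flit)

    def cycle(self):
        """One switch cycle: route, arbitrate, and transfer through the crossbar."""
        transfers = {}                      # output port -> (input port, flit)
        for p, fifo in enumerate(self.inputs):
            if fifo:
                out = self.route_fn(fifo[0])
                if out not in transfers:    # output not yet claimed this cycle
                    transfers[out] = (p, fifo.popleft())
        return transfers

# Example: route each flit by the 'dest' field of a (dest, payload) tuple.
r = Router(6, route_fn=lambda flit: flit[0])
r.accept(0, (5, "head"))
r.accept(2, (5, "head"))                    # contends for output port 5
print(r.cycle())                            # only one request wins this cycle
```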

2.2 A Six-Port 57GB/s Crossbar Router Core


We now describe a six-port four-lane 57GB/s router core that features double-
pumped crossbar channels and destination-aware channel drivers that
dynamically configure based on the current packet destination. This enables
45% reduction in channel area, 23% overall chip area, up to 3.8X reduction in
peak channel power, and 7.2% improvement in average channel power, with no
performance penalty. In a 150nm six-metal CMOS process, the 12.2mm2 core
contains 1.9 million transistors and operates at 1GHz at 1.2V.

2.2.1 Introduction
As described in Section 2.1.2, crossbar routers based on wormhole switching
are a popular choice to interconnect large multiprocessor systems. Although
small crossbar routers are easily implemented on a single chip, area limitations
have constrained cost-effective single chip realization of larger crossbars with
bigger data path widths [16]–[17], since switch area increases as a square
function of the total number of I/O ports and number of bits per port. In
addition, larger crossbars exacerbate on-chip simultaneous-
switching noise [18] because of the increased number of data drivers. This work
focuses on the design and implementation of an area and peak power efficient
multiprocessor communication router core.
The rest of this chapter is organized as follows. The first three sections
(2.2.2–2.2.4) present the router architecture including the core pipeline, packet
structure, routing protocol and flow control descriptions. Section 2.2.5 describes
the design and layout challenges faced during the implementation of an earlier
version of the crossbar router, which forms the motivation for this work. Section
2.2.6 presents two specific design enhancements to the crossbar router. Section
2.2.7 describes the double-pumped crossbar channel, and its operation. Section
2.2.8 explains details of the location-based channel drivers used in the crossbar.
These drivers are destination-aware and have the ability to dynamically
configure based on the current packet destination. Section 2.2.9 presents chip
results and compares this work with prior crossbar work in terms of area,
performance and energy. We conclude in Section 2.3 with some final remarks
and directions for future research.
2.2.2 Router Micro-architecture
The architecture of the crossbar router is shown in Figure 2-6, which is based
on the router described in [19], and forms the reference design for this work.
The design is a wormhole switched, input-buffered, fully non-blocking router.
Inbound packets at any of the six input ports can be routed to any output ports
simultaneously. The design uses vector-based routing. Each time a packet enters
a crossbar port, a 3-bit destination ID (DID) field at the head of the packet
determines the exit port. If more than one input contends for the same output
port, then arbitration occurs to determine the binding sequence of the input port
to an output. To support deadlock-free routing, the design implements four
logical lanes. Notice that each physical port is connected to four identical lanes
(0–3). These logical lanes, commonly referred to as “virtual channels”, allow the
physical ports to accommodate packets from four independent data streams
simultaneously.

Figure 2-6: Six-port four-lane router block diagram: each of the six input ports feeds per-port queue, route, and port-arbitration logic in the crossbar switch, followed by lane arbitration and output multiplexing across lanes L0–L3; port inputs and outputs are 6 × 76 bits.


Data flow through the crossbar switch is as follows: data input is first
buffered into the queues. Each queue is basically a first-in-first out (FIFO)
buffer. Data coming out of the queues is examined by routing and sequencing
logic to generate output binding requests based on packet header information.
With six ports and four lanes, the crossbar switch requires a 24-to-1 arbiter, implemented
in two stages: a 6-to-1 port arbiter followed by a 4-to-1 lane arbiter.
The first stage of arbitration within a particular lane essentially binds one input
port to an output port, for the entire duration of the packet. If other packets in
different lanes are contending for the same physical output resource, then a
second level of arbitration occurs. This level of arbitration occurs at a FLIT
level, a finer level of granularity than packet boundaries. It should be noted that
both levels of arbitration are accomplished using a balanced, non-weighted,
round-robin algorithm.
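
A balanced round-robin arbiter of the kind used at both levels can be sketched behaviorally as follows (a software model with an illustrative interface; the silicon implementation is, of course, a custom circuit).

```python
# Behavioral sketch of a balanced, non-weighted round-robin arbiter.
# `requests` is a list of booleans, one per requester; the arbiter grants
# the first active requester after the most recently granted one.

class RoundRobinArbiter:
    def __init__(self, num_requesters: int):
        self.n = num_requesters
        self.last = self.n - 1              # so requester 0 has priority first

    def grant(self, requests):
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                self.last = idx
                return idx
        return None                          # no active requests

arb = RoundRobinArbiter(6)
print(arb.grant([True, False, True, False, False, True]))   # grants 0
print(arb.grant([True, False, True, False, False, True]))   # grants 2 next
```

Because the priority pointer advances to the most recent grantee, every active requester is served within n grant opportunities, keeping the arbitration fair and non-weighted.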
Figure 2-7 shows the 6-stage data and control pipelines, including key signals
used in the router crossbar switch. The pipe stages use rising edge-triggered
sequential devices. Stages one, two, and three are used for the queue; stage four
for the route decoding and sequencing logic; stage five for data in the port
multiplexer and port arbitration; and stage six for data in the lane multiplexer and
lane arbitration. The shaded portion between stages four and five represents the
primary physical crossbar routing channel. Here, data and control buses are
connected between the six ports within a lane. The layout area of this routing

channel increases as a square function of the total number of ports and number
of bits per port.

Figure 2-7: Router data and control pipeline diagram: stages 1–3 (queue), stage 4 (route/sequencer), stage 5 (6:1 port multiplexer and port arbiter), and stage 6 (4:1 lane multiplexer and lane arbiter), with the 76-bit crossbar channel between stages 4 and 5.


The shaded portion between stages five and six represents a second physical
routing channel. In this case, data and control buses are connected between the
four corresponding lanes. The total die area of this routing channel increases
only linearly as a function of the total number of bits per port. These two main
routing channels run perpendicular to each other on the die. For high
performance and symmetric layout, this basic unit of logic in Figure 2-7 is
duplicated 24 times throughout the crossbar, six within each lane and across four
lanes.
2.2.3 Packet structure and routing protocol
Figure 2-8 shows the packet format used in the router. Each packet is
subdivided into FLITs. Each FLIT contains six control bits and 72 data bits. The
control bits are made up of packet header (H), valid (V) and tail (T) bits, two
encoded lane ID bits which indicate one of four lanes, and flow control (FC)
information. The flow control bits contain receiver queue status and help create
“back pressure” in the network to prevent queue overflows.
The crossbar uses vector based routing. Data fields in the “header” flit of a
packet are reserved for routing destination information, required for each hop. A
“hop” is defined as a flit entering and leaving a router. The router uses the 3-bit

destination ID [DID] field as part of the header to determine the exit port. The
least significant 3 bits of the header indicate the current destination ID and are
updated by shifting out the current DID after each hop. Each header flit supports
a maximum of 24 hops. The minimum packet size the protocol allows for is two
flits. This was chosen to simplify the sequencer control logic and allow back-to-
back packets to be passed through the router core without having to insert idle
flits between them. The router architecture places no restriction on the maximum
packet size.
Figure 2-8: Router packet and FLIT format: each FLIT carries six control bits (FC, lane ID L1/L0, valid V, tail T, header H) and 72 data bits (d71–d0); the header FLIT's data field holds up to twenty-four 3-bit destination IDs (DID).
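
The source-directed routing described above can be illustrated with a short sketch: the per-hop exit ports are packed three bits at a time into the header's data field, and each router consumes the least-significant three bits. The function and variable names are illustrative, not the design's signal names.

```python
# Illustrative model of the 3-bit-per-hop source routing carried in the
# header FLIT: up to 24 destination IDs (DIDs) packed into the 72-bit data
# field.  At each hop the router reads the 3 LSBs as its exit port and then
# shifts them out.

def pack_route(exit_ports):
    """Pack a list of per-hop exit ports (0-5) into a header DID field."""
    assert len(exit_ports) <= 24, "a header flit supports at most 24 hops"
    header = 0
    for port in reversed(exit_ports):   # first hop ends up in the 3 LSBs
        assert 0 <= port <= 5
        header = (header << 3) | port
    return header

def hop(header):
    """Return (exit_port, updated_header) for one hop through a router."""
    return header & 0b111, header >> 3

header = pack_route([4, 1, 5])          # a three-hop route
for _ in range(3):
    port, header = hop(header)
    print(f"exit port {port}")          # prints 4, then 1, then 5
```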

2.2.4 FIFO buffer and flow control


The queue acts as a FIFO buffer and is implemented as a large-signal, 1-read,
1-write register file. Each queue is 76 bits wide and 64 flits deep. There are four
separate queues [20]–[21] per physical port to support virtual channels. Reads
and writes to two different locations in the queue occur simultaneously in a
single clock cycle. The circuits for reading and writing are implemented in a
single-ended fashion. Read sensing is implemented using a two level local and
global bit-line approach. This approach helps to reduce overall bit-line loading
and speed up address decoding. In addition, data-dependent, self-timed
pre-charge circuits are also used to lower both clock and total power.
Flow control and buffer management between routers is debit-based using
almost-full bits (Figure 2-9), which the receiver FIFO signals via the flow
control (FC) bits when its buffers reach a specified threshold. This
mechanism is chosen over conventional credit and counter based flow control
schemes for improved reliability and recovery in the presence of signaling
errors. The protocol is designed to allow all four receiving queues across four
lanes to independently signal the sender when they are almost full regardless of
the data traffic at that port. The FIFO buffer almost-full threshold can be
programmed via software.

Figure 2-9: Flow control between routers: data flows from Router A to Router B, while Router B returns an almost-full indication.
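
A behavioral sketch of this debit-based scheme is given below; the queue depth follows the 64-flit queues described above, while the threshold value and the function names are illustrative.

```python
from collections import deque

# Behavioral sketch of almost-full flow control: the receiver FIFO asserts
# its FC (almost-full) bit once occupancy reaches a programmable threshold,
# and the sender holds flits while the bit is asserted (back pressure).

QUEUE_DEPTH = 64                  # flits per queue, as in the text
ALMOST_FULL_THRESHOLD = 60        # programmable; value here is illustrative
assert ALMOST_FULL_THRESHOLD < QUEUE_DEPTH

rx_queue = deque()

def almost_full() -> bool:
    """FC bit signalled by the receiver back to the sender."""
    return len(rx_queue) >= ALMOST_FULL_THRESHOLD

def sender_try_send(flit) -> bool:
    """Send one flit unless the receiver signals almost-full."""
    if almost_full():
        return False              # hold the flit at the sender
    rx_queue.append(flit)
    return True

def receiver_drain():
    """Receiver consumes one flit, eventually de-asserting almost-full."""
    return rx_queue.popleft() if rx_queue else None
```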

2.2.5 Router Design Challenges


This work was motivated to address the design challenges and issues faced
while implementing an earlier version of the router built in six-metal, 180nm
technology node (Leff = 150nm) and reported in [19]. Figure 2-10(a) shows the
internal organization of the six ports within each lane. We focused our attention
on the physical crossbar channel between pipe stages four and five where data
and control busses are connected between the six ports. The corresponding
reference layout for one (of four lanes) is given in Figure 2-10(b). The crossbar
channel interconnect structure used in the original full custom design is also
shown in Figure 2-10(c). Each wire in the channel is routed using metal 3
(vertical) and metal 4 (horizontal) layers with a 1.1µm pitch.

Figure 2-10: Crossbar router in [19]: (a) internal lane organization (six ports per lane), (b) corresponding layout, (c) channel interconnect structure (M2–M4 wires).
In each lane, 494 signals (456 data, 38 control) traverse the 3.3mm long wire-
dominated channel. More importantly, the crossbar channel consumed 53% of

core area and presented a significant physical design challenge. To meet GHz
operation, the entire core used hand-optimized data path macros. In addition, the
design used large bus drivers to forward data across the crossbar channel, sized
to meet the worst-case routing and timing requirements and replicated at each
port. Over 1800 such channel drivers were used across all four lanes in the
router core. When a significant number of these drivers switch simultaneously, it
places a substantial transient current demand on the power distribution system.
The magnitude of the simultaneous-switching noise spike can be easily
computed using the expression:
∆V = L · M · (∆I / ∆t)    (2.1)

where L is the effective inductance, M the number of drivers active
simultaneously and ∆I the current switched by each driver over the transition
interval ∆t. For example, with M = 200, a power grid with a parasitic
inductance of 1nH, and a current change of 2mA per driver with an overly
conservative edge rate of half a nanosecond, the ∆V noise is approximately 0.8
volts. This peak noise is significant in today’s deep sub-micron CMOS circuits
and causes severe power supply fluctuations and possible logic malfunction. It is
important to note that the amount of peak switching noise, while data activity
dependent, is independent of router traffic patterns and the physical spatial
distance of an outbound port from an input port. Several techniques have been
proposed to reduce this switching noise [17]–[18], but the problem has not been
addressed in the context of packet-switched crossbar routers.
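
The quoted figure can be checked directly by evaluating Equation (2.1) with the stated example values:

```python
# Peak simultaneous-switching noise from Equation (2.1):
#   dV = L * M * dI/dt, using the worked example from the text.

L_grid = 1e-9    # effective parasitic inductance of the power grid (1 nH)
M      = 200     # number of drivers switching simultaneously
dI     = 2e-3    # current switched per driver (2 mA)
dt     = 0.5e-9  # transition interval (0.5 ns)

dV = L_grid * M * dI / dt
print(f"peak supply noise = {dV:.1f} V")   # -> 0.8 V
```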
In this work, we present two specific design enhancements to the crossbar
channel and a more detailed description of the techniques described in Paper 1.
To mitigate crossbar channel routing area, we propose double-pumping the
channel data. To reduce peak power, we propose a location based channel driver
(LBD) which adjusts driver strength as a function of current packet destination.
Reducing peak currents has the benefit of reduced demand on the power grid
and decoupling capacitance area, which in turn results in lower gate leakage.
2.2.6 Double-pumped Crossbar Channel
Now we describe design details of the double pumped crossbar. The crossbar
core is a fully synchronous design. A full clock cycle is allocated for
communication between stages 4 and 5 of the core pipeline. As shown in Figure
2-11, one clock phase is allocated for data propagation across the channel. At
pipe-stage 4, the 76-bit data bus is double-pumped by interleaving alternate data
bits using dual-edge-triggered flip-flops. The master latch M0 for data input (i0)
is combined with the slave latch S1 for data input (i1) and so on. The slave latch
S0 is moved across the crossbar channel to the receiver. A 2:1 multiplexer,

enabled by the clock, is used to select between the latched output data on each
clock phase.
Figure 2-11: Double-pumped crossbar channel: pipe-stage 4 master/slave latches (M0, S1) and a clock-enabled 2:1 multiplexer drive the 228-wire, 3.3 mm crossbar channel; at the receiver, slave latch S0 and a flip-flop recover the data for pipe stage 5 within the channel delay budget (Td < 500 ps).

This double pumping effectively cuts the number of channel data wires in
each lane by 50% from 456 to 228 and in all four logical lanes from 1824 to 912
nets. There is additional area savings due to a 50% reduction in the number of
channel data drivers. To avoid timing impact, the 38 control (request + grant)
wires in the channel are left untouched.
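
Functionally, double-pumping is a 2:1 time multiplexing of even and odd data bits onto half as many wires, one group per clock phase. The sketch below models only that behavior (the data values are arbitrary, and the latch-level and duty-cycle details discussed next are ignored).

```python
# Behavioral sketch of double-pumping: 2N data bits are sent over N wires
# per cycle by transferring even-indexed bits on one clock phase and
# odd-indexed bits on the other, then re-interleaving at the receiver.

def transmit_cycle(bits):
    """Split one cycle's data into (phase0, phase1) wire values."""
    assert len(bits) % 2 == 0
    return bits[0::2], bits[1::2]           # even bits, then odd bits

def receive_cycle(phase0, phase1):
    """Reassemble the original bit order from the two phases."""
    bits = []
    for even, odd in zip(phase0, phase1):
        bits.extend((even, odd))
    return bits

data = [1, 0, 1, 1, 0, 0, 1, 0]             # 8 bits carried on 4 wires
phase0, phase1 = transmit_cycle(data)
assert receive_cycle(phase0, phase1) == data
```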
2.2.7 Channel operation and timing
The timing diagram in Figure 2-12 summarizes the communication of data
bits D[0-3] across the double-pumped channel. The signal values in the diagram
match the corresponding node names in the schematic in Figure 2-11. To
transmit, the diagram shows, in cycles T2 and T3, master latch M0 latching input
data bits D[0] and D[2], while the slave latch S1 retains bits D[1] and D[3]. The
2:1 multiplexer, enabled by the clock, selects between the latched output
data on each clock phase, thus effectively double-pumping the data. A large
driver is used to forward the data across the crossbar channel.
At the receiving end of the crossbar channel, the delayed double-pumped data
is first latched by slave latch (S0), which retains bits D[0] & D[2] prior to
capture by pipe stage 5. Data bits D[1] and D[3] are easily extracted using a
rising edge flip-flop. Note that the overall data delay from Stage 4 to Stage 5 of
the pipeline is still one cycle, and there is no additional latency impact due to

double pumping. This technique, however, imposes a close-to-50% duty-cycle
requirement on the clock generation and distribution.

Figure 2-12: Crossbar channel timing diagram.

2.2.8 Location-based Channel Driver


To reduce peak power, the crossbar channel uses location-based drivers as a
drop-in replacement for conventional bus drivers. Figure 2-13 shows the LBD
driver and the associated control logic highlighted. A 3-bit unique hard-wired
location ID (LID) is assigned to provide information about the spatial location of
the driver in the floor-plan, at a port level granularity. By comparing LID bits
with the DID bits in the header flit, an indication of the receiver distance from
the driver is obtained. A 3-bit subtractor accomplishes this task. The subtraction
result is encoded to control the location based drivers. To minimize delay
penalty to the crossbar channel data, the encoder output control signals (en[2:0])
are re-timed to pipe stage 4.
The LBD schematic in Figure 2-14(a) shows a legged driver with individual
enable controls for each binary-weighted leg. The LBD encodings are
carefully chosen to allow a full range of driver sizes and communication
distances, without exceeding the path data delay budget.

Figure 2-13: Location-based channel driver and control logic: the hard-wired LID[2:0] and the packet DID[2:0] feed a 3-bit subtractor and encoder whose outputs en[2:0], retimed at pipe stage 4, control the LBD driving the 228-wire, 3.3 mm crossbar channel.

LID[2:0] - DID[2:0]   en[2:0]
000                   001
001                   010
010                   011
011                   101
100                   110
101                   111

Figure 2-14: (a) LBD schematic. (b) LBD encoding summary.


LBD encoding is summarized in Figure 2-14(b), and is chosen in such a way
that the smallest leg of the driver is turned on when a packet is headed to its own
port, a feature called “loopback”. A larger leg is on if the packet is destined to
the adjacent ports. All legs are enabled only for the farthest routable distances in

the channel, specifically ports 0 → 5 or 5 → 0. Hard-wiring of the LID bits


enables a modular design and layout that is reusable at each port. It is important
to note that the LBD enable bits change only at packet boundaries and not at flit
boundaries. The loopback capability, where a packet can be routed from a node
to itself is supported to assist with diagnostics and driver development.
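
Behaviorally, the LBD control reduces to a distance computation followed by a lookup of the encoding in Figure 2-14(b). The sketch below models it that way, taking the magnitude of the LID - DID difference as the port distance; how the hardware handles the sign of the subtraction is not detailed here, so that part is an assumption.

```python
# Behavioral sketch of the location-based driver (LBD) control: the 3-bit
# difference between the hard-wired location ID (LID) and the packet's
# destination ID (DID) selects the enable code en[2:0] of Figure 2-14(b).

EN_CODE = {              # port distance -> en[2:0], from Figure 2-14(b)
    0: 0b001,            # loopback: only the smallest leg enabled
    1: 0b010,
    2: 0b011,
    3: 0b101,
    4: 0b110,
    5: 0b111,            # farthest routable port: all legs enabled
}

def lbd_enables(lid: int, did: int) -> int:
    """Enable code for a driver at port `lid` sending to destination `did`."""
    distance = abs(lid - did)        # magnitude of the 3-bit subtraction
    return EN_CODE[distance]

print(bin(lbd_enables(0, 5)))        # ports 0 -> 5: 0b111
print(bin(lbd_enables(3, 3)))        # loopback:     0b001
```

The enables change only at packet boundaries, matching the behavior described above.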
2.2.9 Chip Results
A 6-port four-lane router core utilizing the proposed double-pumped crossbar
channel with LBD has been designed in a 150nm six-metal CMOS process. All
parasitic capacitance and resistance data from layout was extracted and included
in the circuit simulations. The results are compared to the reference work [19] in
terms of area, performance, and power, and are summarized in Figure 2-15. It is
important to note that we compare the designs on the same process generation.
Figure 2-15: Double-pumped crossbar channel results compared to the work reported in [19]: channel width (µm), chip area (mm²), clock load, average channel power (mW), 2:1 multiplexer delay penalty (ps), and total channel delay (ps).
Double-pumping the crossbar channel reduces all four channel widths from
560µm to 310µm, and full-chip area from 15.8 mm2 to 12.2mm2, enabling a
45% reduction in channel area and 23% overall chip area, with no latency
penalty over the original design. The smaller channel enables an 8.3% reduction

in worst-case interconnect length and a 7.7% improvement in total signal


propagation delay due to reduced capacitance. This also results in a 7.2%
improvement in average channel power, independent of traffic patterns. The
technique, however, increases the clock load by 4%, primarily because of the 2:1
multiplexer. Note that this is reported as a percentage of the clock load seen in
Stages 4 and 5 of the router pipeline. This penalty relative to the clock load of the
entire core is less than 1%. In addition, the 2:1 multiplexer increases signal
propagation delay by 6%, which could be successfully traded for the 7.7%
improvement in channel delay due to the reduced double-pumped interconnect
length.
Figure 2-16 plots the simulated logic delay, the wire delay and the total
crossbar channel signal propagation delay as a function of distance of the
receiver from the driver. The logic delay includes a fixed logic delay
component, and a varying location-based driver delay. A port distance of “6”
indicates the farthest routable port and “1” indicates packet loop-back on the
same port. Notice that the total crossbar signal delay remains roughly constant
over the range of port-distances. This is because the location based driver
dynamically adjusts driver logic delay and wire delay to the specific destination
port, without exceeding the combined delay budget of 500ps.
Figure 2-16: Simulated crossbar channel propagation delay (ns) as a function of port distance, showing logic delay, wire delay, and their sum against the 500 ps delay budget.
Figure 2-17 shows the improvements in peak channel driver currents with
LBD usage for various router traffic patterns. All 1824 channel drivers across all

four lanes were replaced by location based drivers. The graph plots the peak
current per driver (in mA) for varying port distance and shows the current
change from 5.9mA for the worst case port distance of 6 down to 1.55mA for
the best case port distance of 1. The LBD technique achieves up to 3.8X
reduction in peak channel currents, depending on the packet destination. As a
percentage of the logic in pipe stage 4, the LBD control layout overhead is 3%.
Figure 2-17: LBD peak current per driver (mA) as a function of port distance, falling from 5.9 mA at a port distance of 6 to 1.55 mA at a port distance of 1, a 3.8X reduction.

Router area: 3.7 mm × 3.3 mm
Technology: 150 nm CMOS
Transistors: 1.92 million
Interconnect: 1 poly, 6 metal layers
Core frequency: 1 GHz @ 1.2 V
Channel area savings: 45.4%
Core area savings: 23%

Figure 2-18: Router layout (lanes L0–L3, ports 0–5) and chip characteristics.


The router core layout and a summary of chip characteristics are shown in
Figure 2-18. In a 150nm six-metal process, the 12.2mm2 full custom core
contains 1.92 million transistors and operates at 1GHz at 1.2V. The tiled nature
of the physical implementation is easily visible. The six ports within each lane
set run vertically, and the four lane sets are placed side by side. Note the
placement of the four lanes, which are mirrored to reduce the number of routing

combinations across lanes. Also note the locations of the inbound queues in the
crossbar and their area relative to the routing channel.
Decoupling capacitors fill the space below the channel and occupy
approximately 30% of the die area.

2.3 Summary and future work


This work has demonstrated the area and peak-energy reduction benefits of a
six-port four-lane 57 GB/s communications router core. Design enhancements
include four double-pumped crossbar channels and destination-aware crossbar
channel drivers that dynamically configure based on the current packet
destination. Combined application of both techniques enables a 45% reduction
in channel area, 23% overall core area, up to 3.8X reduction in peak crossbar
channel power, and 7.2% improvement in average channel power, with no
performance penalty, when compared to an identical reference design on the
same process. The crossbar area and peak power reduction also allows for larger
crossbar networks to be integrated on a single die. This router design, when
configured appropriately in a multiple-node system, yields a total system
bandwidth in excess of one Terabyte/sec.
As future work, we propose the application of low-swing signaling techniques
[22] to the router links and internal crossbar to significantly reduce energy
consumption. In addition, encoding techniques can also be applied to the
crossbar channel data buses to further mitigate peak power based on [23]–[24].

2.4 References
[1]. L. Benini and G. D. Micheli, “Networks on Chips: A New SoC Paradigm,”
IEEE Computer, vol. 35, no. 1, pp. 70–78, 2002.

[2]. M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald,


H. Hoffmann, P. Johnson, J. W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski,
N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal,
“The Raw Microprocessor: A Computational Fabric for Software Circuits
and General Purpose Programs,” IEEE Micro, vol.22, no.2, pp. 25–35,
2002.

[3]. K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S.W.


Keckler, and C. R. Moore, “Exploiting ILP, TLP, and DLP with The
Polymorphous TRIPS Architecture,” in Proceedings of the 30th Annual
International Symposium on Computer Architecture, pp. 422–433, 2003.

[4]. H. Wang, L. Peh, S. Malik, “Power-driven design of router


microarchitectures in on-chip networks,” MICRO-36, Proceedings of 36th
Annual IEEE/ACM International Symposium on Microarchitecture, pp.
105-116, 2003.

[5]. J. Duato, S. Yalamanchili and L. Ni, "Interconnection Networks: An


Engineering Approach, 1st edition", IEEE Computer Society Press, Los
Alamitos, CA, 1997.

[6]. Craig B. Stunkel, “The SP2 High-Performance Switch”, IBM System


Journal, vol. 34, no. 2, pp. 185-204, 1995.

[7]. W. D. Weber et al., "The Mercury Interconnect Architecture: A Cost-


effective Infrastructure for High-Performance Servers", In Proceedings of
the Twenty-Fourth International Symposium on Computer Architecture,
Denver, Colorado, pp. 98-107, June 1997.

[8]. William J. Dally and Charles L. Seitz, “The Torus Routing Chip”,
Distributed Computing, Vol. 1, No. 3, pp. 187-196, 1986.

[9]. J. Carbonaro and F. Verhoorn, “Cavallino: The Teraflops Router and


NIC,” Hot Interconnects IV Symposium Record. Aug. 1996, pp. 157-160.

[10]. R. Nair, N.Y. Borkar, C.S. Browning, G.E. Dermer, V. Erraguntla, V.


Govindarajulu, A. Pangal, J.D. Prijic, L. Rankin, E. Seligman, S. Vangal,
and H.A. Wilson, “A 28.5 GB/s CMOS non-blocking router for terabit/s
connectivity between multiple processors and peripheral I/O nodes”,
ISSCC Digest of Technical Papers, Feb. 2001, pp. 224-225.

[11]. W.J. Dally, “Virtual Channel Flow Control,” IEEE Trans. Parallel and
Distributed Systems, vol. 3, no. 2, Feb. 1992, pp. 194–205.

[12]. J.M. Orduna and J. Duato, "A high performance router architecture for
multimedia applications", Proc. Fifth International Conference on
Massively Parallel Processing, June 1998, pp. 142 – 149.

[13]. N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick and M. Horowitz,


"Tiny Tera: a packet switch core", IEEE Micro, Volume 17, Issue 1, Jan. -
Feb. 1997, pp. 26 – 33.

[14]. M. Galles, "Spider: a high-speed network interconnect", IEEE Micro,


Volume 17, Issue 1, Jan. - Feb. 1997, pp. 34 – 39.

[15]. Y. Tamir, H. C. Chi, "Symmetric crossbar arbiters for VLSI


communication switches", IEEE Transactions on Parallel and Distributed
Systems, Volume 4, Issue 1, Jan. 1993, pp. 13 – 27.

[16]. R. Naik and D.M.H. Walker, “Large integrated crossbar switch”, in Proc.
Seventh Annual IEEE International Conference on Wafer Scale
Integration, Jan. 1995, pp. 217 – 227.

[17]. J. Ghosh and A. Varma, "Reduction of simultaneous-switching noise in


large crossbar networks", IEEE Transactions on Circuits and Systems,
Volume 38, Issue 1, Jan. 1991, pp. 86 – 99.

[18]. K.T. Tang and E.G. Friedman, "On-chip ∆I noise in the power
distribution networks of high speed CMOS integrated circuits", Proc. 13th
Annual IEEE International ASIC/SOC Conference, Sept. 2000, pp. 53 –
57.

[19]. H. Wilson and M. Haycock, “A six-port 30-GB/s non-blocking router


component using point-to-point simultaneous bidirectional signaling for
high-bandwidth interconnects”, IEEE JSSC, vol. 36, Dec. 2001, pp. 1954 –
1963.

[20]. Y. Tamir and G. Frazier, “High Performance Multiqueue Buffers for


VLSI Communication Switches,” Proc. 15th Annual Symposium on
Computer Architecture, IEEE Computer Society Press, Los Alamitos,
Calif., 1988, pp. 343 – 354.

[21]. B. Prabhakar, N. McKeown and R. Ahuja, “Multicast Scheduling for


Input-Queued Switches,” IEEE J. Selected Areas in Communications, Vol.
15, No. 5, Jun. 1997, pp. 855-866.

[22]. R. Ho, K. W. Mai, and M. A. Horowitz, “The Future of Wires,” in


Proceedings of the IEEE, pp. 490–504, 2001.

[23]. M. R. Stan and W.P. Burleson, “Low-power encodings for global


communication in CMOS VLSI”, IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, Volume: 5, Issue: 4, Dec. 1997, pp. 444 – 455.

[24]. H. Kaul, D. Sylvester, M. Anders, R. Krishnamurthy, “Spatial encoding


circuit techniques for peak power reduction of on-chip high-performance
buses”, Proceedings of ISLPED '04, Aug. 2004, pp. 194 – 199.
Chapter 3

Floating-point Units

Scientific and engineering applications require high performance floating-point


units (FPUs). The advent of multimedia applications, such as 3D graphics and
signal processing, places stronger demands for self-contained, low-latency FPUs
with increased throughput. This chapter presents basic concepts involved in
floating-point arithmetic, reviews conventional floating-point units (FPU), and
describes the challenges in accomplishing single-cycle accumulation on today’s
FPUs. A new pipelined single-precision FPMAC design, capable of single-cycle
accumulation, is described in Paper 2 and Paper 3. Special flip-flop circuits
used in the FPMAC design and test circuits are also presented.

3.1 Introduction to floating-point arithmetic


The design of FPUs is considered more difficult than most other arithmetic
units due to the relatively large number of sequentially dependent steps required
for a single FP operation and the extra circuits to deal with special cases such as
infinity arithmetic, zeros and NaNs (Not a Number), as demanded by the IEEE-
754 standard [3]. As a result, there is opportunity for exploring algorithms, logic
and circuit techniques that enable faster FPU implementations. An excellent
introduction to FP arithmetic can be found in Omondi [1] and Goldberg [2],


which describe logic principles behind FP addition, multiplication and division.


Since the mid 1980s, almost all commercial FPU designs have followed the
IEEE-754 standard [3], which specifies two basic floating-point formats: single
and double. The IEEE single format consists of three fields: a 23-bit fraction, f;
an 8-bit biased exponent, e; and a 1-bit sign, s. These fields are stored
contiguously in one 32-bit word, as shown in Figure 3-1.

Figure 3-1: IEEE single precision format. The exponent is biased by 127.
If the exponent is not 0 (normalized form), mantissa = 1.fraction. If the exponent
is 0 (denormalized form), mantissa = 0.fraction. The IEEE double format has a
precision of 53 bits and is 64 bits overall.
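
As a concrete illustration of this field layout (standard IEEE-754 behavior, not specific to this thesis), a 32-bit single-precision value can be decomposed as follows.

```python
import struct

# Decode a single-precision float into its IEEE-754 fields:
# 1 sign bit, 8 biased-exponent bits, 23 fraction bits (bias = 127).

def decode_single(x: float):
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign     = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent != 0:                    # normalized: mantissa = 1.fraction
        mantissa = 1 + fraction / 2**23
    else:                                # denormalized: mantissa = 0.fraction
        mantissa = fraction / 2**23
    return sign, exponent, mantissa

s, e, m = decode_single(-6.5)
print(s, e - 127, m)                     # sign = 1, exponent = 2, mantissa = 1.625
```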
3.1.1 Challenges with floating-point addition and accumulation
FP addition is based on the sequence of mantissa operations: swap, shift, add,
normalize and round. A typical floating-point adder (Figure 3-2) first compares
the exponents of the two input operands, then swaps and shifts the mantissa of
the smaller number to align the operands. The sign must be adjusted if either of
the incoming numbers is negative. The two mantissas are then added, with the
result requiring another sign adjustment if it is negative. Finally, the adder
renormalizes the sum, adjusts the exponent accordingly, and truncates the
resulting mantissa using an appropriate rounding scheme [4]. For extra speed,
FP adders use leading-zero anticipatory (LZA) logic to carry out predecoding for
normalization shifts in parallel with the mantissa addition. Clearly, single-cycle
FP addition performance is impeded by the speed of the critical blocks:
comparator, variable shifter, carry-propagate adder, normalization logic and
rounding hardware. To accumulate a series of FP numbers (e.g., ΣAi), the
current sum is looped back as an input operand for the next addition operation,
as shown by the dotted path in Figure 3-2.

Over the past decade, a number of high-speed FPU designs have been presented
[5]–[7]. Floating-point code is dominated by multiplication and addition
operations, justifying the need for fused multiply-add hardware. Probably the
most common use of FPUs is performing matrix operations, and the most
frequent matrix operation is matrix multiplication, which boils down to
computing inner products of two vectors:
X = Σ_{i=0}^{n-1} A_i × B_i        (3.1)

Figure 3-2: Single-cycle FP adder with critical blocks highlighted.
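As a behavioral reference for these steps, the mantissa path of Figure 3-2 can be modeled as in the C sketch below (illustrative only, not the hardware design: it handles only positive, normalized inputs and truncates instead of rounding):

    #include <stdint.h>
    #include <string.h>

    static float fp_add_sketch(float x, float y)
    {
        uint32_t a, b;
        memcpy(&a, &x, 4);
        memcpy(&b, &y, 4);

        /* compare exponents and swap so that 'a' holds the larger operand */
        if (((b >> 23) & 0xFF) > ((a >> 23) & 0xFF)) { uint32_t t = a; a = b; b = t; }

        uint32_t ea = (a >> 23) & 0xFF, eb = (b >> 23) & 0xFF;
        uint64_t ma = (a & 0x7FFFFF) | (1u << 23);      /* implicit leading 1 */
        uint64_t mb = (b & 0x7FFFFF) | (1u << 23);

        uint32_t d = ea - eb;                  /* shift: align smaller mantissa */
        mb = (d < 64) ? (mb >> d) : 0;

        uint64_t m = ma + mb;                  /* add the aligned mantissas     */

        while (m >> 24) { m >>= 1; ea++; }     /* normalize (at most one step)  */

        uint32_t r = (ea << 23) | (uint32_t)(m & 0x7FFFFF);  /* truncate, i.e.  */
        float out;                                           /* round toward 0  */
        memcpy(&out, &r, 4);
        return out;
    }

Every step above (compare, swap, shift, add, normalize, round) lies on the critical path, which is what makes a single-cycle accumulate loop difficult to build directly from a conventional adder.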


Computing the dot product requires a series of multiply-add operations.
Motivated by this, the IBM RS/6000 first introduced a single instruction that
computes the fused multiply-add (FPMADD) operation.
X_i = (A × B) + C        (3.2)
Although this requires a three-operand read in a single instruction, it has the
potential to improve the performance of computing inner products. In such
cases, one option is to design the FP hardware to accept up to three operands for
executing MADD instructions, as shown in Figure 3-3(a), while other FP
instructions requiring fewer than three operands utilize the same hardware by
forcing constants into the unused operands. For example, to execute the
floating-point add instruction T = A + C, the B operand is forced to the constant
1.0. Similarly, a multiply operation requires forcing operand C to zero.
Examples of fused multiply-add FPUs in this category include the CELL
processing element and are described in [8]–[10]. A second option is to
optimize the FP hardware for dot-product accumulations by accepting two
operands, with an implicit third operand (Figure 3-3b). We define a fused
multiply-accumulate (FPMAC) instruction as:
X_i = (A × B) + X_{i-1}        (3.3)
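Both options can be pictured with the small C sketch below (illustrative only; fmadd_hw merely stands in for the single fused datapath of Figure 3-3):

    #include <math.h>

    /* One fused multiply-add datapath serving all FP instructions
       (fmaf rounds once, as a fused hardware unit would).          */
    static float fmadd_hw(float a, float b, float c) { return fmaf(a, b, c); }

    /* Other instructions force constants into the unused operands. */
    static float fp_add(float a, float c) { return fmadd_hw(a, 1.0f, c); }  /* B = 1.0 */
    static float fp_mul(float a, float b) { return fmadd_hw(a, b, 0.0f); }  /* C = 0   */

    /* The FPMAC form of Figure 3-3(b): two explicit operands, with the
       previous result X(i-1) supplied as the implicit third operand.   */
    static float fpmac(float a, float b, float x_prev) { return fmadd_hw(a, b, x_prev); }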

Figure 3-3: FPUs optimized for (a) FPMADD (b) FPMAC instruction.
An important figure of merit for FPUs is the initiation (or repeat) interval [2],
which is the number of cycles that must elapse between issuing two operations
of a given type. An ideal FPU design would have an initiation interval of one.
3.1.2 Scheduling issues with pipelined FPUs
To enable GHz operation, all state-of-the-art FPUs are pipelined, resulting in
multi-cycle latencies on most FP instructions. An example of a five-stage
pipelined FP adder [11] is shown in Figure 3-4(a) and a six-stage multiply-add
FPU [10] is given in Figure 3-4(b). An important observation is that neither of
these implementations is optimal for accomplishing a steady stream of dot-product
accumulations, because of pipeline data hazards. For example, the six-stage FPU
in Figure 3-4(b) has an initiation interval of 6 for FPMAC instructions, i.e., once
an FPMAC instruction is issued, a second one can only be issued a minimum of
six cycles later due to a read-after-write (RAW) hazard [2]. In this case, a new
accumulation can only start after the current instruction is complete, thus
increasing overall latency. Schedulers of multi-cycle FPUs detect such hazards
and place code scheduling restrictions between consecutive FPMAC instructions,
resulting in throughput loss. The goal of this work is to eliminate such
limitations by demonstrating single-cycle accumulate operation and a sustained
FPMAC (Xi) result every cycle at GHz frequencies.
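The loop-carried dependence behind this hazard is easy to see in the scalar form of a dot product (plain C, for illustration only):

    /* Each iteration's accumulate consumes the previous iteration's result,
       so with a 6-cycle FPMAC pipeline a new accumulation can start only
       every 6 cycles unless the accumulate loop itself is single-cycle.    */
    static float dot(const float *a, const float *b, int n)
    {
        float acc = 0.0f;
        for (int i = 0; i < n; i++)
            acc += a[i] * b[i];    /* acc(i) depends on acc(i-1): RAW */
        return acc;
    }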
Several solutions have been proposed to minimize data hazards and stalls due
to pipelining. Result forwarding (or bypassing) is a popular technique to reduce
the effective pipeline latency. Notice that the result bus is forwarded as a
possible input operand by the CELL processor FPU in Figure 3-4 (b).

Figure 3-4: (a) Five-stage FP adder [11] (b) Six-stage CELL FPU [10].
Multithreading has emerged as one of the most promising options to hide multi-
cycle FPU latency by exploiting thread-level parallelism (TLP). Multithreaded
systems [12] provide support for multiple contexts and fast context switching
within the processor pipeline. This allows multiple threads to share the same
FPU resources by dynamically switching between threads. Such a processor has
the ability to switch between the threads, not only to hide memory latency but
also to avoid stalls resulting from data hazards. As a result, multithreaded
systems tend to achieve better processor utilization and improved performance.
These benefits come with the disadvantage that multithreading can add
significant complexity and area overhead to the architecture, thus increasing
design time and cost.

3.2 Single-cycle Accumulation Algorithm


In an effort to achieve fast single-cycle accumulate operation, we first
analyzed each of the critical operations involved (Figure 3-2) in conventional
FPUs, with the intent of eliminating, reducing and/or deferring the logic
operations inside the accumulate loop. The proposed FPMAC algorithm is
given in Figure 3-5, and employs the following optimizations:

(1) Swap and Shift: To minimize the interaction between the incoming
operand and the accumulated result, we choose to self-align the incoming
number every cycle rather than the conventional method of aligning it to the
accumulator result. The incoming number is converted to base 32 by shifting the
mantissa left by an amount given by the five least significant exponent bits
(Exp[4:0]), thus extending the mantissa width from 24 bits to 55 bits. This
approach has the benefit of removing the need for a variable mantissa alignment
shifter inside the accumulation loop. The reduction in exponent from 8 to 3 bits
(Exp[7:5]) expedites exponent comparison.
(2) Addition: Accumulation is performed in a base-32 system. The accumulator
retains the multiplier output in carry-save format and uses an array of 4-2 carry-
save adders to “accumulate” the result in an intermediate format [13], and delay
addition until the end of a repeated calculation such as accumulation or dot
product. This idea is frequently used in multipliers, built using Wallace trees
[14], and removes the need for an expensive carry propagate adder in the critical
path. Expensive variable shifters in the accumulate loop are replaced with
constant shifters, which are easily implemented using a simple multiplexer circuit.
(Figure 3-5 dataflow: a pipelined Wallace-tree multiplier computes MA × MB and
EA + EB − 127; the operand is then self-aligned by shifting the mantissa left by
the value of the five low exponent bits, taking the two's complement if negative,
and zeroing the five exponent LSBs; the single-cycle accumulate stage compares
the 3-bit base-32 exponents, adds the 55-bit mantissas when they are equal,
otherwise keeps the value with the larger exponent, and on overflow shifts the
result right by 32 and adds 32 to the exponent; pipelined final addition,
normalization and rounding produce EX and MX.)

Figure 3-5: Single-cycle FPMAC algorithm.


(3) Normalize and round: The costly normalization step is moved outside
the accumulate loop, where the accumulation result in carry-save is added, the
sum normalized and converted back to base 2, with necessary rounding.
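The C sketch below models the algorithm behaviorally under simplifying assumptions (positive operands only, a plain 64-bit sum in place of the carry-save form, and no special-case handling); it is meant only to illustrate the self-alignment and the base-32 accumulate rule, not the circuit:

    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint32_t exp32;   /* 3-bit base-32 exponent: Exp[7:5]   */
        uint64_t mant;    /* up to 55-bit self-aligned mantissa */
    } aligned_t;

    /* (1) Self-align: shift the 24-bit mantissa left by Exp[4:0] and
       keep only the upper three exponent bits.                        */
    static aligned_t self_align(float x)
    {
        uint32_t w;
        memcpy(&w, &x, sizeof w);

        uint32_t exp  = (w >> 23) & 0xFF;
        uint64_t mant = (w & 0x7FFFFF) | (1u << 23);   /* 1.fraction */

        aligned_t a;
        a.mant  = mant << (exp & 0x1F);   /* 24 bits -> up to 55 bits */
        a.exp32 = exp >> 5;               /* 8-bit exponent -> 3 bits */
        return a;
    }

    /* (2) One accumulate step: add mantissas when the base-32 exponents
       match, otherwise keep the operand with the larger exponent; on
       overflow, shift the sum right by 32 and bump the exponent.        */
    static aligned_t accumulate(aligned_t acc, aligned_t in)
    {
        if (in.exp32 > acc.exp32) {
            acc = in;
        } else if (in.exp32 == acc.exp32) {
            acc.mant += in.mant;
            if (acc.mant >> 55) {          /* carried past 55 bits */
                acc.mant >>= 32;
                acc.exp32 += 1;
            }
        }
        return acc;
    }

Step (3), the final carry-propagate addition, normalization, rounding and conversion back to base 2, is performed once, outside this loop.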

With a single-cycle accumulate loop, we have successfully eliminated RAW
data hazards due to accumulation and enabled scheduling of FPMAC
instructions every cycle. Note that a RAW data hazard can still occur between
consecutive FPMAC instructions if either input operand to the multiplier (A or
B) of the current FPMAC instruction depends on the previous accumulated
result. In such cases, the issue of the second FPMAC instruction must be
delayed by the full FPMAC pipeline depth (in cycles). A better option is to have
the compiler re-order the instruction stream around such dependent operations to
improve throughput. With this exception, the proposed design allows FPMAC
instructions to issue with an initiation interval of 1, significantly increasing
throughput. A detailed description of the work is presented in Paper 2 and
Paper 3.

3.3 CMOS prototype


A test chip using the proposed FPMAC architecture has been designed and
fabricated in a 90nm communication technology [15]. The 2mm2 full-custom
design contains 230K transistors, with the FPMAC core accounting for 151K
(67%) of the device count, and 0.88mm2 (44%) of total layout area.
3.3.1 High-performance Flip-Flop Circuits
To enable fast performance, the FPMAC core uses implicit-pulsed semi-
dynamic flip-flops [16]–[17], with fast clock-to-Q delay and high skew
tolerance. When compared to a conventional static master-slave flip-flop,
semi-dynamic flip-flops provide both shorter latency and the capability of
incorporating logic functions with minimum delay penalty, properties which
make them very attractive for high-performance digital designs. Critical pipe
stages such as the FPMAC accumulator registers (Figure 3-5) are built using
inverting rising edge-triggered semi-dynamic flip-flops with synchronous reset.
The flip-flop (Figure 3-6) has a dynamic master stage coupled to a pseudo-static
slave stage. As shown in the schematic, the flip-flops are implicitly pulsed,
with several advantages over non-pulsed designs. One main benefit is that they
allow time borrowing across cycle boundaries, since data can arrive coincident
with, or even after, the clock edge; this negative setup time can be exploited in
the logic. Another benefit of the negative setup time is that the flip-flop becomes
less sensitive to clock jitter when the data arrives after the clock. Pulsed
flip-flops thus offer better clock-to-output delay and clock skew tolerance than
conventional master-slave flip-flops. However, they also have important
disadvantages: the worst-case hold time can exceed the clock-to-output delay
because of pulse-width variations across process, voltage, and temperature
conditions. A selectable pulse delay option is available, as shown in Figure 3-6,
to avoid min-delay failures due to pulse-width variations over process corners.

Figure 3-6: Semi-dynamic resettable flip-flop with selectable pulse width.

3.3.2 Test Circuits


To aid with testing the FPMAC core, the design includes three 32-bit wide,
32-deep first-in first-out (FIFO) buffers, operating at core speed (Figure 3-7).

Figure 3-7: Block diagram of FPMAC core and test circuits.

FIFOs A and B provide the input operands to the FPMAC and FIFO C
captures the results. A 67-bit scan chain feeds the data and control words.
Output results are scanned out using a 32-bit scan chain. A control block
manages operations of all three FIFOs and scan logic on chip.
Each FIFO is built using a register file (RF) unit that is 32-entry by 32b, with
single read and write ports (Figure 3-8). The design is implemented as a
large-signal memory array [18]. A static design was chosen to reduce power and
provide adequate robustness in the presence of large amounts of leakage. The
RF design is organized as four identical 8-entry, 32b banks. For a fast, single-
cycle read operation, all four banks are simultaneously accessed and multiplexed
to obtain the desired data. A 10-transistor, leakage-tolerant, dual-VT optimized
RF cell with 1-read/1-write ports is used. Reads and writes to two different
locations in the RF occur simultaneously in a single clock cycle. To reduce the
routing and area cost, the circuits for reading and writing registers are
implemented in a single-ended fashion. Local bit lines are segmented to reduce
bit-line capacitive loading and leakage. As a result, address decoding time, read
access time, and robustness all improve. The RF read and write paths are dual-VT
optimized for best performance with minimum leakage. The RF RAM latch and
access devices in the write path are made high-VT to reduce leakage power.
Low-VT devices are used everywhere else, improving the critical read delay by
21% over a fully high-VT design.


Figure 3-8: 32-entry x 32b dual-VT optimized register file.



3.4 References
[1]. A. Omondi. Computer Arithmetic Systems, Algorithms, Architecture and
Implementations. Prentice Hall, 1994.

[2]. D.A. Patterson, J.L. Hennessy, and D. Goldberg, “Computer Architecture,


A Quantitative Approach”, Appendix A and H, 3rd edition, Morgan
Kaufmann, May 2002.

[3]. IEEE Standards Board, “IEEE Standard for Binary Floating-Point


Arithmetic,” Technical Report ANSI/IEEE Std. 754-1985, IEEE, New
York, 1985.

[4]. W.-C. Park, S.-W. Lee, O.-Y. Kown, T.-D. Han, and S.-D. Kim, “Floating
point adder/subtractor performing IEEE rounding and addition/subtraction
in parallel,” IEICE Transactions on Information and Systems, E79-
D(4):297–305, Apr. 1996.

[5]. H. Yamada, T. Hotta, T. Nishiyama, F. Murabayashi, T. Yamauchi and H.


Sawamoto, “A 13.3ns double-precision floating-point ALU and
multiplier”, Proc. IEEE International Conference on Computer Design,
Oct. 1995, pp. 466–470.

[6]. S.F. Oberman, H. Al-Twaijry and M.J. Flynn, “The SNAP project: design
of floating point arithmetic units”, in Proc. 13th IEEE Symposium on
Computer Arithmetic, 1997, pp.156–165.

[7]. P. M. Seidel and G. Even, “On the design of fast IEEE floating-point
adders”, Proc. 15th IEEE Symposium on Computer Arithmetic, June 2001,
pp. 184–194.

[8]. F. Elguibaly, “A fast parallel multiplier-accumulator using the modified
Booth algorithm”, IEEE Transactions on Circuits and Systems II, Sept.
2000, pp. 902–908.

[9]. E. Hokenek, R. K. Montoye and P. W. Cook, “Second-generation RISC
floating point with multiply-add fused”, IEEE Journal of Solid-State
Circuits, Oct. 1990, pp. 1207–1213.

[10]. H. Oh, S. M. Mueller, C. Jacobi, K. D. Tran, S. R. Cottier, B. W Michael,


H. Nishikawa, Y. Totsuka, T. Namatame, N. Yano, T. Machida and S. H.
Dhong, “A fully-pipelined single-precision floating point unit in the
synergistic processor element of a CELL processor”, 2005 Symposium on


VLSI Circuits, Digest of Technical Papers, pp. 24–27.

[11]. H. Suzuki, H. Morinaka, H. Makino, Y. Nakase, K. Mashiko and T.


Sumi, “Leading-Zero Anticipatory Logic for High-speed Floating Point
Addition,” IEEE J. Solid State Circuits, Aug. 1996, pp. 1157–1164.

[12]. D. M. Tullsen, S. J. Eggers and H. M. Levy, “Simultaneous
multithreading: Maximizing on-chip parallelism”, in Proceedings of the
22nd Annual International Symposium on Computer Architecture, June
1995, pp. 392–403.

[13]. Z. Luo, and M. Martonosi, “Accelerating Pipelined Integer and Floating-


Point Accumulations in Configurable Hardware with Delayed Addition
Techniques,” IEEE Trans. on Computers, Mar. 2000, pp. 208–218.

[14]. C.S. Wallace, “Suggestions for a Fast Multiplier,” IEEE Trans. Electronic
Computers, vol. 13, pp. 114-117, Feb. 1964.

[15]. K. Kuhn, M. Agostinelli, S. Ahmed, S. Chambers, S. Cea, S. Christensen,


P. Fischer, J. Gong, C. Kardas, T. Letson, L. Henning, A. Murthy, H.
Muthali, B. Obradovic, P. Packan, S. Pae, I. Post, S. Putna, K. Raol, A.
Roskowski, R. Soman, T. Thomas, P. Vandervoorn, M. Weiss and I.
Young, “A 90nm Communication Technology Featuring SiGe HBT
Transistors, RF CMOS,” IEDM 2002, pp. 73–76.

[16]. F. Klass, "Semi-Dynamic and Dynamic Flip-Flops with Embedded


Logic,” 1998 Symposium on VLSI Circuits, Digest of Technical Papers,
pp. 108–109.

[17]. J. Tschanz, S. Narendra, Z. Chen, S. Borkar, M. Sachdev, V. De,


“Comparative Delay and Energy of Single Edge-Triggered & Dual Edge-
Triggered Pulsed Flip-Flops for High-Performance Microprocessors,”
ISLPED 2001, pp. 147-151.

[18]. S. Vangal, M. Anders, N. Borkar, E. Seligman, V. Govindarajulu, V.


Erraguntla, H. Wilson, A. Pangal, V. Veeramachaneni, J. Tschanz, Y. Ye,
D. Somasekhar, B. Bloechel, G. Dermer, R. Krishnamurthy, K.
Soumyanath, S. Mathew, S. Narendra, M. Stan, S. Thompson, V. De and
S. Borkar, “5-GHz 32-bit integer execution core in 130-nm dual-VT
CMOS,” IEEE Journal of Solid-State Circuits, Volume 37, Issue 11,
Nov. 2002, pp. 1421–1432.
Part III

An 80-Tile TeraFLOPS NoC

Chapter 4
An 80-Tile Sub-100W TeraFLOPS NoC
Processor in 65-nm CMOS

We now present an integrated network-on-chip architecture containing 80


tiles arranged as an 8×10 2D array of floating-point cores and packet-switched
routers, both designed to operate at 4 GHz (Paper 4). Each tile has two
pipelined single-precision floating-point multiply accumulators (FPMACs), which
feature a single-cycle accumulation loop for high throughput. The on-chip 2D
mesh network (Paper 5) provides a bisection bandwidth of 2 Tera-bits/s. The
15-FO4 design employs mesochronous clocking, fine-grained clock gating,
dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal
CMOS process, the 275 mm2 custom design contains 100-M transistors. The
fully functional first silicon achieves over 1.0TFLOPS of performance on a
range of benchmarks while dissipating 97 W at 4.27 GHz and 1.07-V supply.

4.1 Introduction
The scaling of MOS transistors into the nanometer regime opens the
possibility for creating large scalable Network-on-Chip (NoC) architectures [1]
containing hundreds of integrated processing elements with on-chip
communication. NoC architectures, with structured on-chip networks, are
emerging as a scalable and modular solution to global communications within
large systems-on-chip. The basic concept is to replace today’s shared buses with
on-chip packet-switched interconnection networks [2]. NoC architectures use
layered protocols and packet-switched networks, which consist of on-chip
routers, links, and well-defined network interfaces. As shown in Figure 4-1, the
basic building block of the NoC architecture is the “network tile”. The tiles are
connected to an on-chip network that routes packets between them. Each tile
may consist of one or more compute cores and include logic responsible for
routing and forwarding the packets, based on the routing policy of the network.
The structured network wiring of such a NoC design gives well-controlled
electrical parameters that simplify timing and allow the use of high-
performance circuits to reduce latency and increase bandwidth. Recent tile-
based chip multiprocessors include the RAW [3], TRIPS [4], and ASAP [5]
projects. These tiled architectures show promise for greater integration, high
performance, good scalability and potentially high energy efficiency.


Figure 4-1: NoC architecture.

With the increasing demand for interconnect bandwidth, on-chip networks are
taking up a substantial portion of the system power budget. The 16-tile MIT RAW
on-chip network consumes 36% of total chip power, with each router dissipating
40% of individual tile power [6]. The routers and the links of the Alpha 21364
microprocessor consume about 20% of the total chip power. With on-chip
communication consuming a significant portion of the chip power and area
budgets, there is a compelling need for compact, low power routers. At the
same time, while applications dictate the choice of the compute core, the advent
of multimedia applications, such as three-dimensional (3D) graphics and signal
processing, places stronger demands for self-contained, low-latency floating-
point processors with increased throughput. A computational fabric built using
these optimized building blocks is expected to provide high levels of
performance in an energy-efficient manner. This chapter describes the design
details of an integrated 80-tile NoC architecture implemented in a 65-nm process
technology. The prototype is designed to deliver over 1.0 TFLOPS of average
performance while dissipating less than 100 W.
The remainder of the chapter is organized as follows. Section 4.2 lists the
motivation for the teraFLOPS processor. Section 4.3 gives an architectural
overview of the 80-tile NoC and describes the key building blocks. Section 4.4
explains the FPMAC unit pipeline and the design optimizations used to
accomplish single-cycle accumulation; router architecture details, the NoC
communication protocol and packet formats are also described. Section 4.5
describes chip implementation details, including the high-speed mesochronous
clock distribution network used in this design. Details of the circuits used for
leakage power management in both logic and memory blocks are also discussed.
Section 4.6 presents the chip measurement results. We present future tera-scale
applications in Section 4.7. Section 4.8 concludes by summarizing the NoC
architecture along with key performance and power numbers.

4.2 Goals of the TeraFLOPS NoC Processor


A primary motivation for this work is to demonstrate the industry’s first
programmable processor capable of delivering over one trillion mathematical
calculations per second (1.0TFLOPS) of performance while dissipating less than
100W. The TeraFLOPS processor extends our research work on key NoC
building blocks and integrates them into a large 80-core NoC design using an
effective tiled-design methodology. The number “80” is a balance between the
performance/watt required and the available die area. The NoC uses a 2D mesh
interconnect topology, fast enough to provide terabits of connectivity between
the tiles. The mesh network is also attractive from a resiliency or reliability
perspective because of its ability to re-route traffic in the event of congestion or
link failure.
This research chip is designed to provide specific insights in new silicon
design methodologies for large-scale NoCs (100+ cores), high-bandwidth
interconnects, scalable clocking solutions and effective energy management
techniques. The eventual goal is to develop highly scalable multi-core
architectures with an optimal mix of general and special purpose processing
cores and scalable, reliable on-chip networks, exploiting state-of-the-art
process technology and packaging. An additional goal is to motivate research in
the area of parallel programming by developing new programming tools that
make highly-threaded and data-parallel applications easier to develop and
debug.

4.3 NoC Architecture


The NoC architecture (Figure 4-2) contains 80 tiles arranged as an 8×10 2D
mesh network that is designed to operate at 4GHz [7]. Each tile consists of a
processing engine (PE) connected to a 5-port router with mesochronous
interfaces (MSINT), which forwards packets between the tiles. The 80-tile on-
chip network enables a bisection bandwidth of 2Tera-bits/s. The PE contains
two independent fully-pipelined single-precision floating-point multiply-
accumulator (FPMAC) units, 3KB single-cycle instruction memory (IMEM),
and 2KB of data memory (DMEM). A 96-bit Very Long Instruction Word
(VLIW) encodes up to eight operations per cycle. With a 10-port (6-read, 4-
write) register file, the architecture allows scheduling to both FPMACs,
simultaneous DMEM loads and stores, packet send/receive over the mesh network,
program control, and dynamic sleep instructions. A router interface block (RIB)
handles packet encapsulation between the PE and router. The fully symmetric
architecture allows any PE to send (receive) instruction and data packets to
(from) any other tile. The 15 fan-out-of-4 (FO4) design uses a balanced core and
router pipeline with critical stages employing performance-setting semi-dynamic
flip-flops. In addition, a scalable low-power mesochronous clock distribution is
employed in the 65-nm eight-metal CMOS process, enabling high integration
and single-chip realization of the teraFLOPS processor.
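As a quick cross-check of the bisection bandwidth quoted above (a back-of-the-envelope estimate, assuming the minimum bisection cuts the mesh across its eight-tile dimension): the cut crosses 8 links, each consisting of two 32-bit unidirectional data channels running at 4 GHz, so the bisection bandwidth is 8 × 2 × 32 bits × 4 GHz ≈ 2 Tera-bits/s.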

Figure 4-2: NoC block diagram and tile architecture.



4.4 FPMAC Architecture


The 9-stage pipelined FPMAC architecture (Figure 4-3) uses a single-cycle
accumulate algorithm [8] with base 32 and internal carry-save arithmetic with
delayed addition. The FPMAC contains a fully pipelined multiplier unit (pipe
stages S1-S3), and a single-cycle accumulation loop (S5), followed by pipelined
addition and normalization units (S6-S8). Operands A and B are 32-bit inputs in
IEEE-754 single-precision format [9]. The design is capable of sustained
pipelined performance of one FPMAC instruction every 250ps. The multiplier is
designed using a Wallace tree of 4-2 carry-save adders. The well-matched
delays of each Wallace tree stage allow for highly efficient pipelining (S1-S3).
Four Wallace tree stages are used to compress the partial product bits to a sum
and carry pair. Notice that the multiplier does not use a carry propagate adder at
the final stage. Instead, the multiplier retains the output in carry-save format and
converts the result to base 32 (at stage S3), prior to accumulation.


Figure 4-3: FPMAC 9-stage pipeline with single-cycle accumulate loop.



In an effort to achieve fast single-cycle accumulation, we first analyzed each


of the critical operations involved in conventional FPUs with the intent of
eliminating, reducing or deferring the logic operations inside the accumulate
loop and identified the following three optimizations [8].
1) The accumulator (stage S5) retains the multiplier output in carry-save
format and uses an array of 4-2 carry save adders to “accumulate” the result in
an intermediate format. This removes the need for a carry-propagate adder in the
critical path.
2) Accumulation is performed in a base-32 system, converting the expensive
variable shifters in the accumulate loop to constant shifters.
3) The costly normalization step is moved outside the accumulate loop, where
the accumulation result in carry-save is added (stage S6), the sum normalized
(stage S7) and converted back to base 2 (stage S8).
These optimizations allow accumulation to be implemented in just 15 FO4
stages. This approach also reduces the latency of dependent FPMAC
instructions and enables a sustained multiply-add result (2 FLOPS) every cycle.
Careful pipeline re-balancing allows removal of 3 pipe-stages resulting in a 25%
latency improvement over work in [8]. The dual FPMACs in each PE provide
16GFLOPS of aggregate performance and are critical to achieving the goal of
teraFLOPS performance.

4.4.1 Instruction Set


The architecture defines a 96-bit VLIW which allows a maximum of eight
operations to be issued every cycle. The instructions fall into one of six
categories (Table 4-1): instruction issue to both floating-point units,
simultaneous data memory loads and stores, packet send/receive via the on-die
mesh network, program control using jump and branch instructions,
synchronization primitives for data transfer between PEs, and dynamic sleep
instructions. The data path between the DMEM and the register file supports
transfer of two 32-bit data words per cycle on each load (or store) instruction.
The register file issues four 32-bit data words to the dual FPMACs per cycle,
while retiring two 32-bit results every cycle. The synchronization instructions
aid with data transfer between tiles and allow the PE to stall while waiting for
data (WFD) to arrive. To aid with power management, the architecture provides
special instructions for dynamic sleep and wakeup of each PE, including
independent sleep control of each floating-point unit inside the PE. The
architecture allows any PE to issue sleep packets to any other tile or wake it up
for processing tasks. With the exception of FPU instructions which have a
pipelined latency of 9 cycles, most instructions execute in 1-2 cycles.

Instruction Type     Latency (cycles)
FPU                  9
LOAD/STORE           2
SEND/RECEIVE         2
JUMP/BRANCH          1
STALL/WFD            N/A
SLEEP/WAKE           1-6
Table 4-1: Instruction types and latency.
4.4.2 NoC Packet Format

Figure 4-4 describes the NoC packet structure and routing protocol. The on-
chip 2D mesh topology utilizes a 5-port router based on wormhole switching,
where each packet is subdivided into “FLITs” or “Flow control unITs”. Each
FLIT contains six control signals and 32 data bits. The packet header (FLIT_0)
allows for a flexible source-directed routing scheme, where a 3-bit destination
ID field (DID) specifies the router exit port. This field is updated at each hop.

FLIT_0 (packet header): FC0, FC1, L, V, T, H control bits (6 bits); CH; ten 3-bit
destination IDs DID[2:0], one per hop.
FLIT_1 (PE control information): FC0, FC1, L, V, T, H; NPC; PCA; SLP; REN; ADDR.
FLIT_2 (data): FC0, FC1, L, V, T, H; d31–d0 (32 data bits).

L: Lane ID                       T: Packet tail
V: Valid FLIT                    H: Packet header
FC: Flow control (lanes 1:0)     CH: Chained header
NPC: New PC address enable       PCA: PC address
SLP: PE sleep/wake               REN: PE execution enable
ADDR: IMEM/DMEM write address

Figure 4-4: NoC protocol: packet format and FLIT description.



Flow control and buffer management between routers is debit-based using
almost-full bits, which the receiver queue signals via two flow control bits (FC0-
FC1) when its buffers reach a specified threshold. Each header FLIT supports a
maximum of 10 hops; a chained header (CH) bit in the packet provides support
for a larger number of hops. Processing engine control information, including
the sleep and wakeup control bits, is specified in FLIT_1, which follows the
header FLIT. The minimum packet size required by the protocol is two FLITs.
The router architecture places no restriction on the maximum packet size.
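For illustration, the packet fields can be pictured with the C sketch below (the exact bit packing and the shift-based DID update are assumptions of the sketch, not the silicon encoding):

    #include <stdint.h>

    typedef struct {
        uint8_t  fc0 : 1;   /* flow control, lane 0 */
        uint8_t  fc1 : 1;   /* flow control, lane 1 */
        uint8_t  l   : 1;   /* lane ID              */
        uint8_t  v   : 1;   /* valid FLIT           */
        uint8_t  t   : 1;   /* packet tail          */
        uint8_t  h   : 1;   /* packet header        */
        uint32_t data;      /* 32-bit payload       */
    } flit_t;

    /* A header FLIT (FLIT_0) carries the chained-header bit and ten 3-bit
       destination IDs, i.e. the router exit port for each of up to 10 hops. */
    static uint32_t make_header_data(uint8_t ch, const uint8_t did[10])
    {
        uint32_t d = (uint32_t)(ch & 1) << 30;
        for (int i = 0; i < 10; i++)
            d |= (uint32_t)(did[i] & 0x7) << (3 * i);
        return d;
    }

    /* "The DID field is updated at each hop": modeled here by shifting the
       next hop's exit port into the low bits of the route field.            */
    static uint32_t consume_hop(uint32_t header_data)
    {
        uint32_t ch   = header_data & (1u << 30);   /* keep the CH bit    */
        uint32_t dids = header_data & 0x3FFFFFFF;   /* the ten 3-bit DIDs */
        return ch | (dids >> 3);
    }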

4.4.3 Router Architecture

A 4 GHz five-port two-lane pipelined packet-switched router core (Figure 4-5)
with phase-tolerant mesochronous links forms the key communication fabric for
the 80-tile NoC architecture. Each port has two 39-bit unidirectional point-to-
point links. The input-buffered wormhole-switched router [18] uses two logical
lanes (lanes 0-1) for deadlock-free routing and a fully non-blocking crossbar
switch with a total bandwidth of 80 GB/s (32 bits × 4 GHz × 5 ports). Each lane
has a 16-FLIT queue, an arbiter and flow control logic. The router uses a 5-stage
pipeline with a two-stage round-robin arbitration scheme that first binds an input
port to an output port in each lane and then selects a pending FLIT from one of
the two lanes. A shared data path architecture allows crossbar switch re-use
across both lanes on a per-FLIT basis. The router links implement a
mesochronous interface with first-in first-out (FIFO) based synchronization at
the receiver.
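The two-stage arbitration can be pictured with a minimal round-robin step like the C sketch below (assumed logic for illustration, not the actual RTL); it is applied once per lane to bind input ports to output ports, and once more to pick between the two lanes for the shared crossbar:

    /* Grant the first pending request after the previous grant, wrapping
       around the n_ports request lines; returns -1 if nothing is pending. */
    static int round_robin(unsigned req, int last_grant, int n_ports)
    {
        for (int i = 1; i <= n_ports; i++) {
            int p = (last_grant + i) % n_ports;
            if (req & (1u << p))
                return p;
        }
        return -1;
    }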

The router core features a double-pumped crossbar switch [10] to reduce
crossbar interconnect routing area. The schematic in Figure 4-6(a) shows the 36-
bit crossbar data bus double-pumped at the 4th pipe-stage of the router by
interleaving alternate data bits using dual edge-triggered flip-flops, reducing
crossbar area by 50%. In addition, the proposed router architecture shares the
crossbar switch across both lanes on an individual FLIT basis. The combined
application of both ideas enables a compact 0.34 mm2 design, resulting in a 34%
reduction in router layout area as shown in Figure 4-6(b), 26% fewer devices, a
13% improvement in average power and a one-cycle latency reduction (from 6 to
5 cycles) over the router design in [11] when ported and compared in the same
65-nm process [12]. Results from the comparison are summarized in Table 4-2.
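Conceptually, double pumping time-multiplexes the 36 crossbar bits onto half as many wires, sending the even bits on one clock phase and the odd bits on the other; the behavioral C sketch below shows the idea (not the circuit of Figure 4-6a):

    #include <stdint.h>

    /* Split a 36-bit crossbar word into two 18-bit half-cycle phases, so the
       physical channel needs only 18 wires; dual edge-triggered flip-flops at
       the receiving end reassemble the word.                                  */
    static void double_pump(uint64_t bits36, uint32_t *phase_a, uint32_t *phase_b)
    {
        *phase_a = 0;   /* even bits, first half of the cycle  */
        *phase_b = 0;   /* odd bits, second half of the cycle  */
        for (int i = 0; i < 18; i++) {
            *phase_a |= (uint32_t)((bits36 >> (2 * i))     & 1u) << i;
            *phase_b |= (uint32_t)((bits36 >> (2 * i + 1)) & 1u) << i;
        }
    }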

Figure 4-5: Five-port two-lane shared crossbar router architecture.

Router               Work in [11]   This Work   Benefit
Transistors          284K           210K        26%
Area (mm2)           0.52           0.34        34%
Latency (cycles)     6              5           16.6%

Table 4-2: Router comparison over work in [11].



Figure 4-6: (a) Double-pumped crossbar switch schematic. (b) Area benefit
over work in [11].

4.4.4 Mesochronous Communication


The 2 mm long point-to-point unidirectional router links implement a phase-
tolerant mesochronous interface (Figure 4-7). Four of the five router links are
source-synchronous, each providing a strobe (Tx_clk) with 38 bits of data. To
reduce active power, Tx_clk is driven at half the clock rate. A 4-deep circular
FIFO, built using transparent latches, captures data on both edges of the delayed
link strobe at the receiver. The strobe delay and duty-cycle can be digitally
programmed using the on-chip scan chain. A synchronizer circuit sets the
latency between the FIFO write and read pointers to 1 or 2 cycles at each port,
depending on the phase of the arriving strobe with respect to the local clock. A
more aggressive low-latency setting reduces the synchronization penalty by one
cycle. The interface includes the first stage of the router pipeline.
Figure 4-7: Phase-tolerant mesochronous interface and timing diagram.

4.4.5 Router Interface Block (RIB)


The RIB is responsible for message-passing and aids with synchronization of
data transfer between the tiles and with power management at the PE level.
Incoming 38-bit wide FLITs are buffered in a 16-entry queue, where
demultiplexing based on the lane ID and framing to 64 bits for data packets
(DMEM) and 96 bits for instruction packets (IMEM) are accomplished.
buffering is required during program execution since DMEM stores from the 10-
port register file have priority over data packets received by the RIB. The unit
decodes FLIT_1 (Figure 4-4) of an incoming instruction packet and generates
several PE control signals. This allows the PE to start execution (REN) at a
specified IMEM address (PCA) and is enabled by the new program counter
(NPC) bit. After receiving a full data packet, the RIB generates a break signal to
continue execution, if the IMEM is in a stalled (WFD) mode. Upon receipt of a
sleep packet via the mesh network, the RIB unit can also dynamically put the
entire PE to sleep or wake it up for processing tasks on demand.

4.5 Design details


To allow 4+ GHz operation, the entire core is designed using hand-optimized
data path macros. CMOS static gates are used to implement most of the logic.
However, critical registers in the FPMAC and router logic utilize implicit-pulsed
semi-dynamic flip-flops (SDFF) [13]–[14]. The SDFF (Figure 4-8) has a
dynamic master stage coupled to a pseudo-static slave stage. The FPMAC
accumulator register is built using data-inverting rising edge-triggered SDFFs
with synchronous reset and enable. The negative setup time of the flip-flop is
taken advantage of in the critical path. When compared to a conventional static
master-slave flip-flop, SDFF provides both shorter latency and the capability of
incorporating logic functions with minimum delay penalty, properties which are
desirable in high-performance digital designs.

Figure 4-8: Semi-dynamic flip-flop (SDFF) schematic.



The chip uses a scalable global mesochronous clocking technique, which


allows for clock phase-insensitive communication across tiles and synchronous
operation within each tile. The on-chip PLL output (Figure 4-9a) is routed using
horizontal M8 and vertical M7 spines. Each spine consists of differential clocks
for low duty-cycle variation along the worst-case clock route of 26 mm. An op-
amp at each tile converts the differential clocks to a single ended clock with a
50% duty cycle prior to distributing the clock across the tile using a balanced H-
tree. This clock distribution scales well as tiles are added or removed. The
worst-case simulated global duty-cycle variation is 3 ps and local clock skew
within the tile is 4ps. Figure 4-9(b) shows simulated clock arrival times for all
80 tiles at 4 GHz operation. Note that multiple cycles are required for the global
clock to propagate to all 80 tiles. The systematic clock skews inherent in the
distribution help spread peak currents due to simultaneous clock switching over
the entire cycle.


Figure 4-9: (a) Global mesochronous clocking and (b) simulated clock
arrival times.

Fine-grained clock gating, sleep transistor and body bias circuits [15] are used
to reduce active (Figure 4-9a) and standby leakage power, which are controlled
at full-chip, tile-slice, and individual tile levels based on workload. Each tile is
partitioned into 21 smaller sleep regions with dynamic control of individual
blocks in the PE and router units based on instruction type. The router is partitioned
into 10 smaller sleep regions with control of individual router ports, depending
on network traffic patterns. The design uses NMOS sleep transistors to reduce
frequency penalty and area overhead. Figure 4-10 shows the router and on-die
network power management scheme. The enable signals gate the clock to each
port, MSINT and the links. In addition, the enable signals also activate the
NMOS sleep transistors in the input queue arrays of both lanes. The 360µm
NMOS sleep device in the register-file is sized to provide a 4.3X reduction in
array leakage power with a 4% frequency impact. The global clock buffer
feeding the router is finally gated at the tile level based on port activity.

Figure 4-10: Router and on-die network power management.

Each FPMAC implements unregulated sleep transistors with no data retention
(Figure 4-11a). A 6-cycle pipelined wakeup sequence largely mitigates current
spikes compared with a single-cycle re-activation scheme, while allowing
floating-point unit execution to start one cycle into wakeup. Circuit simulations
(SPICE) show up to a 4X reduction in peak currents when compared to single-
cycle wakeup. Note the staged activation of the three FPMAC sub-blocks from
sleep. A faster 3-cycle fast-wake option is also supported. Memory arrays, on
the other hand, use a regulated, actively clamped sleep transistor circuit
(Figure 4-11b) that ensures data retention and minimizes standby leakage power
[16]. The closed-loop opamp configuration ensures that the virtual ground
voltage (VSSV) is no greater than a VREF input voltage under PVT variations.
VREF is set based on the memory cell standby VMIN voltage. The average sleep
transistor area overhead is 5.4% with a 4% frequency penalty. About 90% of the
FPU logic and 74% of each PE is sleep-enabled. In addition, forward body bias
can be externally applied to all NMOS devices during active mode to increase
the operating frequency, and reverse body bias can be applied during idle mode
for further leakage savings.

Figure 4-11: (a) FPMAC pipelined wakeup diagram and simulated peak
current reduction and (b) state-retentive memory clamp circuit.

4.6 Experimental Results


The teraFLOPS processor is fabricated in a 65-nm process technology [12]
with a 1.2-nm gate-oxide thickness, nickel salicide for lower resistance and a
second-generation strained silicon technology. The interconnect uses eight
copper layers and a low-K carbon-doped oxide (k = 2.9) inter-layer dielectric.
The functional blocks of the chip and individual tile are identified in the die
photographs in Figure 4-12. The 275 mm2 fully custom design contains 100
million transistors. Using a fully-tiled approach, each 3 mm2 tile is drawn
complete with C4 bumps, power, global clock and signal routing, which are
seamlessly arrayed by abutment. Each tile contains 1.2 million transistors with
the processing engine accounting for 1 million (83%) and the router 17% of the
total tile device count. Decoupling capacitors occupy about 20% of the total
logic area. The chip has three independent voltage regions: one for the tiles, a
separate supply for the PLL, and a third for the I/O circuits. Test and debug
features include a TAP controller and full-scan support for all memory blocks on
chip.
(Die: 12.64 mm × 21.72 mm; tile: 1.5 mm × 2.0 mm. Technology: 65 nm, 1 poly,
8 metal (Cu). Transistors: 100 million (full chip), 1.2 million per tile. Die area:
275 mm2 (full chip), 3 mm2 per tile. C4 bumps: 8390.)

Figure 4-12: Full-Chip and tile micrograph and characteristics.



The evaluation board with the packaged chip is shown in Figure 4-13. The die
has 8390 C4 solder bumps, arrayed with a single uniform bump pitch across the
entire die. The chip-level power distribution consists of a uniform M8-M7 grid
aligned with the C4 power and ground bump array.

Figure 4-13: (a) Package die-side. (b) Land-side. (c) Evaluation board.

The package is a 66 mm × 66 mm flip-chip LGA (land grid array) and
includes an integrated heat spreader. The package has a 14-layer stack-up (5-4-
5) to meet the various power plane and signal requirements and has a total of
1248 pins, of which 343 are signal pins. Decoupling capacitors are mounted
on the land-side of the package as shown in Figure 4-13(b). A PC running
custom software is used to apply test vectors and observe results through the on-
chip scan chain. First silicon has been validated to be fully functional.
Frequency versus power supply on a typical part is shown in Figure 4-14.
Silicon chip measurements at a case temperature of 80°C demonstrate a chip
maximum frequency (FMAX) of 1 GHz at 670 mV and 3.16 GHz at 950 mV, with
frequency increasing to 5.1 GHz at 1.2 V and 5.67 GHz at 1.35 V. With all 80
tiles (N=80) actively performing single-precision block-matrix operations, the
chip achieves a peak performance of 0.32 TFLOPS (670 mV), 1.0 TFLOPS (950
mV), 1.63 TFLOPS (1.2 V) and 1.81 TFLOPS (1.35 V).
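These peak numbers can be sanity-checked with a back-of-the-envelope calculation (not taken from the measurements themselves): each of the 80 tiles contains two FPMACs, and each FPMAC retires one multiply-add, i.e. 2 FLOPS, per cycle, so the peak performance is 80 × 2 × 2 = 320 FLOPS per cycle times the clock frequency. At 1 GHz this gives 0.32 TFLOPS, at 3.16 GHz about 1.01 TFLOPS, at 5.1 GHz about 1.63 TFLOPS, and at 5.67 GHz about 1.81 TFLOPS, matching the figures above.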

Figure 4-14: Measured chip FMAX and peak performance.

Several application kernels have been mapped to the design and the
performance is summarized in Table 4-3. The table shows the single-precision
floating point operation count, the number of active tiles (N) and the average
performance in TFLOPS for each application, reported as a percentage of the
peak performance achievable with the design. In each case, task mapping was
hand optimized and communication was overlapped with computation as much
as possible to increase efficiency. The stencil code solves a steady-state two-
dimensional heat diffusion equation with periodic boundary conditions on the left
and right boundaries of a rectilinear grid, and prescribed temperatures on top
and bottom boundaries. For the stencil kernel, chip measurements indicate an
average performance of 1.0 TFLOPS at 4.27 GHz and 1.07 V supply with 358K
floating-point operations, achieving 73.3% of the peak performance. This data is
particularly impressive because the execution is entirely overlapped with local
loads and stores and communication between neighboring tiles. The SGEMM
matrix multiplication code operates on two 100 × 100 matrices with 2.63 million
floating-point operations, corresponding to an average performance of 0.51
TFLOPS. It is important to note that the read bandwidth from local data
memory limits the performance to half the peak rate. The spreadsheet kernel
applies reductions to tables of data consisting of pairs of values and weights. For
each table the weighted sum of each row and each column is computed. A 64-
point 2D FFT (Fast Fourier Transform) which implements the Cooley-Tukey
algorithm [17] using 64 tiles has also been successfully mapped to the design
with an average performance of 27.3 GFLOPS. It first computes 8-point FFTs in
each tile; each tile then passes its results to the 63 other tiles for the 2D FFT
computation. The complex communication pattern results in high overhead and
lower efficiency.
Application kernel             FLOP count   Average performance   % peak    Active
                                            (TFLOPS)              TFLOPS    tiles (N)
Stencil                        358K         1.00                  73.3%     80
SGEMM: Matrix Multiplication   2.63M        0.51                  37.5%     80
Spreadsheet                    62.4K        0.45                  33.2%     80
2D FFT                         196K         0.02                  2.73%     64

Table 4-3: Application performance measured at 1.07 V and 4.27 GHz operation.


Figure 4-15: Measured chip power for stencil application.

Figure 4-15 shows the total chip power dissipation with the active and leakage
power components separated as a function of frequency and power supply with
case temperature maintained at 80°C. We report measured power for the stencil
application kernel, since it is the most computationally intensive. The chip
power consumption ranges from 15.6 W at 670 mV to 230 W at 1.35 V. With
all 80 tiles actively executing stencil code the chip achieves 1.0 TFLOPS of
average performance at 4.27 GHz and 1.07 V supply with a total power
dissipation of 97 W. The total power consumed increases to 230 W at 1.35 V
and 5.67 GHz operation, delivering 1.33 TFLOPS of average performance.
Figure 4-16 plots the measured energy efficiency in GFLOPS/W for the stencil
application with power supply and frequency scaling. As expected, the chip
energy efficiency increases with power supply reduction, from 5.8 GFLOPS/W
at 1.35 V supply to 10.5 GFLOPS/W at the 1.0 TFLOPS goal to a maximum of
19.4 GFLOPS/W at 750 mV supply. Below 750 mV the chip FMAX degrades
faster than power saved by lowering tile supply voltage, resulting in overall
performance reduction and consequent drop in the processor energy efficiency.
The chip provides up to 394 GFLOPS of aggregate performance at 750 mV with
a measured total power dissipation of just 20 W.

Figure 4-16: Measured chip energy efficiency for stencil application.

Figure 4-17 presents the estimated power breakdown at the tile and router
levels, which is simulated at 4 GHz, 1.2 V supply and at 110°C. The processing
engine with the dual FPMACs, instruction and data memory, and the register file
accounts for 61% of the total tile power (Figure 4-17a). The communication
power is significant at 28% of the tile power and the synchronous tile-level
clock distribution accounts for 11% of the total. Figure 4-17(b) shows a more
detailed tile-to-tile communication power breakdown, which includes the router,
mesochronous interfaces and links. Clocking power is the largest component,
accounting for 33% of the communication power. The input queues on both
lanes and the data path circuits are the second major component, dissipating 22%
of the communication power.
Figure 4-18 shows the output differential and single-ended clock waveforms
measured at the clock buffer farthest from the PLL at a frequency of 5 GHz.
Notice that the duty cycles of the clocks are close to 50%. Figure 4-19 plots the
global clock distribution power as a function of frequency and power supply.
This is the switching power dissipated in the clock spines from the PLL to the
op-amp at the center of all 80 tiles. Measured silicon data at 80°C shows that the
power is 80 mW at 0.8 V and 1.7 GHz, increasing by 10X to 800 mW at 1 V and
3.8 GHz. The global clock distribution power is 2 W at 1.2 V and 5.1 GHz and
accounts for just 1.3% of the total chip power.

(a) Tile power profile: dual FPMACs 36%, IMEM + DMEM 21%, 10-port RF 4%,
router + links 28%, clock distribution 11%.
(b) Communication power breakdown: clocking 33%, queues + datapath 22%,
links 17%, crossbar 15%, arbiters + control 7%, MSINT 6%.
Figure 4-17: Estimated (a) Tile power profile (b) Communication power
breakdown.

Figure 4-20 plots the chip leakage power as a percentage of the total power
with all 80 processing engines and routers awake and with all the clocks
disabled. Measurements show that the worst-case leakage power in active mode
varies from a minimum of 9.6% to a maximum of 15.7% of the total power
when measured over the power supply range of 670 mV to 1.35 V. In sleep
mode, the NMOS sleep transistors are turned off, reducing chip leakage by 2X,
while preserving the logic state in all memory arrays.


Figure 4-18: Measured global clock distribution waveforms.


Figure 4-19: Measured global clock distribution power.




Figure 4-20: Measured chip leakage power as percentage of total power vs.
Vcc. A 2X reduction is obtained by turning off sleep transistors.

Figure 4-21: On-die network power reduction benefit.

Figure 4-21 shows the active and leakage power reduction due to a
combination of the selective router port activation, clock gating and sleep
transistor techniques described in Section 4.5. Measured at 1.2 V, 80°C and
5.1 GHz operation, the total network power per tile can be lowered from a
maximum of 924 mW with all router ports active to 126 mW, resulting in a 7.3X
reduction. The network leakage power per tile, with all ports and the global
clock buffers feeding the router disabled, is 126 mW. This number includes the
power dissipated in the router, MSINT and the links. Figure 4-22 shows a
measured scope waveform of the virtual ground (Vssv) node for one of the
instruction memory (IMEM) arrays, showing the transition of the block to and
from sleep. For this measurement, VREF is set at 400 mV, with a memory cell
standby VMIN voltage of 800 mV (Vcc = 1.2 V) for data retention. Notice that the
active clamping circuit ensures that Vssv is within a few mV (5%) of VREF. A
photograph of the measurement setup used to characterize the design is shown
in Figure 4-23.

Figure 4-22: Measured IMEM virtual ground waveform showing the transition
to and from sleep.

Figure 4-23: Measurement setup.

4.7 Applications for the TeraFLOP NoC Processor


We presented results from mapping a handful of typical application kernels to
our NoC processor with good success. A broader question to answer is “what
applications would really benefit from a tera-FLOP of computing power and
tera-bits of on-die communication bandwidth?". The teraFLOP processor
provides a 500X jump in compute capability over today’s giga-scale devices.
This is required for tomorrow's emerging applications including real-time
recognition, mining and synthesis (RMS) workloads on tera-bytes of data [19].
Other applications include artificial intelligence (AI) for smarter appliances,
virtual reality for 3D modeling, gaming, visualization, physics simulation,
financial applications and medical training; and uses that are still on the edge of
science fiction. In medicine, a full-body medical scan
already contains tera-bytes of information. Even at home, people are generating
large amounts of data, including hundreds of hours of video, and thousands of
digital photos that need to be indexed and searched. This computing model
promises to bring the massive compute capabilities of supercomputers to laptops
or even handheld devices.

4.8 Conclusion
In this chapter, we have presented an 80-tile high-performance NoC
architecture implemented in a 65-nm process technology. The prototype
contains 160 low-latency FPMAC cores and features a single-cycle
accumulator architecture for high throughput. Each tile also contains a fast and
compact router operating at core speed, and the 80 tiles are interconnected
using a 2D mesh topology providing a high bisection bandwidth of over 2 Tera-
bits/s. The design uses a combination of micro-architecture, logic, circuits and a
65-nm process to reach target performance. Silicon operates over a wide voltage
and frequency range, and delivers teraFLOPS performance with high power
efficiency. For the most computationally intensive application kernel, the chip
achieves an average performance of 1.0 TFLOPS, while dissipating 97 W at
4.27 GHz and 1.07 V supply, corresponding to an energy efficiency of 10.5
GFLOPS/W. Average performance scales to 1.33 TFLOPS at a maximum
operational frequency of 5.67 GHz and 1.35 V supply. These results
demonstrate the feasibility for high-performance and energy efficient building
blocks for peta-scale computing in the near future.

Acknowledgement
I sincerely thank Prof. A. Alvandpour, V. De, J. Howard, G. Ruhl, S. Dighe,
H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C.
Roberts, Y. Hoskote, N. Borkar, S. Borkar, D. Somasekhar, D. Jenkins, P.
Aseron, J. Collias, B. Nefcy, P. Iyer, S. Venkataraman, S. Saha, M. Haycock, J.
Schutz and J. Rattner for help, encouragement, and support; T. Mattson, R.
Wijngaart, M. Frumkin from the SSG and ARL teams at Intel for assistance with
mapping the kernels to the design, the LTD and ATD teams for PLL and
package design and assembly, and the entire mask design team for chip layout.

4.9 References
[1]. L. Benini and G. D. Micheli, “Networks on Chips: A New SoC Paradigm,” IEEE Computer, vol. 35, pp. 70–78, Jan. 2002.

[2]. W. J. Dally and B. Towles, “Route Packets, Not Wires: On-Chip Interconnection Networks,” in Proceedings of the 38th Design Automation Conference, pp. 681–689, June 2001.

[3]. M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffmann, P. Johnson, J. W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, “The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs,” IEEE Micro, vol. 22, no. 2, pp. 25–35, 2002.

[4]. K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. W. Keckler, and C. R. Moore, “Exploiting ILP, TLP, and DLP with The Polymorphous TRIPS Architecture,” in Proceedings of the 30th Annual International Symposium on Computer Architecture, pp. 422–433, 2003.

[5]. Z. Yu, M. Meeuwsen, R. Apperson, O. Sattari, M. Lai, J. Webb, E. Work, T. Mohsenin, M. Singh, and B. Baas, “An asynchronous array of simple processors for DSP applications,” ISSCC Dig. Tech. Papers, pp. 428–429, Feb. 2006.

[6]. H. Wang, L. Peh and S. Malik, “Power-driven design of router microarchitectures in on-chip networks,” MICRO-36, Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 105–116, 2003.

[7]. S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote and N. Borkar, “An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS,” ISSCC Dig. Tech. Papers, pp. 98–99, Feb. 2007.

[8]. S. Vangal, Y. Hoskote, N. Borkar and A. Alvandpour, “A 6.2-GFlops Floating-Point Multiply-Accumulator with Conditional Normalization,” IEEE J. of Solid-State Circuits, pp. 2314–2323, Oct. 2006.

[9]. IEEE Standards Board, “IEEE Standard for Binary Floating-Point Arithmetic,” Technical Report ANSI/IEEE Std. 754-1985, IEEE, New York, 1985.

[10]. S. Vangal, N. Borkar and A. Alvandpour, “A Six-Port 57GB/s Double-Pumped Non-blocking Router Core,” Symposium on VLSI Circuits, pp. 268–269, June 2005.

[11]. H. Wilson and M. Haycock, “A six-port 30-GB/s non-blocking router component using point-to-point simultaneous bidirectional signaling for high-bandwidth interconnects,” IEEE Journal of Solid-State Circuits, vol. 36, pp. 1954–1963, Dec. 2001.

[12]. P. Bai, C. Auth, S. Balakrishnan, M. Bost, R. Brain, V. Chikarmane, R. Heussner, M. Hussein, J. Hwang, D. Ingerly, R. James, J. Jeong, C. Kenyon, E. Lee, S.-H. Lee, N. Lindert, M. Liu, Z. Ma, T. Marieb, A. Murthy, R. Nagisetty, S. Natarajan, J. Neirynck, A. Ott, C. Parker, J. Sebastian, R. Shaheed, S. Sivakumar, J. Steigerwald, S. Tyagi, C. Weber, B. Woolery, A. Yeoh, K. Zhang and M. Bohr, “A 65nm Logic Technology Featuring 35nm Gate Lengths, Enhanced Channel Strain, 8 Cu Interconnect Layers, Low-k ILD and 0.57 µm² SRAM Cell,” IEDM Technical Digest, pp. 657–660, Dec. 2004.

[13]. F. Klass, “Semi-Dynamic and Dynamic Flip-Flops with Embedded Logic,” 1998 Symposium on VLSI Circuits, Digest of Technical Papers, pp. 108–109.

[14]. J. Tschanz, S. Narendra, Z. Chen, S. Borkar, M. Sachdev and V. De, “Comparative Delay and Energy of Single Edge-Triggered & Dual Edge-Triggered Pulsed Flip-Flops for High-Performance Microprocessors,” ISLPED 2001, pp. 147–151.

[15]. J. Tschanz, S. Narendra, Y. Ye, B. Bloechel, S. Borkar and V. De, “Dynamic sleep transistor and body bias for active leakage power control of microprocessors,” IEEE J. of Solid-State Circuits, pp. 1838–1845, Nov. 2003.

[16]. M. Khellah, D. Somasekhar, Y. Ye, N. Kim, J. Howard, G. Ruhl, M. Sunna, J. Tschanz, N. Borkar, F. Hamzaoglu, G. Pandya, A. Farhang, K. Zhang, and V. De, “A 256-Kb Dual-Vcc SRAM Building Block in 65-nm CMOS Process With Actively Clamped Sleep Transistor,” IEEE J. of Solid-State Circuits, pp. 233–242, Jan. 2007.

[17]. J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Math. Comput., vol. 19, pp. 297–301, 1965.

[18]. S. Vangal, A. Singh, J. Howard, S. Dighe, N. Borkar and A. Alvandpour, “A 5.1GHz 0.34 mm² Router for Network-on-Chip Applications,” 2007 Symposium on VLSI Circuits, Digest of Technical Papers, pp. 42–43, June 2007.

[19]. J. Held, J. Bautista, and S. Koehl, “From a few cores to many: A tera-scale computing research overview,” 2006. http://www.intel.com/research/platform/terascale/.
Chapter 5

Conclusions and Future Work

5.1 Conclusions
Multi-core processors are poised to become the new embodiment of Moore’s
law. Indeed, the processor industry is now aiming for ten- to hundred-core
processors on a single die within the next decade [1]. To realize this level of
integration quickly and efficiently, NoC architectures are rapidly emerging as
the candidate of choice by providing a highly scalable, reliable, and modular on-
chip communication infrastructure platform. Since a NoC is constructed from
multiple point-to-point links interconnected by switches (i.e., routers) that
exchange packets between the various IP blocks, this thesis focused on research
into these key building blocks, which spans three process generations (150-nm,
90-nm and 65-nm). NoC routers should be small, fast and energy-efficient. To
address this, we demonstrated the area and peak energy reduction benefits of a
six-port four-lane 57 GB/s communications router core. Design enhancements
include four double-pumped crossbar channels and destination-aware crossbar
channel drivers that dynamically configure based on the current packet
destination. Combined application of both techniques enables a 45% reduction
in channel area, 23% overall router core area, up to 3.8X reduction in peak
crossbar channel power, and 7.2% improvement in average channel power in a
150-nm six-metal CMOS process. A second-generation 102 GB/s router design
with a shared-crossbar architecture increases the router area savings to 34%,
enabling a compact 0.34 mm² design. The cost savings are expected to increase
for crossbar routers with larger link widths (n > 64 bytes), due to the O(n²) growth of crossbar area with link width.
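
A minimal sketch of this scaling argument follows; the quadratic area model (crossbar wiring area proportional to the square of the port count times the link width) and its constants are simplifying assumptions used only to illustrate the trend, not figures from the fabricated routers.

    # Illustrative only: full-crossbar wiring area modeled as (ports * link width)^2.
    # Absolute units are arbitrary; only the relative growth with link width matters.
    PORTS = 6

    def crossbar_area(link_bytes):
        wires = link_bytes * 8                 # wires per link
        return (PORTS * wires) ** 2            # arbitrary area units

    base = crossbar_area(64)                   # 64-byte links as the reference point
    for n in (64, 128, 256):
        print(f"{n:>3}-byte links: crossbar area ~{crossbar_area(n) / base:.0f}x the 64-byte case")

Because double-pumping or sharing the crossbar removes a roughly constant fraction of this quadratically growing term, the absolute area and power saved grow rapidly as links widen.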
We next presented the design of a pipelined single-precision FPMAC
featuring a bit-level pipelined multiplier unit and a single-cycle accumulation
loop, with delayed addition and normalization stages. A combination
of algorithmic, logic and circuit techniques enabled 3GHz operation, with just
15 FO4 stages in the critical path. The design eliminates scheduling restrictions
between consecutive FPMAC instructions and improves throughput. In addition,
an improved leading zero anticipator and overflow detection logic applicable to
carry-save format has been developed. Fabricated in a 90-nm seven-metal dual-
VT CMOS process, silicon achieves 6.2 GFLOPS of performance at 3.1GHz,
1.3V supply, and dissipates 1.2W at 40°C, resulting in 5.2 GFLOPS/W. This
performance comes at the expense of a wider datapath in the accumulate loop,
which increases layout area and power. The conditional normalization technique
helps reclaim some of the power by enabling up to 24.6% reduction in FPMAC
active power and 30% reduction in leakage power.
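
The key idea behind the single-cycle accumulation loop, keeping the running value in a redundant carry-save form so that only a cheap 3:2 compression sits inside the loop while the expensive carry-propagate addition (and normalization) is deferred, can be illustrated with a small integer-domain sketch. This is only a conceptual analogy: the actual FPMAC operates on floating-point mantissas in base 32 with conditional normalization, none of which is modeled here.

    # Conceptual illustration of carry-save accumulation with delayed addition.
    def csa(a, b, c):
        """3:2 carry-save compressor: three operands in, redundant (sum, carry) out."""
        s = a ^ b ^ c
        carry = ((a & b) | (b & c) | (a & c)) << 1
        return s, carry

    def accumulate(values):
        s, c = 0, 0
        for v in values:            # one cheap 3:2 step per accumulated value
            s, c = csa(s, c, v)
        return s + c                # single carry-propagate addition, done once at the end

    samples = [37, 1024, 255, 99, 4096]
    assert accumulate(samples) == sum(samples)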
We finally described an 80-tile high-performance NoC architecture
implemented in a 65-nm eight-metal CMOS process technology. The prototype
contains 160 low-latency FPMAC cores and features a single-cycle
accumulator architecture for high throughput. Each tile also contains a fast and
compact router operating at core speed; the 80 tiles are interconnected
in a 2D mesh topology that provides a bisection bandwidth of over 2 Tera-
bits/s. The design uses a combination of micro-architecture, logic, circuits and a
65-nm process to reach target performance. Silicon operates over a wide voltage
(0.67V–1.35V) and frequency (1 GHz–5.67 GHz) range, delivering teraFLOPS
performance with high power efficiency. For the most computationally intensive
application kernel, the chip achieves an average performance of 1.0 TFLOPS,
while dissipating 97 W at 4.27 GHz and 1.07 V supply, corresponding to an
energy efficiency of 10.5 GFLOPS/W. This represents a 100X-150X
improvement in power efficiency, while providing a 200X increase in aggregate
performance over a typical desktop processor available in the market today.
It is clear that the realization of successful NoC designs requires well-balanced
decisions at all levels: architecture, logic, circuit and physical design. Our results
demonstrate that the NoC architecture successfully delivers on the promise of
greater integration, high performance, good scalability and high energy
efficiency. We are one step closer to having super-computing performance built
into our desktop and mobile computers in the near future.

5.2 Future Work


NoC design is a multi-variable optimization problem. Several challenging
research problems remain to be solved at all levels of the layered-stack, from the
physical link level through the network level, and all the way up to the system
architecture and application software. Network architecture research in the
areas of topology, routing, and flow control must complement efforts in
designing NoCs with a high degree of fault tolerance while satisfying
application quality-of-service (QoS) requirements. Heterogeneous NoCs that
allocate resources as needed [2] and circuit-switched networks [3] are promising
approaches. Designing efficient routers to support such networks is a
worthwhile challenge.
The on-die fabric power constitutes a significant and growing portion of the
overall design budget. Recall that the 2D mesh network in the teraFLOPS NoC
accounted for 28% of total power. It is imperative to contain the power and area
needs of the communication fabric without adversely affecting performance.
Much work still needs to be done in the areas of low-latency, power and area
efficient router architectures. An aggressive router design should have
latencies of just 1-2 cycles, account for less than 10% of tile power and area
budgets, and still allow multi-GHz operation. Future routers should incorporate
extensive fine-grained power management techniques that dynamically adapt
to changing network workload conditions for energy-efficient operation.
Low-power circuits for fabric (link) power reduction, incorporating low-swing
signaling techniques, are a vital research area [4]. The circuits must not only
reduce the interconnect swing, but also enable use of very low supply voltages
to obtain quadratic energy savings. While low-power encoding techniques have
been proposed at the link level, recent work [5] concludes that non-encoded
links with end-to-end data protection through error-correction methods allow
more aggressive supply voltage scaling and better power savings.
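
To first order, using the standard switching-energy model rather than any result specific to this thesis, the energy per link transition is

$$E_{\mathrm{link}} \approx \alpha\, C_{\mathrm{wire}}\, V_{dd}\, V_{\mathrm{swing}},$$

so lowering only the swing gives linear savings, while scaling the supply together with the swing by a factor k yields the quadratic (k²) savings referred to above.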
Optical interconnects present an attractive alternative to electrical
communication for both on-chip and off-chip use, providing high bandwidth
potentially at a lower power. Research shows that optical links are better than
electrical wires for lengths longer than 3-5 mm in a 50 nm CMOS process [6].
Optical communication shows great potential for low power operation and for
use in future clock distribution and global signaling. A fast silicon optical
modulator and a Raman silicon laser have recently been demonstrated [7]. Several
practical issues and challenges associated with integrating the photonic devices
on silicon and processing silicon photonic devices in a high volume CMOS
manufacturing environment are discussed in [7].

Multi-core processors need substantial memory bandwidth, a challenge
commonly referred to as "feeding the beast". To enable memory bandwidths
beyond 1 TB/s, the 3D processor + memory stacked-die architecture using multi-
chip packaging (MCP) becomes interesting [8]. Since this will provide the
shortest possible interconnect between the CPU and memory die, the bit
rate can exceed 10 Gb/s. In addition, the interconnect density will scale to
enable thousands of die-to-die interconnections. Despite the promising
advantages of 3D stacking, significant process flow integration, packaging and
thermal challenges [9] need to be overcome.
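
As a rough illustration of the interconnect count these numbers imply (simple arithmetic, ignoring clocking, error-correction and redundancy overheads):

$$N_{\mathrm{links}} \;\gtrsim\; \frac{1\ \mathrm{TB/s} \times 8\ \mathrm{bits/byte}}{10\ \mathrm{Gb/s\ per\ connection}} = 800,$$

so sustaining 1 TB/s at 10 Gb/s per die-to-die connection already requires on the order of a thousand signal connections, consistent with the interconnect densities mentioned above.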
Chip power delivery is becoming harder because of increases in the supply current
and in the rate of change of current (di/dt) caused by faster switching as
technology is scaled. Delivering 200 W at a 500 mV supply voltage is a very
challenging problem due to the high power supply current at a low supply voltage.
Recently, researchers have shown the effectiveness of integrated CMOS voltage
regulators (VRs) [10]. The VRs are small, have high current efficiency (>95%)
and a fast (<1 ns) response time. The modular nature of NoCs makes it possible to
integrate on-chip VRs with ultra-fast control of dozens of independent multi-
supply-voltage regions on die. This should improve power distribution
efficiency and allow excellent dynamic power management.
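
To make the magnitude of the power-delivery problem concrete:

$$I_{\mathrm{supply}} = \frac{P}{V_{dd}} = \frac{200\ \mathrm{W}}{0.5\ \mathrm{V}} = 400\ \mathrm{A},$$

that is, roughly 400 A must be delivered, regulated and distributed across the die within tight voltage tolerances, which is precisely what makes fast, fine-grained on-die regulation attractive.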
As a final note – to successfully exploit the pool of integrated
hardware resources available on silicon, research into parallel programming is
vital. There is a compelling need to develop enhanced compilation and
scheduling tools, fine-grained instruction scheduling algorithms and new
programming tools that make highly-threaded and data-parallel applications a
reality.

5.3 References
[1]. J. Held, J. Bautista, and S. Koehl, “From a few cores to many: A tera-scale computing research overview,” 2006. http://www.intel.com/research/platform/terascale/.

[2]. K. Rijpkema et al., “Trade-Offs in the Design of a Router with Both Guaranteed and Best-Effort Services for Networks on Chip,” IEE Proc. Comput. Digit. Tech., vol. 150, no. 5, pp. 294–302, 2003.

[3]. P. Wolkotte et al., “An Energy-Efficient Reconfigurable Circuit-Switched Network-on-Chip,” Proc. IEEE Int’l Parallel and Distributed Symp. (IPDS 05), IEEE CS Press, pp. 155–162, 2005.

[4]. H. Zhang, V. George and J. Rabaey, “Low-swing on-chip signaling techniques: effectiveness and robustness,” IEEE Transactions on VLSI Systems, pp. 264–272, June 2000.

[5]. A. Jantsch and R. Vitkowski, “Power analysis of link level and end-to-end data protection in networks-on-chip,” International Symposium on Circuits and Systems (ISCAS), pp. 1770–1773, 2005.

[6]. P. Kapur and K. Saraswat, “Optical interconnects for future high performance integrated circuits,” Physica E, vol. 16, no. 3–4, pp. 620–627, 2003.

[7]. M. Paniccia, M. Ansheng, N. Izhaky and A. Barkai, “Integration challenge of silicon photonics with microelectronics,” 2nd IEEE International Conf. on Group IV Photonics, pp. 20–22, Sept. 2005.

[8]. P. Reed, G. Yeung and B. Black, “Design Aspects of a Microprocessor Data Cache using 3D Die Interconnect Technology,” Proceedings of the International Conference on Integrated Circuit Design and Technology, pp. 15–18, May 2005.

[9]. Intel Technology Journal, vol. 11, issue 3, 2007. http://www.intel.com/technology/itj/.

[10]. P. Hazucha, S. Moon, G. Schrom, F. Paillet, D. Gardner, S. Rajapandian and T. Karnik, “High Voltage Tolerant Linear Regulator With Fast Digital Control for Biasing of Integrated DC-DC Converters,” IEEE J. of Solid-State Circuits, pp. 66–73, Jan. 2007.
