Abstract
This paper presents a physical implementation of the DSPIN
network-on-chip in the FAUST architecture. FAUST is a
stream-oriented, multi-application SoC platform for
telecommunications, addressing the IEEE 802.11a and MC-CDMA
standards. The original asynchronous network-on-chip (ANOC) of
FAUST has been replaced by the multi-synchronous DSPIN
network-on-chip. In this paper, we analyze how the DSPIN
network-on-chip, originally designed to support shared-memory,
multi-processor architectures, can support stream-oriented
architectures. The physical implementations of both ANOC and
DSPIN are presented. Finally, a comparison between the ANOC and
DSPIN designs in a 130 nm technology is carried out in terms of
area, throughput, packet latency, and power consumption.
1. Introduction
Increasing system performance by scaling the technology and the
clock frequency becomes more and more complex due to the lower
scalability of wire delays. New approaches such as
Network-on-Chip (NoC) architectures and the Globally
Asynchronous, Locally Synchronous (GALS) paradigm try to solve
this design bottleneck by partitioning the circuit into small
synchronous islands that communicate asynchronously. Each island
can be clocked at an independent clock frequency, while the
communications between neighboring islands are carried out by
the NoC. Moreover, the NoC approach attempts to solve the
bandwidth bottleneck of a central bus by splitting the
communications over a plurality of routers and links.
A large number of NoC architectures have been published, but
there are very few detailed analyses of their physical
implementation: SPIN [16,10], Tera-
2. FAUST Application
FAUST, which stands for Flexible Architecture of Unified Systems
for Telecom, is a hardware demonstration platform for the 4MORE
mobile terminals. 4MORE [1] is an IST program targeting 4G
baseband modem chips. The FAUST project was initiated in 2003 to
support multiple OFDM air interfaces in a single SoC. The FAUST
architecture (Figure 1) is composed of processing units
interconnected by a NoC. It also includes an ARM946ES in an AHB
subsystem. The functional units communicate by message passing
through the NoC. Each processing unit contains a programmable
Network Interface Controller, which contains input and output
FIFOs and regulates the traffic through the network. This
regulation is carried out by credits, which synchronize the
producer with the consumer in a self-synchronized data-pipeline
manner.
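The credit mechanism described above can be illustrated with a minimal Python sketch. It is only a behavioral model under stated assumptions: the class name `CreditLink`, its methods, and the FIFO depth of 2 are hypothetical and do not come from the FAUST Network Interface Controller. The idea shown is that the producer holds one credit per free slot in the consumer's input FIFO, spends a credit for each flit sent, and gets it back when the consumer removes the flit.

```python
from collections import deque


class CreditLink:
    """Hypothetical model of a producer/consumer pair regulated by credits."""

    def __init__(self, fifo_depth):
        self.fifo_depth = fifo_depth
        self.credits = fifo_depth   # producer side: one credit per free FIFO slot
        self.fifo = deque()         # consumer side: input FIFO

    def send(self, flit):
        """Producer: send a flit only if a credit is available."""
        if self.credits == 0:
            return False            # producer stalls, the consumer FIFO is full
        self.credits -= 1
        self.fifo.append(flit)
        return True

    def consume(self):
        """Consumer: remove one flit and hand a credit back to the producer."""
        if not self.fifo:
            return None
        flit = self.fifo.popleft()
        self.credits += 1
        return flit


link = CreditLink(fifo_depth=2)
sent = [link.send(f) for f in ("a", "b", "c")]  # the third send finds no credit
first = link.consume()                          # frees one slot and one credit
retry = link.send("c")                          # now succeeds
print(sent, first, retry)
```

Because the producer can never hold more credits than there are free slots, the consumer FIFO cannot overflow, and no flit has to be dropped or retransmitted inside the network.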
[Figure 1: FAUST architecture, showing the TX units, RX units, and AHB units interconnected by the NoC, with the NOC1, NOC2, RAM, and Ethernet pad interfaces.]
3. DSPIN NoC
DSPIN [2] stands for Distributed, Scalable, Programmable,
Integrated Network. It is a wormhole, packet-based NoC with a 2D
mesh topology. Packets are routed with the deterministic X-first
routing algorithm: they are first routed in the X direction and
then in the Y direction. The routing information is encoded in
the first flit of the packet as the absolute address of the
destination subsystem. Figure 8b shows the first flit and the
following flits of a packet. The DSPIN flit size is generic; in
this implementation it has been set to 34 bits, providing a
payload of 32 bits.
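The X-first decision taken in each router can be sketched in a few lines of Python. This is an illustrative model rather than the DSPIN hardware: the function names, the (x, y) coordinate tuples, and the orientation (East is x+1, North is y+1) are assumptions made for the example.

```python
def x_first_route(current, destination):
    """Return the output port selected by X-first routing.

    `current` and `destination` are (x, y) router coordinates.
    Port names and orientation (East = x+1, North = y+1) are assumptions.
    """
    cx, cy = current
    dx, dy = destination
    if dx > cx:
        return "East"
    if dx < cx:
        return "West"
    # The X coordinate matches, so the packet may now move along Y.
    if dy > cy:
        return "North"
    if dy < cy:
        return "South"
    return "Local"


def route_path(source, destination):
    """List the output ports taken hop by hop from source to destination."""
    step = {"East": (1, 0), "West": (-1, 0), "North": (0, 1), "South": (0, -1)}
    x, y = source
    ports = []
    while True:
        port = x_first_route((x, y), destination)
        ports.append(port)
        if port == "Local":
            return ports
        x, y = x + step[port][0], y + step[port][1]


path = route_path((0, 0), (2, 1))
print(path)  # ['East', 'East', 'North', 'Local']
```

Since every router applies the same rule to the absolute destination address, the path is fully determined by the source and destination coordinates, and a packet never turns from the Y direction back into the X direction.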
[Figure: DSPIN mesh around Cluster(Y,X) and its neighboring clusters (labels: North, South, East, West, Local, GS, BE).]
Feature | ANOC | DSPIN
Topology | Irregular | Regular 2D mesh
Router arity | 5-port router | 5-port router
Routing | [...] | Address-based, X-first algorithm
Switching technique | Wormhole | Wormhole
Flit size | 34 bits | 34 bits (generic)
Flit payload | 32 bits | 32 bits (generic)
Routing overhead and capability | 18 bits, allowing 9 routing hops; path extension is possible | [...]
Virtual channels | [...] | [...]
Programming model | Message passing | Shared memory (2 routers per cluster); message passing (1 router per cluster)
Clocking scheme | Fully asynchronous (QDI) with GALS interfaces | Multi-synchronous with mesochronous interfaces
Flow control protocol | Send/accept asynchronous handshake | FIFO protocol (Write and WriteOk)
Metastability issues | [...] | [...]
Clock tree | None | [...]
Physical implementation | Hard macro | [...]
Long wires | Inter-router wires | Intra-cluster wires
5. DSPIN Implementation
5.1. Synthesis
We used a hierarchical approach for the physical synthesis of
the FAUST architecture with the DSPIN NoC. Each cluster was
synthesized separately before being assembled at the top level
of the FAUST architecture. Thus, no RTL synthesis was performed
at the top level. The design was synthesized using
STMicroelectronics CMOS 130 nm low-power standard cells.
The timing constraints for the synthesis of the DSPIN routers
were chosen to take the physical implementation into account.
Thus, the DSPIN long wires (intra-cluster wires) were
constrained to a propagation time of 300 ps. Moreover, low-power
standard cells with low-Vt transistors were used in conjunction
with clock-gating techniques to minimize the power consumption.
[Figure: a) ANOC IP template; b) DSPIN IP template (elements shown: IP, NIC, GALS interface, protocol conversion LUT, ANOC router, DSPIN router; clocks CLK_IP and CLK_NoC).]
5.2. Floorplanning
[Figure: FAUST units and their DSPIN routers, with router ports labeled N, S, E, W, and L.]
6. Network Comparison
In this section, the ANOC and DSPIN implementations are compared
in terms of area, throughput, latency, and power consumption,
using a synthetic workload.
6.1. Area
The ANOC router was implemented as a hard macro. Its area is
0.21 mm² with a cell density of 95%. The GALS interface module
is implemented as a soft macro, and its area is computed
assuming a cell density of 95%. On the other hand, DSPIN is
implemented as a soft macro, and no area is exclusively reserved
for the router. Assuming a 95% integration density, the total
area is computed in Table 2, taking into consideration the DSPIN
clock tree and the FIFO area of GALS
will be higher than the gate delays, a multi-synchronous
architecture such as DSPIN would have a higher packet throughput
than an asynchronous one such as ANOC. Fortunately, pipeline
stages can be inserted on the long wires to cope with these
delays and improve the throughput, despite the added latency.
Table 2. Area comparison
Module | ANOC | DSPIN
Router | 0.211 mm² | 0.161 mm²
GALS interface | 0.070 mm² | 0.024 mm²
Clock tree | 0.000 mm² | 0.0016 mm²
Total | 0.281 mm² | 0.187 mm²
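As a quick cross-check of the area figures above, the short Python sketch below recomputes the totals from the per-module values and derives the relative area saving of DSPIN. The dictionary layout and variable names are illustrative; the numbers are copied from the area table, and the totals are rounded to three decimals as the table does.

```python
# Per-router area in mm^2, copied from the area table.
area_mm2 = {
    "ANOC": {"router": 0.211, "gals_interface": 0.070, "clock_tree": 0.000},
    "DSPIN": {"router": 0.161, "gals_interface": 0.024, "clock_tree": 0.0016},
}

# Totals rounded to three decimals, as in the table.
totals = {noc: round(sum(parts.values()), 3) for noc, parts in area_mm2.items()}

# Relative area saving of DSPIN with respect to ANOC.
reduction = 1.0 - totals["DSPIN"] / totals["ANOC"]

for noc, total in totals.items():
    print(f"{noc}: {total:.3f} mm^2")
print(f"DSPIN area reduction: {reduction:.1%}")
```

The result is roughly one third, in line with the 33% saving quoted in the conclusion.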
6.2. Throughput
The throughput of the ANOC router depends on the fabrication
process, the supply voltage, and the temperature. The throughput
of ANOC is 160 Mflit/s in the worst case and 220 Mflit/s in
nominal conditions. Moreover, asynchronous circuits have the
advantage of automatically adapting their performance to the
process, temperature, and voltage of the circuit.
The DSPIN router throughput depends on the clock frequency. Its
throughput is one flit per clock cycle (1 Mflit/s for a clock
frequency of 1 MHz). The throughput of the DSPIN router is
289 Mflit/s in the worst case and 408 Mflit/s in the nominal
case.
Table 3 shows the throughput comparison between the ANOC and
DSPIN routers. In a real implementation, ANOC will operate at
its nominal conditions (220 Mflit/s), while the DSPIN router
should be clocked close to the worst-case frequency (289 MHz) to
improve the fabrication yield.
Table 3. Throughput comparison
Conditions | ANOC | DSPIN
Worst-case | ~160 Mflit/s | 289 Mflit/s
Nominal | ~220 Mflit/s | 408 Mflit/s
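The comparison between ANOC at nominal conditions and DSPIN at its worst-case frequency can be worked out with the small Python sketch below. The function and variable names are illustrative. The payload bandwidth is a derived figure that does not appear in the source: it assumes every flit carries a full 32-bit payload and ignores the first flit of each packet, which holds routing information.

```python
FLIT_PAYLOAD_BITS = 32  # payload bits per 34-bit flit


def dspin_throughput_mflits(clock_mhz):
    """DSPIN forwards one flit per clock cycle, so Mflit/s equals MHz."""
    return float(clock_mhz)


anoc_nominal = 220.0                         # Mflit/s, nominal conditions
dspin_worst = dspin_throughput_mflits(289)   # Mflit/s at the 289 MHz worst case

# Relative throughput gain of DSPIN (worst case) over ANOC (nominal).
gain = dspin_worst / anoc_nominal - 1.0

# Upper bound on payload bandwidth per link (derived, assumption noted above).
payload_gbits = dspin_worst * 1e6 * FLIT_PAYLOAD_BITS / 1e9

print(f"gain: {gain:.0%}, payload bandwidth: {payload_gbits:.3f} Gbit/s")
```

The gain of about 31% matches the figure given in the conclusion for the maximum sustained throughput.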
Latency | ANOC (F = 150 MHz) | DSPIN (F = 150 MHz) | ANOC (F = 250 MHz) | DSPIN (F = 250 MHz)
Intermediate router latency | 6.80 ns | 16.66 ns | 6.80 ns | 10.00 ns
First + Last latency | 60.00 ns | 56.66 ns | 47.00 ns | 34.00 ns
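The intermediate router latencies above can be converted into clock cycles to see where the break-even frequency with ANOC lies. The Python sketch below is a back-of-the-envelope check, not a statement from the authors: the 2.5-cycle router latency is inferred from the table values (16.66 ns at 150 MHz, 10.00 ns at 250 MHz), and the variable names are illustrative.

```python
ANOC_ROUTER_NS = 6.80                        # ANOC intermediate router latency
DSPIN_ROUTER_NS = {150: 16.66, 250: 10.00}   # clock in MHz -> latency in ns

# Latency in clock cycles: ns * MHz / 1000.
cycles = {mhz: ns * mhz / 1000.0 for mhz, ns in DSPIN_ROUTER_NS.items()}

# Inferred (not stated in the text): a DSPIN router takes about 2.5 cycles.
router_cycles = round(cycles[250], 1)

# Clock frequency at which a DSPIN router matches the ANOC router latency.
break_even_mhz = router_cycles / ANOC_ROUTER_NS * 1000.0

print(cycles, router_cycles, int(break_even_mhz))
```

Under this assumption, the break-even point falls at about 367 MHz, which agrees with the frequency quoted in the conclusion.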
7. Conclusion
A physical implementation of the DSPIN network-on-chip on the
generic, stream-oriented FAUST platform has been presented. The
multi-million-gate FAUST architecture using the DSPIN network
was physically implemented up to mask layout. The DSPIN
architecture was adapted to manage stream-oriented
communications. This adaptation was simple because both DSPIN
and ANOC respect the OSI reference model. A dedicated wrapper
has been designed to adapt the ANOC packet format to the DSPIN
format without modifying the network interface controllers
defined by the FAUST architecture. We demonstrated that a
network architecture designed to support shared-memory,
multi-threaded applications can efficiently support
stream-oriented applications. The DSPIN implementation is
comparable to the ANOC implementation in terms of silicon area,
throughput, latency, and power consumption. The area of DSPIN is
33% smaller than the area of ANOC. The maximum sustained
throughput of DSPIN is 31% higher than that of ANOC, considering
that ANOC operates in nominal conditions and DSPIN in worst-case
conditions. In terms of packet latency, DSPIN should be clocked
at 367 MHz or more to obtain the same packet latency as the ANOC
router. However, at that frequency, the power consumption of the
DSPIN router is three times higher than that of the ANOC router.
Therefore, ANOC is a good candidate for low-latency and
low-power applications, while DSPIN is better suited to low-area
and high-performance applications.
From a design-flow point of view, the multi-synchronous DSPIN
network is implemented using only standard cells and a
soft-macro design style. It uses neither asynchronous nor custom
cells, giving the designer complete flexibility to control the
floorplan of the circuit. On the other hand, ANOC is an
asynchronous network requiring additional standard cells, such
as Muller gates, and dedicated synthesis tools. Therefore, to
hide the complexity of the
Latency | ANOC (F = 150 MHz) | DSPIN (F = 150 MHz) | ANOC (F = 250 MHz) | DSPIN (F = 250 MHz)
Latency for 5-hop path | 80.00 ns | 106.66 ns | 68.00 ns | 64.00 ns
Latency for 9-hop path | 106.66 ns | 173.30 ns | 96.00 ns | 104.00 ns
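The path latencies above appear to follow a simple additive model, sketched here in Python. The model is an inference from the published numbers, not a formula given in the text: a path of N hops is assumed to pay the "First + Last" latency once plus the intermediate router latency for each of the remaining N - 2 routers. The function name is hypothetical.

```python
def path_latency_ns(hops, first_plus_last_ns, intermediate_ns):
    """Latency of a path crossing `hops` routers.

    Assumed model (inferred from the tables): the first and last routers
    are covered by `first_plus_last_ns`, and each of the other
    `hops - 2` routers adds `intermediate_ns`.
    """
    if hops < 2:
        raise ValueError("a path needs at least a first and a last router")
    return first_plus_last_ns + (hops - 2) * intermediate_ns


# DSPIN values from the latency tables: MHz -> (first + last, intermediate) in ns.
DSPIN_NS = {150: (56.66, 16.66), 250: (34.00, 10.00)}

for mhz, (edge_ns, mid_ns) in DSPIN_NS.items():
    for hops in (5, 9):
        latency = path_latency_ns(hops, edge_ns, mid_ns)
        print(f"DSPIN @ {mhz} MHz, {hops} hops: {latency:.2f} ns")
```

This reproduces the DSPIN columns to within rounding. The ANOC columns follow the same trend only approximately, which is plausible for an asynchronous network whose per-hop delay is not tied to a clock period.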
Module | ANOC | DSPIN (F = 150 MHz) | DSPIN (F = 250 MHz)
Router | 2.07 mW | 2.89 mW | 4.85 mW
GALS interface | 1.62 mW | 0.56 mW | 0.81 mW
Clock tree | 0.00 mW | 2.44 mW | 4.73 mW
Total | 3.69 mW | 5.89 mW | 10.39 mW
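The power figures above can be cross-checked and broken down with a short Python sketch. The dictionary layout and names are illustrative. The ratios and clock-tree shares are derived values that do not appear in the source; they are computed only from the three measured operating points, with no extrapolation to other frequencies.

```python
# Power in mW, copied from the power table.
power_mw = {
    "ANOC": {"router": 2.07, "gals_interface": 1.62, "clock_tree": 0.00},
    "DSPIN @ 150 MHz": {"router": 2.89, "gals_interface": 0.56, "clock_tree": 2.44},
    "DSPIN @ 250 MHz": {"router": 4.85, "gals_interface": 0.81, "clock_tree": 4.73},
}

totals = {name: round(sum(parts.values()), 2) for name, parts in power_mw.items()}

# Total power relative to ANOC, and share of the total spent in the clock tree.
ratio_vs_anoc = {name: total / totals["ANOC"] for name, total in totals.items()}
clock_share = {name: power_mw[name]["clock_tree"] / totals[name] for name in totals}

for name in totals:
    print(f"{name}: {totals[name]:.2f} mW, "
          f"x{ratio_vs_anoc[name]:.2f} vs ANOC, "
          f"clock tree {clock_share[name]:.0%}")
```

The breakdown shows that the clock tree accounts for a large part of the DSPIN power at both frequencies, a cost that the clockless ANOC router does not pay.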
Acknowledgments
References