
Second ACM/IEEE International Symposium on Networks-on-Chip

Physical Implementation of the DSPIN Network-on-Chip in the FAUST Architecture
Ivan Miro-Panades (1,2), Fabien Clermidy (3), Pascal Vivet (3), Alain Greiner (1)
(1) The University of Pierre et Marie Curie, 75252 Paris, France
(2) STMicroelectronics, 38921 Crolles, France
(3) CEA-Leti, MINATEC, 38054 Grenoble, France
{ivan.miro, alain.greiner}@lip6.fr, {fabien.clermidy, pascal.vivel}@cea.fr
scale [11], Silistix [14], FAUST [8], and Xpipes [17].
A 32-port SPIN NoC has been implemented in [10,13].
However, the architecture is not design-flexible and
does not support the GALS approach because it was
designed as a synchronous-centric hard macro. The
Tera-scale [11] architecture contains an 80-tile
processor architecture interconnected by a NoC. A
mesochronous technique was used to distribute a 4 GHz
clock signal over the 275 mm² die. However, its
network router takes 0.34 mm² in 65 nm CMOS, which
is several times larger than the DSPIN router presented
in this paper (when compared at the same fabrication
process) while offering the same type of service. The
Silistix CHAIN network [14] is based on packet
switching using asynchronous QDI 4-rail links and is
composed of basic elements such as muxes, demuxes, and
arbiters. The CHAIN architecture allows the GALS
strategy but offers neither a real Network-on-Chip
protocol nor Quality-of-Service features.
The DSPIN [2] network-on-chip is an evolution of
the SPIN [16] architecture, and has been developed by
the LIP6 laboratory in cooperation with
STMicroelectronics. DSPIN is a 2D mesh distributed
NoC well suited to the GALS approach. Its architecture
is synthesizable with a standard synchronous cell library,
using neither custom nor asynchronous cells. The
DSPIN architecture was initially designed to support
shared-memory multiprocessor architectures. In this
work, we present the physical implementation of the
DSPIN network in a stream-oriented, multi-application
platform, the FAUST generic architecture, developed
by CEA-Leti [8]. The main goal is to evaluate whether a
network architecture optimized for shared memory can
efficiently support stream-oriented applications. To
perform this evaluation, we replaced the
asynchronous network-on-chip (called ANOC) of the
FAUST chip with the DSPIN network-on-chip.
The FAUST and ANOC architectures are first
analyzed in section 2. The DSPIN architecture is detailed in

Abstract
This paper presents a physical implementation of
the DSPIN network-on-chip in the FAUST
architecture. FAUST is a stream-oriented multi-application
SoC platform for telecommunications
addressing IEEE 802.11a and MC-CDMA standards.
The original asynchronous network-on-chip (ANOC) of
FAUST has been replaced by the multi-synchronous
DSPIN network-on-chip. In this paper, we analyze how
the DSPIN network-on-chip, originally designed to
support shared-memory multiprocessor
architectures, can support stream-oriented
architectures. The physical implementations of both
ANOC and DSPIN are presented. Finally, a
comparison between the ANOC and DSPIN designs in a
130 nm technology is carried out in terms of area,
throughput, packet latency, and power consumption.

1. Introduction
Increasing system performance by scaling the
technology and the clock frequency becomes more and
more complex because wire delays scale poorly.
New approaches such as Network-on-Chip
(NoC) architectures and the Globally Asynchronous,
Locally Synchronous (GALS) paradigm try to solve
this design bottleneck by partitioning the circuit into
small synchronous islands that communicate
asynchronously. Each island can be clocked at an
independent clock frequency, while the
communications between neighbor islands are carried
out by the NoC. Moreover, the NoC approach attempts
to solve the bandwidth bottleneck of a central bus by
splitting the communications over a plurality of routers
and links.
A large number of NoC architectures have been
published, but there are very few detailed analyses of
their physical implementation: SPIN [16,10], Tera-

978-0-7695-3098-7/08 $25.00 2008 IEEE


DOI 10.1109/NOCS.2008.35


section 3. Section 4 compares the two NoC
architectures and describes the migration from ANOC
to DSPIN. In section 5, the step-by-step implementation
of FAUST with the DSPIN network is presented.
Finally, in section 6 the ANOC and DSPIN designs are
compared in terms of silicon area, throughput, latency,
and power consumption.

CDMA techniques, with data rates up to 100 Mbits/s.
In this paper, we focus on the Matrice receiver (RX),
which requires 10 IP-blocks from the complete FAUST
platform. For this application, the NoC interconnect
must support an aggregated throughput of up to 10.6
Gbits/s to maintain the real-time constraints imposed
by the OFDM frame rate. An OFDM frame must be
processed in less than 650 µs. A detailed description of
the frame composition and decoding method can be
found in [9].

2. FAUST Application
FAUST, which stands for Flexible Architecture of
Unified Systems for Telecom, is a hardware
demonstration platform for the 4MORE mobile
terminals. 4MORE [1] is an IST program targeting 4G
baseband modem chips. The FAUST project was
initiated in 2003 to support multiple OFDM air
interfaces in a single SoC. The FAUST architecture (Figure
1) is composed of processing units interconnected by a
NoC. It also includes an ARM946ES in an AHB
subsystem. Communication between the
functional units is carried out by message passing
through the NoC. Each processing unit contains a
programmable Network Interface Controller, which
contains input and output FIFOs and regulates the
traffic through the network. This regulation is carried
out by credits, which synchronize the producer with the
consumer in a self-synchronized data pipeline manner.
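The credit mechanism can be illustrated with a minimal Python model. This is only a sketch of generic credit-based flow control, not the FAUST NIC itself: the class name `CreditLink`, its methods, and the choice of one credit per consumer FIFO slot are assumptions made for illustration.

```python
from collections import deque


class CreditLink:
    """Illustrative producer-to-consumer link regulated by credits (hypothetical model)."""

    def __init__(self, fifo_depth):
        # Assumption: the producer starts with one credit per free slot
        # of the consumer input FIFO.
        self.credits = fifo_depth
        self.fifo = deque()

    def send(self, flit):
        """Producer side: a flit enters the network only if a credit is available."""
        if self.credits == 0:
            return False  # the producer stalls instead of congesting the network
        self.credits -= 1
        self.fifo.append(flit)
        return True

    def consume(self):
        """Consumer side: pop one flit and give one credit back to the producer."""
        flit = self.fifo.popleft()
        self.credits += 1
        return flit
```

With a two-word FIFO, a third `send` is refused until the consumer has called `consume` once, so the producer never injects data that the consumer cannot store.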

2.1. ANOC Architecture


ANOC stands for Asynchronous NoC and has been
developed by CEA-Leti [6]. ANOC is a wormhole
packet switching NoC with a 32-bit payload. Its
architecture is fully asynchronous and has been
implemented using ST standard cells and the TIMA
TAL library [15]. It is composed of five-port
routers interconnected by bidirectional links
using a send/accept asynchronous handshake protocol
(Figure 2). Thus, the ANOC protocol naturally offers a
GALS architecture. As the ANOC routers are
asynchronous, the entire end-to-end path traveled by the
packets is completely asynchronous. Only
the input and output ports of the NoC are
resynchronized to the local synchronous IP frequency,
using dedicated synchronization FIFOs [7]. Moreover,
a four-phase protocol is used on the network,
guaranteeing no metastability issues. Only the input
and output ports of the network, where synchronization
is required, are susceptible to metastability failure.

Figure 1. FAUST architecture


The FAUST chip is a multi-application platform for
4G telecom. It can support OFDM-based applications
such as the 802.11a standard, MC-CDMA [1,9] and 3GPP-LTE
protocols. All these applications share the same
set of constraints, including real-time requirements,
high throughput and low power consumption for
battery-powered devices.
In this paper, a SISO-MC-CDMA data-streaming
application called Matrice [9] is addressed. It consists
in transmitting and receiving frames using OFDM and
Figure 2. ANOC architecture


The ANOC topology is not restricted to a regular 2D
mesh. Irregular 2D mesh or torus topologies can also be
implemented, as ANOC uses a source routing algorithm.


the tool using the reset signal of the ANOC router logic
(Figure 3).
Using this pseudo-clock mechanism, asynchronous
logic timing loops are broken, and static timing
analysis of the QDI logic can be performed using
standard tools. Due to the 4-phase protocol, the
pseudo-clock frequency must be equal to 4 times the
targeted average asynchronous frequency. Once the
ANOC router hard macro was available, the standard
abstract and gds files were generated. A pseudo-synchronous
timing model of the asynchronous ANOC
router was also automatically generated using this
pseudo-clock.
For the implementation of the GALS interfaces, a soft-macro
approach was defined. This approach allowed
clock tree balancing together with the attached
synchronous unit, and also placed fewer constraints on
top-level floor-planning.
At the top level, the complete floor-planning was
done in order to place all the hard macros: ANOC
routers, SRAM memories, and the ARM946 core (Figure 4).
The place and route was done hierarchically with five
distinct partitions using the Encounter tool. Thanks to the
ANOC router hard macro, neither top-level spaghetti
routing nor the usual congestion drawbacks were
observed; only parallel routing of ANOC link
signals between ANOC routers was performed. The
timing analysis and optimization of the NoC links was
possible using the pseudo-synchronous timing model
of the ANOC router. For the GALS interfaces, the
timing optimization was nevertheless more
difficult due to the mixed-timing constraints of these
interfaces [7].

Moreover, source routing can be used to minimize the
congestion on some links, and thus reduce the packet
latency. The first flit of the packet contains the routing
information, and each router uses this path-to-target to
decide the correct routing destination. A flit is the
smallest flow control unit of the network. The routing
information is encoded on 18 bits, and two bits encode
each routing hop as shown in Figure 8a. Hence, the
routing path is limited to nine hops, H0 to H8. However,
a path extension mechanism is also proposed to extend
the routing path [6].
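The 18-bit routing field can be illustrated with a short Python sketch. The text only states that each hop takes two bits and that at most nine hops fit; the bit ordering (H0 in the least significant bits), the shift performed at each router, and the function names `encode_path` and `next_hop` are assumptions made for illustration. The real encoding is defined in [6].

```python
HOP_BITS = 2   # two bits per routing hop (from the text)
MAX_HOPS = 9   # 18 routing bits / 2 bits per hop


def encode_path(hops):
    """Pack a list of 2-bit hop codes (H0 first) into an 18-bit routing field."""
    if len(hops) > MAX_HOPS:
        raise ValueError("more than 9 hops requires the path extension mechanism")
    field = 0
    for i, hop in enumerate(hops):
        if not 0 <= hop < (1 << HOP_BITS):
            raise ValueError("a hop code must fit in 2 bits")
        # Assumption: H0 occupies the least significant bits.
        field |= hop << (HOP_BITS * i)
    return field


def next_hop(field):
    """Return (hop code used by this router, field forwarded to the next router)."""
    # Assumption: each router consumes the lowest 2 bits and shifts the rest down.
    return field & 0b11, field >> HOP_BITS
```

Under these assumptions, a router never needs to know its own position: it only reads the two bits at the bottom of the field, which is what makes irregular topologies possible with source routing.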
ANOC provides two virtual channels per NoC link:
a low-latency channel VC0 for real-time applications,
and a higher-latency, lower-priority channel VC1
for best-effort traffic.

2.2. ANOC Implementation


The ANOC design has been implemented in the
STMicroelectronics 130 nm technology, using standard
place-and-route tools (Encounter from Cadence). In the
proposed architecture, two challenges need to be
addressed: the physical implementation of the ANOC
router, which is robust QDI 4-phase/4-rail asynchronous
logic [6], and the implementation of the GALS
interfaces [7].


For the ANOC router, a hard-macro approach was
defined in order to re-use the ANOC router all over the
FAUST top floor-plan. This choice allows
proper placement of the ANOC router port signal pins
(North, East, South, West, Unit). The ANOC router
contains robust QDI logic, which is implemented using
standard cells and specific C-elements from the TAL
library [15], jointly developed by the TIMA laboratory
and CEA-Leti. Due to the un-clocked nature of the
logic, in order to optimize place and route under timing
constraints, a pseudo-clock has been emulated within

Figure 3. QDI logic optimization with pseudo-clock

Figure 4. FAUST floor-plan with ANOC


Finally, due to the GALS nature of the chip, the
clock tree of the chip consisted of 27
independent clock trees: one distinct clock tree per
synchronous IP unit. The 27 clock trees were then
easily generated one by one by the tool, with neither
timing convergence problems nor floor-planning issues
(clock congestion, and so on).

compatible with the Globally Asynchronous, Locally
Synchronous approach, where synchronous islands or
subsystems communicate asynchronously. Each
DSPIN router is clocked at the network clock
frequency, but a phase skew can exist between two
neighbor routers. Moreover, each subsystem can have
its own clock frequency, which can be independent of
the network clock frequency. Hence, the inter-router
communication is mesochronous while the
communication between router and subsystem is
asynchronous. These communications are carried out
by bi-synchronous FIFOs [3,4,5].

3. DSPIN NoC
DSPIN [2] stands for Distributed, Scalable,
Programmable, Integrated Network. It is a wormhole
packet-based NoC with a 2D mesh topology. The
packets are routed following the X-first deterministic
routing algorithm. With this algorithm, packets are first
routed in the X direction and then in the Y direction.
The routing information of a packet is encoded as
the absolute address of the destination subsystem in
the first flit of the packet. Figure 8b shows the first flit
and the following flits of the packet. DSPIN uses a
generic flit size, which has been set to 34 bits in
this implementation, providing a payload of 32 bits.
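The X-first decision taken by each router can be sketched in a few lines of Python. The sketch assumes an 8-bit destination field split into two 4-bit coordinates (consistent with the 16x16 cluster limit of Table 1), with Y in the upper bits, X growing eastwards and Y growing northwards. These orientation and bit-layout choices, and the function names, are illustrative assumptions rather than the documented DSPIN format.

```python
def split_destination(addr8):
    """Split an 8-bit destination address into (Y, X).

    Assumption: Y is held in the upper 4 bits and X in the lower 4 bits.
    """
    return (addr8 >> 4) & 0xF, addr8 & 0xF


def x_first_route(here, dest):
    """Return the output port selected by the router at `here` for a packet to `dest`."""
    (y, x), (dest_y, dest_x) = here, dest
    if dest_x != x:
        # The X coordinate is corrected first (assumption: X grows eastwards).
        return "East" if dest_x > x else "West"
    if dest_y != y:
        # Only then is the Y coordinate corrected (assumption: Y grows northwards).
        return "North" if dest_y > y else "South"
    return "Local"
```

For example, a packet addressed to cluster (2, 5) and currently at (0, 0) is sent East until it reaches column 5, then North, and is finally delivered to the Local port.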
Figure 6. DSPIN topology and clocking regions

In order to avoid metastability situations on the
mesochronous links, neighbor routers have inverted
clock phases. Thus, the bi-synchronous FIFO is able to
interface the mesochronous links even when the skew
between neighbor routers is up to 50% of the clock
period [3]. Figure 6 shows a 4x5 network architecture.
Each square defines a cluster, which contains a
subsystem and a DSPIN router. The DSPIN routers on
the black squares have their clock signal inverted while
the ones on the white squares do not. Consequently,
neighbor routers always have inverted clock signals.
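The black and white pattern is a checkerboard, which can be expressed as a parity rule on the cluster coordinates. The following Python sketch checks the property on the 4x5 mesh of Figure 6. Which parity receives the inverted clock is an assumption (cluster (0, 0) is taken as non-inverted), and the function names are illustrative.

```python
def clock_inverted(y, x):
    """True if the router of cluster (y, x) receives the inverted network clock.

    Assumption: cluster (0, 0) is a white (non-inverted) square.
    """
    return (y + x) % 2 == 1


def mesh_neighbors(y, x, rows, cols):
    """Return the North/South/East/West neighbors that exist in a rows x cols mesh."""
    candidates = [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return [(ny, nx) for ny, nx in candidates if 0 <= ny < rows and 0 <= nx < cols]


def all_links_opposed(rows, cols):
    """Check that every pair of neighbor routers has opposite clock phases."""
    return all(
        clock_inverted(y, x) != clock_inverted(ny, nx)
        for y in range(rows)
        for x in range(cols)
        for ny, nx in mesh_neighbors(y, x, rows, cols)
    )
```

Because moving to any mesh neighbor changes exactly one coordinate by one, the parity always flips, so `all_links_opposed` holds for any mesh size.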


Figure 5. DSPIN architecture


In order to address the GALS issues, the DSPIN
router itself is distributed, and is composed of five
modules. Four of them are placed on the north, south,
east, and west sides of the subsystem. The fifth, a local
module, communicates with the local subsystem through
the Network Interface Controller (NIC). Figure 5
presents the DSPIN architecture and its modules. The
local subsystem and the associated DSPIN router
compose a synchronous cluster. With this approach,
the longest wires are the intra-cluster wires (for
example, from one input module on the west side to
one output module on the east side), and cannot be
longer than the cluster size.
DSPIN is a multi-synchronous architecture
synthesizable with standard cells. Moreover, it is
4. Migration of DSPIN into FAUST

In this section, the ANOC and DSPIN NoCs are
first compared, and then ANOC is replaced by
DSPIN within the FAUST architecture.

4.1. Architecture Comparison


The DSPIN architecture, as explained in [2], was
designed for generic shared-memory multiprocessor
architectures. In order to avoid deadlocks in
request/response traffic, DSPIN contains two fully
separated sub-networks for request and response
packets, as shown in Figure 7a. Moreover, the DSPIN
NIC is very simple because the routing address (Y, X)
can be directly extracted from the MSB bits of the
destination address. Thus, the NIC does not require any
configuration.

Figure 7. Programming model

Moreover, the DSPIN routing technology can be
used with the message passing programming model, such as
the one in the FAUST platform. The shared-memory
network interface controller (NIC) must be replaced by
a stream-oriented NIC (Figure 7b). This stream-oriented
NIC has to manage end-to-end flow control
signals to avoid deadlock situations, maintain the
end-to-end FIFO synchronicity, and minimize the network
congestion. For those stream-oriented applications, the
DSPIN architecture requires just one DSPIN router per
cluster, while the shared-memory approach requires
two.

Table 1. ANOC and DSPIN architecture comparison

| | ANOC | DSPIN |
|---|---|---|
| Topology | Irregular | Regular 2D mesh |
| Router arity | 5 port router | 5 port router |
| Routing technique | Source routing | Address-based X-First algorithm |
| Switching technique | Wormhole | Wormhole |
| Flit size | 34 bits | 34 bits (generic) |
| Flit payload | 32 bits | 32 bits (generic) |
| Flow control bits on the flit | Begin of packet (BOP), End of packet (EOP) | Begin of packet (BOP), End of packet (EOP) |
| Routing overhead and capability | 18 bits, allowing 9 routing hops. Path extension is possible | 8 bits, allowing any architecture up to 16x16 clusters |
| Virtual channels | Best effort and Guaranteed service | Best effort and Guaranteed service |
| Programming model | Message passing | Shared memory (2 routers per cluster), Message passing (1 router per cluster) |
| Clocking scheme | Fully asynchronous (QDI) with GALS interfaces | Multi-synchronous with mesochronous interfaces |
| Flow control protocol | Send/accept asynchronous handshake | FIFO protocol (Write and WriteOk) |
| Metastability issues | Metastable-free inside routers (4 phase protocol); GALS FIFO interfaces on the local ports | Resolved by bi-synchronous FIFOs |
| Clock tree | None | One per router |
| Physical implementation | Hard macro | Soft macro distributed on five modules |
| Long wires | Inter-router wires | Intra-cluster wires |

Figure 8. ANOC and DSPIN packet definition

4.2. Integration of DSPIN in the FAUST Architecture

DSPIN and ANOC use similar packet formats, as
shown in Figure 8. DSPIN has a generic flit size, which
can be adjusted to fit the ANOC flit size of 34 bits. Thus,
both architectures have a 32-bit payload. ANOC uses
18 bits of the first flit for the source routing
information, while DSPIN uses 8 bits for the
destination address. Furthermore, both architectures
use the same flow control bits, Begin_of_Packet (BOP)
and End_of_Packet (EOP).
Table 1 summarizes the ANOC and DSPIN
similarities and differences.

Figure 9a shows the integration of the ANOC
router into the FAUST architecture. The ANOC router
is asynchronous while the NIC is synchronous.
Therefore, the GALS interface module contains 4
FIFOs that perform the asynchronous communication
between these two modules while buffering the data.
On the other hand, the DSPIN router uses a different
flow control protocol and routing technique than


ANOC, and integrates two of the four GALS interface
FIFOs. Therefore, a protocol_conversion module was
designed to interface the FAUST NIC with the DSPIN
router (Figure 9b). This module adapts the flow control
signals, converts the routing algorithm, and integrates
two bi-synchronous FIFOs. The routing algorithm
conversion was implemented with a Look Up Table
(LUT) where the source-routing path of ANOC is
recoded into the (Y,X) destination of DSPIN. This
solution is not optimized, as it uses a hard-wired LUT.
However, this work focused on the fair
comparison of two NoCs on the same architecture and
not on the optimization of the architecture for each
NoC. Otherwise, it would have been necessary to modify the
NIC of FAUST.
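The principle of this recoding can be illustrated with a small Python sketch that walks a source-routing path from the sender's cluster and records the cluster where it ends. This is a hypothetical software model, not the hard-wired LUT of the protocol_conversion module: hops are written here as absolute directions ("N", "S", "E", "W"), North is assumed to increase Y, and the names `path_to_destination` and `build_lut` are invented for illustration.

```python
# Assumption: absolute directions, with North increasing Y and East increasing X.
STEP = {"N": (1, 0), "S": (-1, 0), "E": (0, 1), "W": (0, -1)}


def path_to_destination(source, path):
    """Follow a source-routing path from `source` and return the (Y, X) destination."""
    y, x = source
    for hop in path:
        dy, dx = STEP[hop]
        y, x = y + dy, x + dx
    return y, x


def build_lut(source, paths):
    """Precompute one LUT entry per source-routing path used by the cluster at `source`."""
    return {tuple(path): path_to_destination(source, path) for path in paths}
```

Since a NIC only ever sends to a fixed set of targets, the table stays small: one entry per path programmed in that NIC. Note that different paths to the same target collapse to the same (Y, X) entry, which is why the path diversity of source routing is lost after conversion.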
5. DSPIN Implementation

The DSPIN architecture has been designed to be
synthesizable with standard cells and easily implemented
in a synchronous digital flow. Therefore, neither
custom nor asynchronous cells have been used. The
clock boundaries and the long wire issues have been
analyzed to minimize the timing closure effort [2].

5.1. Synthesis
We used a hierarchical approach for the physical
synthesis of the FAUST architecture with the DSPIN
NoC. Each cluster was synthesized separately, before
being assembled in the top FAUST architecture. Thus,
no RTL synthesis was performed at the top level. The
design was synthesized using STMicroelectronics
CMOS 130 nm low power standard cells.
The timing constraints for the synthesis of the DSPIN
routers were chosen to take into account the physical
implementation. Thus, the DSPIN long wires
(intra-cluster wires) were constrained with 300 ps of
propagation time. Moreover, low power standard cells
with low Vt transistors were used in conjunction with
clock-gating techniques to minimize the power
consumption.

5.2. Floorplanning

The DSPIN routers are not built using hard macros.
They are placed and routed as standard cell modules.
This property gives the designer the flexibility to
decide the best shape and position for the DSPIN
router modules. Hence, the shape and position of the routers
are constrained using regions. A region is a
floorplanning delimiter that forces all the cells of a
module to be placed inside the defined area. However,
the region does not define an exclusive area, because
cells of other modules can be placed inside this area.
The DSPIN routers were designed using five regions,
one for each module (North, South, East, West, and
Local). The placement density for these regions was
tuned to around 70%. Figure 10 shows the FAUST floor-plan
using DSPIN routers. The cluster areas are
delimited by the colored rectangles, while the N, S, E,
W, and L filled boxes denote the North, South, East,
West, and Local DSPIN router modules respectively.
The top left and bottom red boxes are reserved for
the RAC and DART hard macros. These
modules are not used by the Matrice application and
were not implemented. Nevertheless, their area was
reserved for a fair comparison.

Figure 9. IP integration detail: a) ANOC IP template, b) DSPIN IP template


The FAUST architecture does not follow a regular
2D mesh topology, which is mandatory for the DSPIN
NoC. Therefore, some FAUST connections were
rearranged in order to respect a regular mesh topology.
The FAUST mapping topology was designed for a
source routing algorithm. When the
deterministic DSPIN X-First routing algorithm was
used, some GS routing conflicts appeared, because a
source routing algorithm can avoid using some
congested links, which is not possible with a
deterministic routing algorithm. In order to respect the
real-time constraints without modifying the
mapping topology, the DSPIN BE and GS FIFOs were
dimensioned with a depth of 7 words to absorb these routing
conflicts.
Finally, the system performance for the reference
OFDM application described in section 2 was
equivalent for both the ANOC and DSPIN networks. A
more detailed and systematic comparison using
synthetic traffic is described in section 6.


interfaces these communications without a complex
back-end flow. The timing constraints file has to be
properly defined to guarantee correct tool operation.
For the asynchronous interfaces, the
set_false_path condition is set between the clock
signals of independent clock frequency. Hence,
the tool understands the asynchronous nature of
this kind of interface. Otherwise, the tool tries
unsuccessfully to synchronize non-synchronous
interfaces.
For the mesochronous interfaces, a
set_multi_cycle_path condition is added on the
output ports of the FIFO data registers. This
condition informs the tool that the contents of the
FIFO data registers are not written and read in the
same clock cycle. By construction, the reading of
a bi-synchronous FIFO data register is delayed
with respect to its writing by the synchronization latency [3].
Hence, the data is stable when it is read, the timing
paths are less constrained, and the tool can easily
handle the mesochronous interface.


Figure 10. FAUST floor-plan with DSPIN

5.3. Clock Tree


We have added a buffer or an inverter on the clock
input of each DSPIN router for the mesochronous
communications (Figure 6). These buffers/inverters are
used to build the clock tree of the DSPIN routers while
supporting the GALS approach (Figure 11). The clock
tree implementation follows four steps:
1. The buffer/inverter on the clock input pin of each
DSPIN router is manually placed in the middle of
the area occupied by the cluster. Thus, the router
clock-tree wires are as short as possible.
2. A clock tree is synthesized for each DSPIN router.
The starting point of the clock-tree is the
buffer/inverter on the clock input pin of the router.
Each clock tree is synthesized with 5% skew
target.
3. Each clock tree is characterized with its input
delay, skew, and input capacitance.
4. A top clock tree is synthesized to balance the
clock trees of all the DSPIN routers. Following the
GALS approach, the top clock tree is balanced
with a 30% skew while the leaves have a 5%
skew.
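To give an order of magnitude for these skew targets, the short Python sketch below converts them into time budgets. It assumes that the percentages are expressed relative to the network clock period, which the text does not state explicitly, and it uses the 289 MHz worst-case frequency reported in section 6.2 as an example operating point. The function names are illustrative.

```python
def clock_period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1e3 / freq_mhz


def skew_budget_ns(freq_mhz, fraction):
    """Skew budget in nanoseconds.

    Assumption: the skew target is a fraction of the network clock period.
    """
    return fraction * clock_period_ns(freq_mhz)


if __name__ == "__main__":
    f = 289.0  # MHz, worst-case DSPIN frequency quoted in section 6.2
    print(f"period           : {clock_period_ns(f):.2f} ns")
    print(f"leaf skew (5%)   : {skew_budget_ns(f, 0.05):.2f} ns")
    print(f"top skew (30%)   : {skew_budget_ns(f, 0.30):.2f} ns")
    print(f"FIFO limit (50%) : {skew_budget_ns(f, 0.50):.2f} ns")
```

Under this reading, the 30% top-level target stays below the 50% skew that the bi-synchronous FIFO tolerates between neighbor routers, which is what allows the top clock tree to be balanced loosely.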

Figure 11. DSPIN clock tree

6. Network Comparison
In this section, the ANOC and DSPIN
implementations are compared in terms of area,
throughput, latency, and power consumption, using
synthetic workload.

6.1. Area
The ANOC router was implemented as a hard
macro. Its area is 0.21 mm² with a cell density of 95%.
The GALS interface module is implemented as a soft
macro and its area is computed assuming a 95% cell
density. On the other hand, DSPIN is implemented as a
soft macro and no area is exclusively reserved for the
router. Assuming a 95% integration density, the total
area is computed in Table 2, taking into consideration
the DSPIN clock tree and the FIFO area of the GALS
5.4. Mesochronous and Asynchronous links


The communication between neighbor routers is
mesochronous, as the clock tree is not balanced,
while the communication between router and
subsystem is fully asynchronous because they use
different clock frequencies. The bi-synchronous FIFO


interfaces (ANOC requires 4 FIFOs while DSPIN
requires 2). The total DSPIN area is 33% smaller than
the ANOC area.
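The area figure can be re-derived from the per-component values of Table 2. The following Python sketch simply repeats that arithmetic; the dictionary layout and function names are illustrative.

```python
# Areas in mm², taken from Table 2.
ANOC_AREA = {"router": 0.211, "gals_interface": 0.070, "clock_tree": 0.000}
DSPIN_AREA = {"router": 0.161, "gals_interface": 0.024, "clock_tree": 0.0016}


def total_area(components):
    """Sum the router, GALS interface, and clock tree areas."""
    return sum(components.values())


def reduction_percent(reference, candidate):
    """Relative reduction of `candidate` with respect to `reference`, in percent."""
    return 100.0 * (reference - candidate) / reference


if __name__ == "__main__":
    anoc, dspin = total_area(ANOC_AREA), total_area(DSPIN_AREA)
    print(f"ANOC  total: {anoc:.3f} mm2")
    print(f"DSPIN total: {dspin:.4f} mm2")
    print(f"reduction  : {reduction_percent(anoc, dspin):.1f} %")
```

The unrounded DSPIN total is 0.1866 mm², so the reduction comes out between 33% and 34%, consistent with the 33% quoted in the text.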

will be higher than the gate delays, a multi-synchronous
architecture such as DSPIN would have higher
packet throughput than an asynchronous one such as
ANOC. Fortunately, pipeline stages can be inserted on
the long wires in order to cope with these delays and
improve the throughput, despite the added latency.

Table 2. ANOC and DSPIN router area comparison

| | ANOC | DSPIN |
|---|---|---|
| Router | 0.211 mm² | 0.161 mm² |
| Interface GALS | 0.070 mm² | 0.024 mm² |
| Clock tree | 0.000 mm² | 0.0016 mm² |
| Total | 0.281 mm² | 0.187 mm² |

6.3. Packet Latency

The minimal packet latency is the end-to-end delay
between the time a packet header enters the network
and the time it exits the network, assuming no
contention. This path can be decomposed into three parts:
First, Intermediate, and Last latency [12]. The First
latency is the time it takes the packet to cross the GALS
interface and the first router. The Last latency is the time
it takes the packet to cross the last router and the GALS
interface. The Intermediate latency is the time it takes
the packet to cross an intermediate router between the
first and the last router, as shown in Figure 12.

6.2. Throughput
The throughput of the ANOC router depends on
the fabrication process, the supply voltage, and
the temperature. The throughput of ANOC is
160 Mflit/s in worst-case and 220 Mflit/s in nominal
conditions. Moreover, asynchronous circuits have
the advantage of automatically adapting their performance to the
process, temperature, and voltage of the circuit.
The DSPIN router throughput depends on the clock
frequency. Its throughput is one flit per clock cycle
(1 Mflit/s for a clock frequency of 1 MHz). The
throughput of the DSPIN router is 289 Mflit/s in
worst-case and 408 Mflit/s in nominal conditions.
Table 3 shows the throughput comparison between
the ANOC and DSPIN routers. In a real
implementation, ANOC will operate at its nominal
throughput of 220 Mflit/s, while the DSPIN router should
be clocked close to the worst-case frequency of
289 MHz to improve the fabrication yield.
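The practical comparison point (ANOC at nominal conditions, DSPIN at its worst-case frequency) can be checked with a few lines of Python. The sketch also converts flit rates into payload bandwidth, using the 32-bit payload per flit given earlier. The function names are illustrative.

```python
PAYLOAD_BITS = 32  # payload bits carried by each 34-bit flit


def payload_gbps(mflit_per_s):
    """Payload bandwidth of one link in Gbit/s for a flit rate in Mflit/s."""
    return mflit_per_s * 1e6 * PAYLOAD_BITS / 1e9


def gain_percent(dspin_mflit, anoc_mflit):
    """Throughput advantage of DSPIN over ANOC, in percent."""
    return 100.0 * (dspin_mflit - anoc_mflit) / anoc_mflit


if __name__ == "__main__":
    anoc_nominal, dspin_worst = 220.0, 289.0  # Mflit/s, from the text
    print(f"ANOC  payload bandwidth: {payload_gbps(anoc_nominal):.2f} Gbit/s")
    print(f"DSPIN payload bandwidth: {payload_gbps(dspin_worst):.2f} Gbit/s")
    print(f"DSPIN advantage        : {gain_percent(dspin_worst, anoc_nominal):.0f} %")
```

The ratio 289/220 gives an advantage of about 31% for DSPIN, which matches the figure quoted in the conclusion.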

Figure 12. Packet latency definition


The latency of the ANOC routers was measured on
the real implementation of the FAUST circuit. The
ANOC Intermediate router latency is 6.8ns and does
not depend on the clock frequency. On the other hand,
DSPIN router latency depends on the network and
subsystem clock frequencies.
Table 4 shows the latency comparison between
ANOC and DSPIN when the subsystem is clocked
with 150MHz or 250MHz. Moreover, these
frequencies are also used to clock the DSPIN routers.

Table 3. Throughput comparison

| | ANOC | DSPIN |
|---|---|---|
| Throughput on worst-case conditions | ~ 160 Mflit/s | 289 Mflit/s |
| Throughput on nominal conditions | ~ 220 Mflit/s | 408 Mflit/s |

In terms of critical path analysis and cycle time for
long distance communications, the ANOC critical path
crosses the long wires between ANOC
routers four times, while the DSPIN critical path crosses them just once. This comes
from the fact that ANOC uses a 4-phase QDI
asynchronous protocol. Thus, the long wire delay has
four times more influence on the ANOC router
than on the DSPIN router. Consequently, in deep
submicron technologies where the interconnect delays

Table 4. Latency comparison between ANOC and DSPIN routers

| | ANOC (F = 150 MHz) | DSPIN (F = 150 MHz) | ANOC (F = 250 MHz) | DSPIN (F = 250 MHz) |
|---|---|---|---|---|
| Intermediate router latency | 6.80 ns | 16.66 ns | 6.80 ns | 10.00 ns |
| First + Last latency | 60.00 ns | 56.66 ns | 47.00 ns | 34.00 ns |

The ANOC intermediate router latency is lower
than the DSPIN one. This comes from the fact that the
DSPIN routers resynchronize the data packets on each
hop. To obtain the same intermediate router latency,
the DSPIN router should be clocked at 367 MHz.
Moreover, the first and last router latency is better
optimized on the DSPIN side.
Table 5 shows the latency of the ANOC and
DSPIN routers for 5-hop and 9-hop paths. It is clear that the
ANOC router has lower latency than the DSPIN
router for low clock frequencies. However, the
latencies are quite similar when the DSPIN clock
frequency increases.
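The DSPIN figures in Tables 4 and 5 are consistent with a simple cycle-count model, sketched below in Python. The cycle counts are not stated in the text: they are inferred here from Table 4 (16.66 ns at 150 MHz and 10.00 ns at 250 MHz both correspond to 2.5 cycles, and 56.66 ns and 34.00 ns both correspond to 8.5 cycles). The sketch also assumes that an n-hop path crosses n routers, so that n - 2 of them are intermediate routers.

```python
ANOC_INTERMEDIATE_NS = 6.8          # measured value, from Table 4
DSPIN_INTERMEDIATE_CYCLES = 2.5     # inferred from Table 4 (assumption)
DSPIN_FIRST_PLUS_LAST_CYCLES = 8.5  # inferred from Table 4 (assumption)


def dspin_latency_ns(hops, freq_mhz):
    """Contention-free DSPIN latency for a path crossing `hops` routers (hops >= 2)."""
    cycles = DSPIN_FIRST_PLUS_LAST_CYCLES + DSPIN_INTERMEDIATE_CYCLES * (hops - 2)
    return cycles * 1e3 / freq_mhz


def break_even_mhz():
    """DSPIN clock frequency at which one intermediate hop takes as long as in ANOC."""
    return DSPIN_INTERMEDIATE_CYCLES * 1e3 / ANOC_INTERMEDIATE_NS


if __name__ == "__main__":
    for freq in (150.0, 250.0):
        for hops in (5, 9):
            print(f"{hops} hops at {freq:.0f} MHz: {dspin_latency_ns(hops, freq):.2f} ns")
    print(f"break-even frequency: {break_even_mhz():.1f} MHz")
```

Under these assumptions, the model reproduces the DSPIN entries of Table 5 (about 106.7 ns and 173.3 ns at 150 MHz, 64 ns and 104 ns at 250 MHz) and gives a break-even frequency of roughly 367.6 MHz, close to the 367 MHz quoted above. The ANOC entries are measured values and are not modeled here.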

The power consumption of the ANOC router is
lower than that of the DSPIN router even at 150 MHz. This
comes from the fact that DSPIN uses larger FIFOs (7
words deep compared to 2 words deep in ANOC).
On the other hand, the GALS interface of DSPIN is
more efficient than the one of ANOC. Finally,
DSPIN requires a clock tree that consumes almost as much
power as the router itself. Consequently, the total
power consumption of ANOC is at least 37% lower
than that of DSPIN for the same application.
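The 37% figure can be re-derived from the per-component values of Table 6. The following Python sketch repeats that arithmetic; the dictionary keys and function names are illustrative.

```python
# Power in mW, taken from Table 6.
POWER_MW = {
    "ANOC": {"router": 2.07, "gals_interface": 1.62, "clock_tree": 0.00},
    "DSPIN_150MHz": {"router": 2.89, "gals_interface": 0.56, "clock_tree": 2.44},
    "DSPIN_250MHz": {"router": 4.85, "gals_interface": 0.81, "clock_tree": 4.73},
}


def total_mw(design):
    """Total power of one design: router + GALS interface + clock tree."""
    return sum(POWER_MW[design].values())


def anoc_saving_percent(dspin_design):
    """How much lower the ANOC total is than the given DSPIN total, in percent."""
    dspin = total_mw(dspin_design)
    return 100.0 * (dspin - total_mw("ANOC")) / dspin


if __name__ == "__main__":
    for design in POWER_MW:
        print(f"{design:13s}: {total_mw(design):.2f} mW")
    print(f"ANOC saving vs DSPIN at 150 MHz: {anoc_saving_percent('DSPIN_150MHz'):.1f} %")
    print(f"ANOC saving vs DSPIN at 250 MHz: {anoc_saving_percent('DSPIN_250MHz'):.1f} %")
```

The saving is about 37% against DSPIN at 150 MHz and larger at 250 MHz, which is why the text says "at least 37%".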

7. Conclusion
A physical implementation of the DSPIN networkon-chip on the generic, stream-oriented, FAUST
platform has been presented. The multi-million gates
FAUST architecture using the DSPIN network was
physically implemented up to mask layout. The DSPIN
architecture was adapted to manage stream-oriented
communications. This adaptation was simple because
both DSPIN and ANOC respect the OSI reference
model. A dedicated wrapper has been designed to
adapt the ANOC packet format into the DSPIN format
without modifying the network interface controllers
defined by the FAUST architecture. We demonstrated
that a network architecture designed to support shared
memory multi-threaded applications can efficiently
support stream-oriented applications. The DSPIN
implementation has similar performances as the ANOC
implementation in terms of silicon area, throughput,
latency, and power consumption. The area of DSPIN is
33% smaller than the area of ANOC. The maximum
sustained throughput of DSPIN is 31% higher than
ANOC throughput, considering that ANOC operates at
nominal conditions and DSPIN in worst-case
conditions. In terms of packet latency, DSPIN should
be clocked at least to 367 MHz to obtain the same
packet latency as ANOC router. However, at that
frequency, the power consumption of the DSPIN router
is three times higher than the ANOC one. Therefore,
the ANOC NoC is a good candidate for low latency
and low power applications, while DSPIN is more
suited to low area and high performance applications.
From a design-flow point of view, the
multi-synchronous DSPIN network is implemented using
only standard cells and a soft-macro design approach.
It does not use any asynchronous or custom cells,
giving the designer complete flexibility to control
the floorplan of the circuit. On the other hand, ANOC
is an asynchronous network requiring additional
standard cells such as Muller gates and dedicated
synthesis tools. Therefore, to hide the complexity of the

Table 5. Latency analysis for 5-hop and 9-hop paths

                          F = 150 MHz               F = 250 MHz
                          ANOC        DSPIN         ANOC        DSPIN
Latency for 5-hop path    80.00 ns    106.66 ns     68.00 ns    64.00 ns
Latency for 9-hop path    106.66 ns   173.30 ns     96.00 ns    104.00 ns
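The DSPIN figures in Table 5 can be read as clock-cycle counts. The following is a minimal sketch of that conversion, assuming the DSPIN columns are the 106.66/173.30 ns values at 150 MHz and the 64.00/104.00 ns values at 250 MHz. The helper name `latency_in_cycles` and the per-hop figure are illustrative inferences from the table, not quantities stated in the paper.

```python
# DSPIN latencies from Table 5, in ns, keyed by (clock frequency in MHz, hops).
DSPIN_LATENCY_NS = {
    (150, 5): 106.66,
    (150, 9): 173.30,
    (250, 5): 64.00,
    (250, 9): 104.00,
}


def latency_in_cycles(latency_ns, freq_mhz):
    """Convert a latency in ns into clock cycles at freq_mhz (hypothetical helper)."""
    return latency_ns * freq_mhz / 1000.0


# Round to the nearest integer because Table 5 only gives two decimals.
cycles = {
    key: round(latency_in_cycles(ns, key[0]))
    for key, ns in DSPIN_LATENCY_NS.items()
}

# Extra cycles per additional hop, inferred from the 5-hop and 9-hop rows.
cycles_per_extra_hop = {
    freq: (cycles[(freq, 9)] - cycles[(freq, 5)]) / (9 - 5)
    for freq in (150, 250)
}

for (freq, hops), n in sorted(cycles.items()):
    print(f"{freq} MHz, {hops} hops: {n} cycles")
print(cycles_per_extra_hop)
```

Under these assumptions the rounded cycle counts are the same at both frequencies, which suggests that the DSPIN latency in nanoseconds scales inversely with the clock frequency.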

6.4. Power Consumption


In order to analyze the power consumption of the
NoC architectures, back-annotated gate-level
simulations were performed on both architectures. The
back-annotation data was extracted from the physically
implemented architectures under typical conditions.
Both architectures computed the same OFDM frame
demodulation using real functional traffic in order to
obtain accurate power consumption estimates.
Table 6 shows the detailed power consumption
analysis for both architectures. The GALS interface
power corresponds to the power consumption of the 4
FIFOs in the GALS interface module of ANOC, and
the 2 FIFOs in the protocol_conversion module of
DSPIN.
Table 6. Power consumption comparison

                    ANOC        DSPIN           DSPIN
                                F = 150 MHz     F = 250 MHz
Router              2.07 mW     2.89 mW         4.85 mW
GALS interface      1.62 mW     0.56 mW         0.81 mW
Clock tree          0.00 mW     2.44 mW         4.73 mW
Total               3.69 mW     5.89 mW         10.39 mW
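The totals in Table 6 and the relative saving of ANOC can be cross-checked with a few lines of arithmetic. This is a minimal sketch: the dictionary layout and the helper names `component_sum` and `saving_percent` are illustrative assumptions, and the numbers are copied from Table 6.

```python
# Power figures from Table 6, in mW.
POWER_MW = {
    "ANOC": {"router": 2.07, "gals_interface": 1.62, "clock_tree": 0.00, "total": 3.69},
    "DSPIN_150": {"router": 2.89, "gals_interface": 0.56, "clock_tree": 2.44, "total": 5.89},
    "DSPIN_250": {"router": 4.85, "gals_interface": 0.81, "clock_tree": 4.73, "total": 10.39},
}


def component_sum(entry):
    """Sum the three components of one column (hypothetical helper)."""
    return entry["router"] + entry["gals_interface"] + entry["clock_tree"]


def saving_percent(lower_mw, higher_mw):
    """Relative saving of lower_mw with respect to higher_mw, in percent."""
    return 100.0 * (higher_mw - lower_mw) / higher_mw


# Check that each reported total matches the sum of its components.
for name, entry in POWER_MW.items():
    assert abs(component_sum(entry) - entry["total"]) < 0.005, name

anoc_total = POWER_MW["ANOC"]["total"]
saving_vs_150 = saving_percent(anoc_total, POWER_MW["DSPIN_150"]["total"])
saving_vs_250 = saving_percent(anoc_total, POWER_MW["DSPIN_250"]["total"])
print(f"ANOC saving vs DSPIN at 150 MHz: {saving_vs_150:.1f}%")
print(f"ANOC saving vs DSPIN at 250 MHz: {saving_vs_250:.1f}%")
```

The saving against the 150 MHz DSPIN column is the smaller of the two, which is consistent with the "at least 37%" figure quoted in the text.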

asynchronous logic, a hard-macro approach has been
used for the ANOC router design, which helps the
top-level timing optimization.
We demonstrated that the multi-synchronous
DSPIN architecture can be simply and automatically
implemented, and can directly be ported to other
CMOS process technologies, as it is fully
synthesizable.

Acknowledgments

The authors would like to thank Didier Lattard,
Yvain Thonnart, Edith Beigne, and Jean Durupt from
CEA-Leti for their contribution to this work.

References

[1] S. Kaiser et al., 4G MC-CDMA Multi Antenna system
on chip for Radio Enhancements (4MORE), Proc. of
13th IST Mobile And Wireless Communications
Summit, Lyon, France, June 2004.
[2] I. Miro-Panades, A. Greiner, and A. Sheibanyrad, A
Low Cost Network-on-Chip with Guaranteed Service
Well Suited to the GALS Approach, 1st Int. Conf. on
Nano-Networks and Workshops (Nano-Net 2006),
September 2006.
[3] I. Miro-Panades and A. Greiner, Bi-Synchronous
FIFO for Synchronous Circuit Communication Well
Suited for Network-on-Chip in GALS Architectures,
First Inter. Symposium on Network-on-Chip
(NOCS07), pp. 83-94, Princeton, NJ, May 2007.
[4] I. Miro-Panades, Buffer memory control device
(Dispositif de commande d'une mémoire tampon),
Patent FR2899985, October 2007.
[5] I. Miro-Panades, Control circuit for FIFO memory,
Patent pending.
[6] E. Beigne, F. Clermidy, P. Vivet, A. Clouard, and M.
Renaudin, An Asynchronous NoC Architecture
Providing Low Latency Service and its Multi-Level
Design Framework, Proceedings 11th Int. Symp. on
Advanced Research in Asynchronous Circuits and
Systems (ASYNC'2005), pp. 54-63, March 2005.
[7] E. Beigne and P. Vivet, Design of On-chip and
Off-chip Interfaces for a GALS NoC Architecture, Proc.
12th Int. Symp. on Advanced Research in
Asynchronous Circuits and Systems, ASYNC'2006,
Grenoble, France, pp. 172-181, March 2006.
[8] D. Lattard et al., A Telecom Baseband Circuit Based
on an Asynchronous Network-on-Chip, Proc. of Int.
Solid State Circuits Conference (ISSCC2007), San
Francisco, USA, Feb. 2007.
[9] F. Berens et al., Designing a multiple antenna
MC-CDMA SoC for beyond 3G, Embedded Systems, San
Francisco, USA, March 2005.
[10] A. Andriahantenaina and A. Greiner, Micro-network
for SoC: Implementation of a 32-port SPIN network,
Design Automation and Test in Europe (DATE 2003),
pp. 1128-1129, March 2003.
[11] S. Vangal et al., An 80-Tile 1.28TFLOPS
Network-on-Chip in 65nm CMOS, ISSCC Dig. Tech. Papers,
pp. 98-99, Feb. 2007.
[12] A. Sheibanyrad, I. Miro-Panades, and A. Greiner,
Systematic comparison between the asynchronous and
the multi-synchronous implementations of a Network
on Chip architecture, in Proc. IEEE Design,
Automation and Test in Europe (DATE07), April
2007.
[13] A. Andriahantenaina, Physical implementation of a
32-port SPIN micro-network (Implémentation
matérielle d'un micro-réseau SPIN 32 ports), PhD
thesis, The University of Pierre et Marie Curie, France,
Jan. 2006.
[14] A. M. Scott et al., Asynchronous on-Chip
Communication: Explorations on the Intel PXA27x
Processor Peripheral, 13th IEEE Int. Symp. on
Asynchronous Circuits and Systems (ASYNC'07),
2007.
[15] P. Maurine, J.B. Rigaud, F. Bouesse, G. Sicard, and M.
Renaudin, Static Implementation of QDI
Asynchronous Primitives, 13th International
Workshop on Power and Timing Modeling,
Optimization and Simulation (PATMOS 2003), Torino,
Italy, pp. 181-191, Sept. 2003.
[16] P. Guerrier and A. Greiner, A generic architecture for
on-chip packet-switched interconnections, Proc.
Design Automation and Test in Europe (DATE00),
pp. 250-256, March 2000.
[17] A. Pullini et al., NoC Design and Implementation in
65nm Technology, First International Symposium on
Networks-on-Chip (NOCS 2007), pp. 273-282,
Princeton, NJ, 7-9 May 2007.
