
2015 1st International Conference on Next Generation Computing Technologies (NGCT-2015)

Dehradun, India, 4-5 September 2015

Design of a Dynamic Depth High-Throughput Multi-clock FIFO for the DSPIN

Rajeev Kamal∗ and Juan M. Moreno Arostegui†


∗ PhD Student, Electronic Engineering Department, Universitat Politecnica de Catalunya, Barcelona, Spain
† Associate Professor, Electronic Engineering Department, Universitat Politecnica de Catalunya, Barcelona, Spain

Abstract: Clock distribution within chip multiprocessors (CMPs) and systems-on-chip (SoCs) is becoming more difficult as the number of processing elements increases, and communication between those components is becoming even more critical. In recent years, researchers have proposed the Globally Asynchronous Locally Synchronous (GALS) clocking scheme to reduce clock skew, power, and energy consumption in CMPs and SoCs. In this paper we demonstrate a dynamic-depth multi-synchronous first-in first-out (FIFO) buffer that is useful for transferring data between two processing elements within a Distributed Scalable Predictable Interconnect Network (DSPIN). We also demonstrate dynamic calculation of the FIFO depth from the two clock frequencies and the packet size of the incoming data.

Keywords: NoC, Asynchronous, FIFO, DSPIN, On-chip interconnection networks, Router microarchitecture.

I. INTRODUCTION

It is becoming more difficult and expensive to distribute a global clock without skew within systems-on-chip and chip multiprocessors because of shrinking technologies and design sizes. Globally Asynchronous Locally Synchronous (GALS) systems provide a better alternative for CMPs and SoCs [1]. A GALS system contains several synchronous islands, each operating with its own clock frequency and phase, and these islands are connected to each other by means of multi-synchronous FIFOs. Achieving this consistently and efficiently is a key challenge in GALS system design [2].

One structure that is well suited to the interface between different synchronous blocks is the bi-synchronous, or multi-synchronous, first-in first-out (FIFO) buffer [3]. The basic FIFO buffer must be extended to accommodate two independent clock inputs. Data is written into the FIFO buffer with reference to the write clock domain and read with reference to the read clock domain. In this way, data can be passed smoothly between two clock domains without metastability problems [4]. Important applications of the bi-synchronous FIFO are the router-router interface and the router-IP interface within DSPIN.

The presented architecture facilitates the transfer of data between modules that lie in completely unrelated clock domains, and it also provides dynamic depth calculation of the buffer to avoid wasting FIFO space [5]. It is particularly useful in applications where buffer size, rather than latency, is the critical constraint, as in many NoC applications [6].

A. Previous related work

Dally, Poulton, and Balch present a top-level view of a bi-synchronous FIFO architecture [7], but the detailed microarchitecture is not available in the literature. Ebergen [8] and Molnar et al. [9] discuss fully asynchronous FIFOs, but these designs do not use clocks, which makes it difficult to apply them to synchronization between different clock domains.

Table I provides information about several different bi-synchronous FIFO designs. In the work by Greenstreet, the individual island clocks are derived from a global clock, so they have the same frequency but different phases, as in a mesosynchronous system [10]. Chakraborty et al. presented a FIFO design that first takes time to develop an estimate of the frequency difference before transferring data [11]. A linear FIFO architecture for data synchronization is presented by Seizovic; it has some limitations in terms of initial latency [12]. Apperson et al. presented a scalable and robust bi-synchronous FIFO architecture, but its memory size is fixed, so it does not provide high throughput cost-effectively [13]. Similarly, Panades and Greiner proposed a FIFO architecture that is well suited to GALS systems but not to mesosynchronous systems [14]. Chelcea and Nowick proposed an alternative FIFO architecture for GALS systems [15]. That design is based on a register file in which each register has its own full and empty flags, a style that is suitable for small FIFOs only.

This work uses a register file as the buffer element, which supports high speed and improves FIFO latency. Unlike most recent work, it also includes a configurable buffer depth (size). The proposed design has been implemented using parametric HDL on Xilinx FPGA devices. The paper is organized as follows. Section II describes the fundamental architectures of the various types of synchronous FIFO and their key parameters. Section III introduces metastability and synchronization issues and their solutions. Section IV discusses the proposed dynamic-depth FIFO architecture. Section V describes the HDL implementation of the proposed bi-synchronous FIFO.

II. SYNCHRONOUS FIFO

A. Serial input serial output FIFO or Linear FIFO

This section discusses the fundamental principles and practices of the basic synchronous FIFO structure called the linear FIFO.

978-1-4673-6809-4/15/$31.00 ©2015 IEEE

TABLE I. COMPARISON OF VARIOUS BI-SYNCHRONOUS FIFO

The simplest form of FIFO consists of flip-flops connected like a serial-input serial-output shift register, as shown in Figure 1. Data enters serially at one end and propagates through every flip-flop until it reaches the end of the register. Since all data movement is controlled by a single clock, it is called a synchronous FIFO.

Fig. 1. Serial input serial output FIFO or Linear FIFO block diagram

B. Elastic Buffer FIFO model

Alternatively, the elastic buffer (EB) is the most primitive form of a register (or buffer) that implements the ready/valid handshake protocol [17]. The EB at the sender implements a dual interface: it accepts new data from its internal logic and transfers the available data to the link, as shown in Figure 2, when the valid and ready signals are both at logic high. An EB can therefore be built around a FIFO queue. An abstract FIFO provides a push and a pop interface and informs its connecting modules when it is full or empty. The abstract FIFO model does not provide any guarantee on how a push to a full queue or a pop from an empty queue is handled; the AND gates outside the FIFO provide such protection. A push (write) is done when valid data is present at the input of the FIFO and the FIFO is not full.

At the read side, a pop (read) occurs when the upstream channel is ready to receive new data and the FIFO is not empty, i.e., it has valid data to send. On both sides of the EB, a transfer to or from the FIFO occurs when the corresponding ready and valid signals are both asserted (as implemented by the AND gates in front of the push and pop interfaces).

Fig. 2. Elastic Buffer FIFO model block diagram

C. Circular FIFO

These approaches have major drawbacks in terms of high latency, poor power efficiency, and low memory density at larger FIFO sizes. Since data must flow through every element of the FIFO, latency and power both increase. They also require additional circuits for extra control signals such as ready and valid, i.e., a large circuit area per bit. In recent publications, many extensions of the basic FIFO architecture have been developed with some elementary differences in the datapath, e.g., the parallel FIFO, square FIFO, tree FIFO, and folded FIFO [18].

In recent years, a new synchronous FIFO architecture called the circular FIFO has been proposed, as shown in Figure 3. It consists primarily of a circular buffer built from an array of arbitrarily addressable memory elements, supporting high energy efficiency and throughput with low latency. Its scalability is radically improved because the data and clock signals are not affected by the FIFO size.

It generates two control flags, full and empty, to ensure valid reads from and writes to the FIFO memory. When the read and write pointers alone define the full and empty conditions, the pointers must always be compared. For the empty condition, the value of the read pointer must equal the value of the write pointer. For the full condition, the address is first widened by one bit; the FIFO is full when the lower bits of the two pointers are equal and the XOR of their MSBs is 1. The following inequality must be satisfied for correct operation of the FIFO:

rptr ≤ wptr ≤ rptr + N (1)

where rptr is the read pointer, wptr is the write pointer, and N is the number of words.

III. SYNCHRONIZATION AND METASTABILITY

Synchronization is a fundamental problem in a digital system that lacks a single global timing reference. It is the process by which an ordering is imposed on the events occurring on the signal lines. The timing relationship between a signal and a clock can be classified in five ways; the types and their definitions are discussed in [19]. If the input of a flip-flop changes during the setup or hold time window, the flip-flop enters an unstable state, called the metastable state, in which the output shows an in-between voltage value.

Metastability is a fundamental problem of modern digital systems; it arises whenever a flip-flop does not receive a stable input value near the active (positive or negative) edge of the system clock. Metastability resolves within the flip-flop itself after the resolution time $T_r$, the time required to reach a stable state. It is characterized by the probability distribution function (PDF)

$p(T_r) = e^{-T_r/\tau}$ (2)


where $\tau$ is the decay time constant, which depends on the electrical properties of the flip-flop. The average time interval between two synchronization failures is known as the mean time between synchronization failures (MTBF), and it is expressed as a function of $T_r$:

$\mathrm{MTBF}(T_r) = \dfrac{e^{T_r/\tau}}{\omega \cdot f_{clk} \cdot f_d}$ (3)

where $\omega$ is the susceptible time window, $f_d$ is the rate of change of the input data, and $f_{clk}$ is the system clock frequency.

Fig. 3. Circular FIFO model block diagram

Fig. 4. All types of synchronizers

TABLE II. SAMPLE OF MTBF($T_r$) COMPUTATION

Synchronization methods are used to remove or reduce the probability of metastability. Table II shows that increasing $T_r$ increases the MTBF. The table assumes $f_{clk}$ = 50 MHz, a data rate of $0.1 f_{clk}$, $\omega$ = 0.1 ns, and $\tau$ = 0.5 ns.
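As a rough illustration of how strongly the MTBF depends on the resolution time, the sketch below evaluates equation (3) in its standard form, MTBF($T_r$) = $e^{T_r/\tau} / (\omega f_{clk} f_d)$, with the parameters quoted for Table II. The function name `mtbf_seconds` and the resolution times in the loop are illustrative assumptions; they are not taken from the table itself.

```python
import math


def mtbf_seconds(t_r: float, tau: float, omega: float,
                 f_clk: float, f_d: float) -> float:
    """Return MTBF(T_r) = exp(T_r / tau) / (omega * f_clk * f_d), in seconds.

    All arguments are in SI units (seconds and hertz).
    """
    if tau <= 0 or omega <= 0 or f_clk <= 0 or f_d <= 0:
        raise ValueError("tau, omega, f_clk and f_d must be positive")
    return math.exp(t_r / tau) / (omega * f_clk * f_d)


# Parameters quoted in the text for Table II.
F_CLK = 50e6         # system clock frequency, Hz
F_D = 0.1 * F_CLK    # rate of change of the input data, Hz
OMEGA = 0.1e-9       # susceptible time window, s
TAU = 0.5e-9         # decay time constant, s

# Resolution times chosen for illustration only (assumption).
for t_r_ns in (0.0, 2.5, 5.0, 10.0):
    mtbf = mtbf_seconds(t_r_ns * 1e-9, TAU, OMEGA, F_CLK, F_D)
    print(f"T_r = {t_r_ns:4.1f} ns -> MTBF = {mtbf:.3e} s")
```

Because $T_r$ appears in the exponent, each additional $\tau$ of resolution time multiplies the MTBF by $e$, which is why giving the synchronizer more time to settle is so effective.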
Three types of synchronizer are available: single-FF, double-FF, and triple-FF synchronizers, as shown in Figure 4. They are used to remove metastability within a digital system. In the diagram, the synchronizer flip-flops provide sufficient resolution time for a metastable state to settle to one of the stable states of the input signal.

IV. PROPOSED DYNAMIC DEPTH BI-SYNCHRONOUS FIFO

This section describes the proposed architecture of a dynamic-depth bi-synchronous FIFO that supports data transfer between two arbitrary clock domains. It also describes the depth-calculation algorithm of the proposed FIFO and the microarchitectures of its sub-modules, such as full-flag generation, empty-flag generation, and the Gray-code pointers.

A. Overall Architecture

The working architecture of the proposed bi-synchronous FIFO, which is used to transfer data between a sender and a receiver, is shown in Figure 5. The sender clock (wr_clk) and the receiver clock (rd_clk) are unrelated in phase and frequency. Before any data is sent, the depth calculator computes the FIFO depth from the wr_clk, rd_clk, and pkt_size signals, which carry the write clock frequency, the read clock frequency, and the packet size of the incoming data. On the basis of the depth_in signal, the FIFO memory fixes the size of the variable register file, as shown in Figure 5. The full-flag generator and the empty-flag generator calculate the write address and read address signals, respectively, for the FIFO memory. A high-level diagram of the blocks within the proposed bi-synchronous FIFO is shown in Figure 6.

B. FIFO depth calculation

A bi-synchronous FIFO is very useful for synchronizing between two different clock domains, so the depth calculation is very important for passing data safely between unrelated clock domains. The depth of the bi-synchronous FIFO depends on three things: the write clock frequency, the read clock frequency, and the size of the input data packet.

1) Formula for depth calculation:

$d = \left[ p - \left( \dfrac{f_{r\_clk}}{f_{w\_clk}} \cdot p \cdot \dfrac{1}{R_d} \right) \right]$ (4)

where
d = depth of the proposed FIFO;
p = packet size of the input data;
$f_{r\_clk}$ = read-side clock frequency;
$f_{w\_clk}$ = write-side clock frequency;
$R_d$ = read-side delay between reads.

2) Example: Consider the following situation, given data for both sides of the FIFO.


Write clock frequency = 15 MHz ($f_{w\_clk}$)
Maximum size of the burst = 100 bytes (p)
Read clock frequency = 10 MHz ($f_{r\_clk}$)
Delay between reads = 2 clock cycles ($R_d$)

$d = \left[ 100 - \left( \dfrac{10}{15} \cdot 100 \cdot \dfrac{1}{2} \right) \right] = \left[ 100 - \dfrac{500}{15} \right] = 67$

So we need to design a FIFO that is 67 deep.

Fig. 5. Overall architecture of the proposed FIFO

V. DETAILED ARCHITECTURE OF PROPOSED FIFO

The complete architecture of the proposed bi-synchronous FIFO is shown in Figure 6. This high-level diagram contains four major blocks: the register file array, full-flag generation, empty-flag generation, and the depth calculator. The WRITE logic is shown on the left-hand side of Figure 6 and the READ logic on the right-hand side.

Fig. 6. High-level block diagram of proposed bi-synchronous FIFO

On the READ side, the FIFO calculates whether or not it is empty; on the basis of the Rd_enable signal, the receiver can consume all the data available in the FIFO memory. On the WRITE side, the FIFO indicates whether or not it is full. The sender should only send data, asserting the wr_en signal, when the FIFO is not full. Likewise, on the READ side the FIFO indicates whether or not it is empty before the receiver takes data. In short, the sender cannot write to the FIFO memory while the full signal is asserted, and the receiver cannot read from the FIFO memory while the empty signal is asserted.

Knowing how the FIFO pointers work is very useful for understanding the FIFO design. There are two pointers, the write pointer and the read pointer. The write pointer always points to the next word to be written, and the read pointer always points to the current FIFO word to be read. Initially, or on reset, both pointers are set to zero. On a write operation, once the memory location addressed by the write pointer has been written, the write pointer is incremented to point to the next location to be written. Similarly, once the location indicated by the read pointer has been read, the read pointer is incremented to point to the next location to be read.

A. Gray Coding and Address Pointers

The suggested architecture uses two pointers, the READ and WRITE address pointers, to track the vacancy of the FIFO. The pointer width is $\log_2(N) + 1$ bits, where N is the number of memory words. The extra bit is required for the generation of the full flag and the empty flag.

Because of the metastability issue, the pointers are converted to Gray code before they cross clock domains. In the other clock domain they are converted back to binary, because binary arithmetic is simple and easy to understand. Since Gray code changes only a single bit between successive values, the chance of a metastability problem during clock domain crossing is very small. Synchronizing a binary count value from one clock domain to another, by contrast, is challenging: when the pointer value changes from 0111 to 1000, for example, all bits change, which increases the probability of metastability. Binary-to-gray and gray-to-binary conversion require a special circuit based on XOR operations. In the binary-to-gray case, an n-bit binary vector $(b_{n-1}, b_{n-2}, \ldots, b_1, b_0)$ is converted to an n-bit Gray-coded vector $(g_{n-1}, g_{n-2}, \ldots, g_1, g_0)$ as shown in equation (5), where $\oplus$ denotes the XOR function.

$g_{n-1} = b_{n-1},\; g_{n-2} = b_{n-1} \oplus b_{n-2},\; g_{n-3} = b_{n-2} \oplus b_{n-3},\; \ldots,\; g_1 = b_2 \oplus b_1,\; g_0 = b_1 \oplus b_0$ (5)

Similarly, in the gray-to-binary case, an n-bit Gray-coded vector $(g_{n-1}, g_{n-2}, \ldots, g_1, g_0)$ can be converted back to binary.


The n-bit binary vector $(b_{n-1}, b_{n-2}, \ldots, b_1, b_0)$ is recovered as shown in equation (6), where $\oplus$ denotes the same XOR function.

$b_{n-1} = g_{n-1},\; b_{n-2} = b_{n-1} \oplus g_{n-2},\; b_{n-3} = b_{n-2} \oplus g_{n-3},\; \ldots,\; b_1 = b_2 \oplus g_1,\; b_0 = b_1 \oplus g_0$ (6)

Fig. 7. Four-bit binary sequence for the pointers

B. Handling of pointers

This subsection deals with the handling of the binary pointers shown in Figure 7. From the basics of the binary number system, one can observe that the lower bits form a repeated sequence that differs only in the MSB. For example, a 4-bit binary number counts from 0 to 7 with MSB 0, and this sequence repeats with MSB 1, as shown in Figure 7. Using this property, the full flag and the empty flag can be generated easily. The lower bits are used for addressing the memory buffer locations.

• Empty Flag Generation: In the proposed bi-synchronous FIFO architecture, the empty flag is produced on the right side, in the read clock domain. It is generated as soon as the register file is unoccupied. From Figure 7, the FIFO is empty when the synchronized write pointer is equal to the read pointer. The condition for empty flag generation in the read clock domain is

rptr[(n − 1) downto 0] = syncw_ptr[(n − 1) downto 0] (7)

• Full Flag Generation: Similarly, in the proposed bi-synchronous FIFO architecture, the full flag is produced on the left side, in the write clock domain. It is generated as soon as the register file is fully occupied. From Figure 7, the FIFO is full when the synchronized read pointer is equal to the write pointer except for the MSBs. The condition for full flag generation in the write clock domain is

wptr[n − 1] != syncr_ptr[n − 1]
wptr[(n − 2) downto 0] = syncr_ptr[(n − 2) downto 0] (8)

VI. CONCLUSION

The proposed bi-synchronous FIFO design is well suited to many applications, especially at the interface between two different clock domains. It can be used as a drop-in module at the router interface of a multi-synchronous network-on-chip. The design provides high robustness, variable-size register files, good energy efficiency, support for high-frequency clocks, and good scalability.

The FIFO architecture is implemented in parametric VHDL, and synthesis is performed using Xilinx ISE 12.1. Functional simulation and verification are performed with Modelsim ISE 6.0d.

ACKNOWLEDGMENT: The authors are grateful to their respective university and tutors for their help and support.

REFERENCES
[1] L. A. Plana, S. B. Furber, S. Temple, M. Khan, Y. Shi, J. Wu, and S. Yang, A GALS Infrastructure for a Massively Parallel Multiprocessor, in Design Test of Computers, IEEE, vol. 24, no. 5, pp. 454-63, Sept. 2007.
[2] A. Chattopadhyay and Z. Zilic, GALDS: a complete framework for designing multiclock ASICs and SoCs, in IEEE Trans. on VLSI Systems, vol. 13, no. 6, pp. 641-654, June 2005.
[3] A. E. Sjogren and C. J. Myers, Interfacing synchronous and asynchronous modules within a high-speed pipeline, in IEEE Trans. on VLSI Systems, vol. 8, no. 5, pp. 573-583, Oct. 2000.
[4] D. J. Kinniment, A. Bystrov, and A. Yakovlev, Synchronization Circuit Performance, in IEEE Journal of Solid-State Circuits, vol. 37, pp. 202-209, 2002.
[5] R. E. Perego and F. A. Ware, Memory device and system having a variable depth write buffer and preload method, in US 7380092 B2, May 27, 2008.
[6] F. Jafari, L. Zhonghai, A. Jantsch, and M. H. Yaghmaee, Buffer Optimization in Network-on-Chip Through Flow Regulation, in IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 29, no. 12, pp. 1973-1986, Dec. 2010.
[7] W. J. Dally and J. W. Poulton, Digital Systems Engineering, Cambridge, in Cambridge University Press, 1998.
[8] J. Ebergen, Squaring the FIFO in GasP, in Proc., Asynchronous Circuits and Systems, pp. 194-205, Mar. 2001.
[9] C. E. Molnar, I. W. Jones, W. S. Coates, and J. K. Lexau, A FIFO ring performance experiment, in Proc., Adv. Res. in Asynchronous Circuits and Systems, pp. 194-205, Mar. 2001.
[10] M. R. Greenstreet, Implementing a STARI chip, in Proc., Int. Conf. on VLSI in Computers and Processors, pp. 38-43, Oct. 1995.
[11] A. Chakraborty and M. R. Greenstreet, Efficient self-timed interfaces for crossing clock domains, in Proc., Asynchronous Circuits and Systems, pp. 78-88, May 2003.
[12] J. N. Seizovic, Pipeline synchronization, in Proc., Advanced Research in Asynchronous Circuits and Systems, pp. 87-96, Nov. 1994.
[13] R. W. Apperson, Z. Yu, M. J. Meeuwsen, T. Mohsenin, and B. M. Baas, A Scalable Dual-Clock FIFO for Data Transfers Between Arbitrary and Haltable Clock Domains, in IEEE Trans. on VLSI Systems, vol. 15, no. 10, pp. 1125-1134, Oct. 2007.
[14] I. M. Panades and A. Greiner, Bi-Synchronous FIFO for Synchronous Circuit Communication Well Suited for Network-on-Chip in GALS Architectures, in First Int. Symposium on NoC, pp. 83-94, May 2007.
[15] T. Chelcea and S. M. Nowick, A low-latency FIFO for mixed-clock systems, in Proc., IEEE Computer Society Workshop on VLSI, pp. 119-126, Apr. 2000.
[16] C. Cummings, Simulation and synthesis techniques for asynchronous FIFO design, in Synopsys Users Group, San Jose, CA, 2002.
[17] G. Dimitrakopoulos, A. Psarras, and I. Seitanidis, Microarchitecture of Network-on-Chip Routers, A Designers Perspective, in Springer New York Heidelberg Dordrecht London, 2015.


[18] Y. Xiao and R. Zhou, Low latency high throughput circular asynchronous FIFO, in Tsinghua Science and Technology, pp. 812-816, Dec. 2008.
[19] P. P. Chu, RTL Hardware Design Using VHDL, Coding for Efficiency, Portability, and Scalability, in John Wiley Sons, 2006.
