
INTEGRATION, the VLSI journal 55 (2016) 30-42

Contents lists available at ScienceDirect

INTEGRATION, the VLSI journal


journal homepage: www.elsevier.com/locate/vlsi

BiLink: A high performance NoC router architecture using bi-directional link with double data rate

Jingyang Zhu a,*, Zhiliang Qian b, Chi-Ying Tsui a

a Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong
b Department of Micro- and Nano- Electronics, Shanghai Jiao Tong University, Shanghai, China

Article info

Article history:
Received 8 January 2015
Received in revised form 22 February 2016
Accepted 22 February 2016
Available online 2 March 2016

Keywords:
Network-on-Chip (NoC)
Bi-directional link
Double data rate

Abstract

This paper presents a novel high performance Network-on-Chip (NoC) router architecture design using a bi-directional link with double data rate (BiLink). Ideally, it can provide as high as 2 times speed-up compared with the conventional NoC router. BiLink utilizes an extra link stage between routers and transmits two flits in one link per cycle using phase pipelining if both routers need to use the current link. To further increase the effective bandwidth, the direction of each link can be configured in every clock cycle to cater for different traffic loads from each side. Therefore, the data rate can be as high as 4 times that of conventional NoC routers under uneven traffic. A centralized mode control scheme is implemented using a finite state machine (FSM) approach. Cycle-accurate simulations are carried out on both synthetic traffic patterns and real application benchmarks. Simulation results show that BiLink can provide as high as 90% and 250% speedup compared with conventional NoC routers for even and uneven traffic, respectively. 2X and 3X gains in throughput are obtained under even and uneven traffic, respectively, when compared with the conventional NoC router for the virtual channel flow control. The BiLink router architecture is synthesized using TSMC 65 nm process technology and it is shown that an area overhead of 28% over state-of-the-art bi-directional NoC is introduced while the critical path is about 9% higher than that of the conventional routers. Despite the overhead in critical path and power consumption, a 47.45% improvement of Energy-Delay-Product (EDP) is achieved by BiLink under high injection rate traffic.

© 2016 Elsevier B.V. All rights reserved.

* Corresponding author.
E-mail addresses: jzhuak@ust.hk, eetsui@ust.hk (J. Zhu), qianzl@sjtu.edu.cn (Z. Qian).

1. Introduction

Network-on-Chip (NoC) has become a promising approach to solve the communication bottleneck in the modern many-core system-on-chip. With the potential deployment of many-core systems on new applications such as big data, artificial intelligence and deep machine learning, the NoC router is required to transfer a larger amount of communication data among processors. For example, the Google Brain project [14,8] uses 1000 machines to train a deep neural network. Each machine contains 16 cores and a subset of the neural network is mapped onto each of them [8]. The requirement of the data bandwidth is high and uneven due to the interleaving of the feed-forward and back-propagation training phases. To address the intensive bandwidth requirement of these applications, a higher throughput NoC router architecture is essential for the next generation of many-core systems.

As the traffic pattern is usually unevenly distributed across the network [13], self-reconfigurable router architectures have been proposed [13,6,20,2] to improve the NoC performance by adapting the direction of the links to the run time traffic conditions. A bi-directional NoC (BiNoC) router architecture was introduced in [13,6] to cater for the uneven traffic patterns. However, most of the work on existing reconfigurable NoC architectures has focused on optimizing the design of the router itself. The optimization of the interconnection between two neighboring routers is rarely touched. On the other hand, in the domain of communications, the introduction of network coding [1] provides an optimized way to use the channel bandwidth and achieves a significant improvement in the system throughput. Borrowing the concept of network coding, in [5], an extra coding unit was inserted between each pair of routers to enable the data transmission from both ends over a single physical channel simultaneously.

http://dx.doi.org/10.1016/j.vlsi.2016.02.006
0167-9260/© 2016 Elsevier B.V. All rights reserved.

In this work, to address the high bandwidth requirement for the next generation NoC architecture, we propose BiLink, a new NoC router architecture using bidirectional double data rate links. More specifically, in BiLink, a customized link stage is designed to transmit two flits over one physical channel in each cycle in a phase pipelined fashion. To further increase the effective bandwidth, the direction of each link can be configured to cater for the uneven distribution of the traffic loads. A centralized controller is implemented using an FSM to dynamically determine the operation mode to support BiLink transmission. In this way, data are transmitted on both clock edges to maximize the potential throughput of the NoC router, leading to a better solution for future data-intensive applications.

Cycle accurate simulations were executed to verify the performance improvement of the proposed BiLink architecture. Simulation results show that the proposed BiLink architecture can achieve 90% and 60% improvements in the saturation injection rate compared to Bi-directional (BiNoC) router architectures [13] for even and uneven traffic distributions, respectively. Furthermore, BiLink also has a 250% improvement over the conventional NoC router for the uneven traffic distribution. In summary, this work brings the following contributions:

• We combine the idea of self-reconfigurable router structures with a double data rate link for NoC and achieve a significant performance improvement through this joint optimization.
• We implement the proposed BiLink structure to verify the performance as well as the hardware overhead.
• We propose three variants of BiLink architecture and perform a thorough analysis on the performance and implementation tradeoff of these structures.

The remainder of the paper is organized as follows. In Section 2, we discuss the basic idea of the normal double data rate bidirectional link (BiLink) and analyze its timing issue. In Section 3, a self-reconfigurable direction control scheme, namely aggressive bidirectional link (A-BiLink), is proposed. In Section 4, the detailed hardware implementation of BiLink and A-BiLink is addressed. In addition, a new variant of A-BiLink which is more suitable for hardware implementation is presented in this section. Simulation and hardware synthesis results are shown in Section 5 and the related work is discussed in Section 6. Finally, Section 7 concludes this work.

2. Bidirectional link stage

To understand the basic principle behind the bidirectional link (BiLink), we will first discuss the data flow in BiLink. Then the related timing issues will be analyzed to show that BiLink can work properly under different timing constraints.

2.1. Motivation for exploring BiLink

In both uni-directional and bi-directional NoCs, the data transfer occupies the entire clock cycle. In this work, we propose to further improve the throughput by allowing the transmission of two flits over the channel in every cycle. More specifically, we use both phases of the clock to transmit two different flits. In the first phase of the clock cycle, the routers at both ends of the link send the flits to the link module in the middle of the link (shown in Fig. 1(a)). Then, in the next phase, the link module sends the two flits to the corresponding destination routers (shown in Fig. 1(b)). Compared to the conventional transfer mode, the transfer data rate is doubled using the proposed BiLink scheme and it can transfer up to four flits between routers R1 and R2 in every clock cycle.

Fig. 1. Data transfer mode for BiLink.

The main function of the intermediate link module is to isolate the flits from both routers at the two ends. For the link stage, two D Flip-Flops (DFFs) are required to store the data received from each side during the first half cycle. Moreover, two switches are used to control the direction of the data flow, in order to avoid overwriting the data originally stored in the registers. Fig. 2(a) shows the hardware implementation of the link stage. When the clock phase is high, the switches S1 and S2 are open and the flits transmitted from both sides will be stored into these two DFFs in the link stage, respectively. Then, at the second phase of the clock cycle, S1 and S2 will be closed. The two DFFs will transmit the stored flits to the corresponding destination. For the router side, the output stage of each router has a similar structure to synchronize with the link stage. It sends flits in the first half clock cycle and receives flits in the second half as shown in Fig. 2(b).

Fig. 2. Hardware implementation of BiLink: (a) link stage; (b) router's output.

Fig. 3. Aggressive transmission mode for BiLink (A-BiLink).

2.2. Analysis of the timing constraints for BiLink

With the insertion of the link module, we need to analyze the impact on the timing of the overall system under reasonable clock skew and jitter assumptions.

First we investigate whether the insertion of the link stage will affect the clock frequency performance of the system. The datapath of a router consists of 2 parts, the inner pipeline stage and the link transfer stage. As will be shown in the simulation results in Section 5, the critical path of the inner pipeline stage of the router for the BiLink architecture is similar to that of the BiNoC. For the link transfer delay, the insertion of the link module will not cause extra delay. If the long wire delay of the link transfer is the critical path of the design, adding a link module in the middle breaks the long wire in half. Therefore the total delay of driving the long wire will be decreased instead and the overall critical path, which includes the clock to Q delay and the setup time of the DFF inserted in the link module, will be shortened.

We designed and laid out the link stage and the router's output stage in TSMC 65 nm process, and used it to drive different lengths of wires. We simulated the performance of the overall link transfer using HSPICE under a clock skew of 10% of the clock period [19]. The results show that the wire with a link stage is always better in terms of critical path performance than that without a link stage.

The hold time constraint of the link stage also has to be satisfied. The hold time of the DFF in the link module due to the datapath through the wire is easily satisfied because of the large delay of the long wire even under 10% positive clock skew. For the hold time requirement due to the inner loop with the link module as shown in Fig. 2(a), the following timing condition needs to hold:

$$t_{clkQ} + t_{delay} \geq t_{hold} \qquad (1)$$

where $t_{clkQ}$, $t_{hold}$ and $t_{delay}$ are the clock to Q delay and the hold time of the DFF, and the delay of the switch, respectively. Since the two DFFs are placed close to each other, we can assume the clock skew is negligible. From the cell library information of the TSMC 65 nm technology, the intrinsic delay of a DFF together with the delay of a switch is already much larger than the hold time of a DFF. Therefore Eq. (1) is easily met. The same result can be obtained for the inner loop within the router's output as shown in Fig. 2(b).
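
As a rough illustration of Eq. (1), the following Python sketch computes the hold slack of the inner loop. The function names and the delay values are hypothetical placeholders chosen for illustration, not figures taken from the TSMC 65 nm cell library, and clock skew is ignored, as assumed above.

```python
def hold_slack(t_clk_q: float, t_delay: float, t_hold: float) -> float:
    """Slack of Eq. (1); a value >= 0 means the hold constraint is met."""
    return t_clk_q + t_delay - t_hold


def hold_time_met(t_clk_q: float, t_delay: float, t_hold: float) -> bool:
    """Check t_clkQ + t_delay >= t_hold (clock skew assumed negligible)."""
    return hold_slack(t_clk_q, t_delay, t_hold) >= 0


# Hypothetical delays in picoseconds, for illustration only.
T_CLK_Q_PS = 80   # clock to Q delay of the DFF (assumed value)
T_DELAY_PS = 30   # delay of the switch (assumed value)
T_HOLD_PS = 20    # hold time of the DFF (assumed value)

if __name__ == "__main__":
    print(hold_slack(T_CLK_Q_PS, T_DELAY_PS, T_HOLD_PS))
    print(hold_time_met(T_CLK_Q_PS, T_DELAY_PS, T_HOLD_PS))
```

With real library numbers substituted for the placeholders, a non-negative slack corresponds to Eq. (1) being satisfied.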

3. Self-reconfigurable BiLink

As discussed in [13], most of the traffic patterns in real applications are strongly uneven. Hence, in the BiLink, although we can have at most 2X data rate compared with the conventional router, we will not achieve such a large improvement in reality since most of the time only one router has flits to send over the channels. Therefore, we further modify the link stage presented in Section 2 to add the flexibility of configuring the direction of each link at run time based on traffic conditions. The proposed self-configurable BiLink, named the Aggressive Bidirectional Link (A-BiLink), is shown in Fig. 3.

In A-BiLink, we define two transmission modes between two neighboring routers, namely the normal mode and the aggressive mode. In the normal mode, the link stage is configured as normal BiLink, and flits from both sides are transported. If only one side needs to access the link stage, the link stage is configured into the aggressive transmission mode, under which 2 flits are transmitted consecutively from one direction in the first and the second half of the clock cycle. In this way, under an uneven traffic pattern where the other link direction can be reversed most of the time, the A-BiLink can transmit at most 4 flits from one router to the other in every clock cycle. Thus, 2X and 4X data rate improvements are achieved when compared to BiNoC and conventional NoC architectures, respectively.

The direction controller is implemented by a finite state machine (FSM) for each pair of routers. At run time, each router will send a notification signal, i.e., channel request, to the FSM. Then the FSM will configure the interconnect into different transmission modes according to the traffic requests from both ends. In the aggressive mode, the router has to pump out data at both the positive and negative edges of the clock. One implementation is to deploy a double edge triggered flip-flop (DET) similar to that described in [17]. However, DET will result in a large area and power overhead due to the large buffer size of the virtual channels. Therefore, in the next section, we present a Pseudo Aggressive Bidirectional Link (PA-BiLink), which implements the idea of A-BiLink with reasonable hardware cost.

Fig. 4. Interconnection between each pair of routers in a 4 × 4 mesh topology.

4. BiLink hardware implementation

The BiLink architecture consists of two main parts, the router implementation and the link module design. A traditional virtual channel (VC) flow-control router consists of 5 pipeline stages: route computation (RC), virtual channel allocation (VA), switch allocation (SA), switch traversal (ST) and link traversal (LT) [7]. In BiLink implementation, most of the stages in the router's pipeline are similar to those of the conventional router and

only SA, ST and LT stages are modified. The connection between two neighboring routers is shown in Fig. 4.

The channel request is sent by a router to the direction control FSM to indicate that there are traffic flows that need to use the current channel. Upon receiving the channel request signal, the control FSM sends the appropriate mode control signal to both the routers and the link module. Based on the decision of mode control, the routers and the link module are configured into the corresponding directions to make full use of the channel bandwidth. In this section, we present the hardware implementation of each datapath component in BiLink.

4.1. Input and output buffer

Since A-BiLink receives or transmits at most four flits every cycle, a double edge triggered flip-flop is required for its input and output stages. However, the double edge triggered flip-flops such as [17] will introduce additional area/power overhead. Since the area of the input FIFO contributes about 60-80% of the total area in a typical VC-based router [13], this additional overhead will increase the area and power considerably. Therefore, instead of using DET, we make a slight modification to the A-BiLink scheme and utilize the conventional single edge triggered flip-flop for the FIFOs in each virtual channel.

First, we assume that a virtual channel based allocation scheme is used for the A-BiLink architecture. In each cycle, only the flits from different virtual channels will be switch allocated and sent out by the output buffer. As a result, although A-BiLink may receive at most four flits in each cycle, they come from different input virtual channels in the current router and are forwarded to different input virtual channels in the downstream receiver. Based on this observation, a pipeline stage is added to store the flits received at the positive edge and the negative edge temporarily into two DFFs at different phases as shown in Fig. 5.

In Fig. 5, we show a new variant of A-BiLink, PA-BiLink. Compared with A-BiLink, it has an extra pipeline stage, which converts the double edge transfer to the single edge transfer (i.e., D2S). As shown in Fig. 5(b), it utilizes two DFFs to sample at both the positive and negative edges. Then, the data is forwarded to the corresponding VCs. Since each flit coming in the same clock cycle goes to a different destination, it is safe to push them into the FIFOs without data clash. A small crossbar is used to connect the input DFFs with the VC buffers. The extra pipeline stage added in PA-BiLink will result in some degradation in the performance. More precisely, the zero load latency for PA-BiLink, following the analysis in [7], is equal to:

$$T_0 = H_{min} \, t_r + t_w + \frac{L}{b} \qquad (2)$$

where $H_{min}$ is the average minimum hop count of the network, $t_r$ is the time delay through a single router, $t_w$ stands for the average time of flight and $L/b$ represents the serialization latency of a packet. In PA-BiLink, only $t_r$ will be different compared to A-BiLink. It is increased from 4 to 5 cycles and therefore the network latency will be increased by $H_{min}$ cycles.

In the D2S stage, a packet that is sampled at the negative clock edge has only half a clock cycle to reach the VC buffer and get sampled. This may pose a timing issue when the clock frequency is high. However, the delay between DFF2 and the VC buffer shown in Fig. 5 is mainly due to the delay of the DFF and the XBAR between the two buffers. The size of the XBAR is 4 × v, where v is the number of VCs. Its delay is that of one 4-to-1 MUX, which is smaller than 50% of that of the 10 × 10 XBAR in the ST stage, and hence it will not be the critical path of the router.

Fig. 5. Pseudo aggressive BiLink.

A similar technique can be used for the output buffer implementation. Rather than using a DET to transmit flits at both edges of the clock, we use two DFFs to implement the output buffer as shown in Fig. 5(c).

4.2. Route computation and virtual channel allocator

The RC and VA modules have the same structures as those of the typical NoC router. In order to avoid the potential deadlock or livelock problem, deterministic XY routing is used in this work. The VC allocator is implemented as a separable allocator which consists of two stages to resolve the input and output request contentions, respectively (shown in Fig. 6(a)). In order to dynamically change the channel direction based on the traffic load, a dedicated channel request counter is deployed for each direction. When a new packet enters the RC stage, it will increase the downlink request counter by 1. When a packet is ready to leave the router, i.e., the tail flit finishes the SA stage, it will decrease the downlink request counter by 1. When the channel request counter is greater than 0, it will send a request signal to the control FSM in Fig. 4, indicating the current channel is requested to be used by the router.

Fig. 6. Allocator structure.

4.3. Switch allocator

In this work, in order to support transferring at most four flits in a clock cycle under the aggressive transmission mode, the switch allocator needs to be modified. Specifically, two different types of requests, which correspond to transferring flits at the positive and negative edges, respectively, are identified in the SA. Normally, the request for positive edge transmission will always be asserted except when the current channel is not in use. On the other hand, the request for negative edge transmission will only be asserted when the current channel is in the aggressive transmission mode. A separable allocator as shown in Fig. 6(b) is used with p arbiters of size v:4 in the input stage and 4p arbiters of size 4p:1 in the output stage.¹

In Fig. 6(b), a v:4 arbiter² can be implemented as four oblivious arbiters with different priorities. In order to obtain a maximal matching, we use round robin arbiters in the output stage.
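
The channel request counter of Section 4.2 and the two request types described above can be sketched in Python as follows. This is a behavioural sketch under stated assumptions rather than the authors' RTL: the class and function names are hypothetical, "channel in use" is modelled as a plain boolean supplied by the caller, and the negative edge request is assumed to also require the channel to be in use.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ChannelRequestCounter:
    """Counts packets that still need the downlink channel (hypothetical model)."""
    count: int = 0

    def packet_enters_rc(self) -> None:
        # A new packet entering the RC stage increments the counter.
        self.count += 1

    def tail_flit_leaves_sa(self) -> None:
        # The tail flit finishing the SA stage decrements the counter.
        if self.count == 0:
            raise ValueError("no outstanding packet to retire")
        self.count -= 1

    @property
    def request(self) -> bool:
        # The request to the control FSM is raised while the counter is > 0.
        return self.count > 0


def sa_requests(channel_in_use: bool, aggressive_tx: bool) -> Tuple[bool, bool]:
    """Return (positive_edge_request, negative_edge_request) for one output."""
    positive = channel_in_use                    # asserted unless the channel is idle
    negative = channel_in_use and aggressive_tx  # only in aggressive transmission
    return positive, negative


if __name__ == "__main__":
    counter = ChannelRequestCounter()
    counter.packet_enters_rc()
    print(counter.request)                                   # request is raised
    print(sa_requests(counter.request, aggressive_tx=True))  # both edges requested
```

Keeping the counter separate from the request logic mirrors the text: the counter only tells the control FSM that the channel is wanted, while the FSM's mode decision determines whether the negative edge request is raised.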

4.4. Crossbar

The dimension of the crossbar is 5 × 5 for the conventional router and 10 × 10 for the BiNoC router. In BiLink, to support at most 4 flits for each input direction, a 20 × 20 crossbar would be required. However, it would cause a large area overhead. In order to reduce the crossbar size, we split the 20 × 20 crossbar into two 10 × 10 crossbars: one is responsible for the positive edge transmission while the other is used for the negative edge transmission, as shown in Fig. 7. Since the complexity of a crossbar is $O(n^2)$, where $n$ is the crossbar dimension, the area of two 10 × 10 crossbars is approximately half of that of a single 20 × 20 crossbar.
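
The area argument can be checked with a small calculation. The Python sketch below uses the crosspoint count $n^2$ as a proxy for crossbar area; the function name is hypothetical, and wiring and control overheads are ignored.

```python
def crosspoints(n: int) -> int:
    """Number of crosspoints in an n x n crossbar, used as an O(n^2) area proxy."""
    return n * n


single_crossbar = crosspoints(20)      # one 20 x 20 crossbar
split_crossbars = 2 * crosspoints(10)  # two 10 x 10 crossbars
area_ratio = split_crossbars / single_crossbar

if __name__ == "__main__":
    print(single_crossbar, split_crossbars, area_ratio)  # 400 200 0.5
```

Under this proxy the split design needs 200 crosspoints instead of 400, which matches the "approximately half" estimate in the text.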

Fig. 7. Crossbar stage for BiLink.


¹ p is the number of ports and v is the number of VCs.
² Typically, when the VC number is 4, no input arbiter is required since at most 4 VCs will be requested.

4.5. Mode controller

In each router, the self-configurable BiLink architecture can transmit flits in three different modes, namely, the normal mode, the aggressive TX mode and the aggressive RX mode. The mode control logic is similar to that used in the BiNoC. However, the flits are transmitted at different clock phases and this complicates the design. In this work, instead of using a distributed control logic inside each router, we propose an explicit control module located in the middle of the link as shown in Fig. 4. The timing diagram of the proposed FSM of the control module is shown in Fig. 8.

The request signals are generated by the RC module of each router and transmitted to the centralized control stage in the first half cycle. Then, the FSM is triggered at the negative clock edge and sends the configuration signals back to each router as well as the link stage in the negative phase of the following cycle. In summary, the mode controller will experience a time delay of two clock cycles.

The state transition diagram of the control FSM is shown in Fig. 9. It consists of five different states, which are categorized into two different types: transmission mode and waiting mode. The transmission mode indicates how each pair of the neighboring routers is configured. The two routers are either in normal

transmission mode, which is shown in Fig. 1(a), (b), or aggressive transmission or receiving mode as shown in Fig. 3.

Fig. 8. Timing diagram for FSM.

Fig. 9. Transition diagram for control FSM. The states are: AggL2R (aggressive transmission, left to right), AggR2L (aggressive transmission, right to left), L2RWait (wait state after left to right aggressive transmission), R2LWait (wait state after right to left aggressive transmission), Normal (normal transmission).

Each edge in the transition diagram is represented by a 2-tuple (L, R), where L and R stand for requests from the left and the right side, respectively. In the beginning, the FSM is initialized in the normal transmission mode. If the left router has a channel request and its right neighbor does not, the mode is directly changed to aggressive transmission, left to right (AggL2R) as shown in Fig. 9.

In addition, extra waiting states are added to make sure there is no flit conflict during the mode transition. In A-BiLink, data conflict may occur because the link traversal stage spans one and a half cycles. If we do not stall the state transition for one more cycle, the data coming from the last cycle will collide with the data transmitted in the current cycle as shown in Fig. 10. In this example, the TX end is changing from aggressive TX mode back to normal mode. In order to avoid this situation, waiting states are introduced. During these states, the aggressive transmitting end will not participate in the switch allocation and will drain the remaining flits transmitted at the negative edge. At the receiving end, the router will wait for an extra cycle to receive the draining flits, i.e., the 2nd flit in Fig. 10.

Fig. 10. Data conflict hazard in state transition.

4.6. Network interface

The network interface (NI) of the BiLink architecture is modified mainly in the flit signal and credit control signal. In BiLink architecture, the data may be received or sent at both positive and negative edges. Therefore, the local processing element (PE) needs to support such features, and it should have a similar structure as shown in Fig. 5. The credit signal between the router and the PE should also be doubled since the maximum number of allocated flits is 4 rather than 2 in BiNoC.

4.7. BiLink for routers with fewer pipeline stages

In some designs, the speculation and lookahead techniques are used to reduce the number of pipeline stages of the NoC router [7]. In general, the depth of pipeline stages will not have a serious impact on the saturation throughput of the NoC because, for a high packet injection rate, the packet latency is mainly due to the contention delay. Therefore, the high throughput of the proposed BiLink architecture will be preserved for routers with a shallow pipeline. For a low injection rate, it seems that the additional D2S stage in PA-BiLink would impact the zero load latency for the router with fewer pipeline stages. Without loss of generality, the timing diagrams of a 2-cycle router and a 4-cycle router under low packet injection rate (i.e., without contention) are shown in Fig. 11 (note that the link transfer stage is considered as a separate pipeline stage in addition to the router pipeline). In this example, the packet length is assumed to be 8 flits. The end-to-end packet latency for the conventional 4-cycle router increases from 12 to 13 (an 8% increase) due to the additional D2S stage in PA-BiLink. On the other hand, the number of cycles for transmitting the 8-flit packet for a 2-cycle router increases from 10 to 11 (a 10% increase) due to the additional D2S stage. The difference in the increase in zero load latency for these two routers is very small (within 2% difference). Therefore, the PA-BiLink performance for the 2-cycle and 4-cycle routers under low packet injection rate is quite similar. This will be demonstrated by simulation results presented in the next section.

Fig. 11. Comparisons of the 4-cycle and 2-cycle router under the low packet injection rate.

Table 1
Property of traffic patterns and real benchmarks.

Traffic property              Traffic patterns and benchmarks
Purely even                   Random, mpeg
Purely uneven                 Bit reversal
Mixture of even and uneven    Butterfly, shuffle, transpose, mms, pip, vopd, dvopd

5. Results evaluation

5.1. Simulation setup

We implemented and compared the proposed BiLink architectures with two existing architectures, i.e., the traditional uni-directional NoC and the bi-directional NoC (BiNoC) [13]. For BiLink architecture, all three variants proposed have been implemented and evaluated. In particular, normal BiLink only supports the normal transmission mode. A-BiLink can configure the channel direction based on the run time traffic conditions while PA-BiLink reduces the hardware cost by sacrificing some of the performance of A-BiLink. The 8 × 8 and 16 × 16 mesh NoC architectures are used for evaluation to demonstrate the performance gain and also the scalability of the proposed architecture. For all architectures, credit-based virtual channel flow control is used. Each input direction has four virtual channels with a buffer depth of 16 flits.

To evaluate the system latency and throughput performance, a cycle accurate NoC simulator is implemented which is an extension of Noxim [3,16,11]. Both synthetic traffic patterns and real application benchmarks have been used to verify the performance. For synthetic traffic, random, bit reversal, butterfly, shuffle and transpose traffic patterns are used. For real applications, five different benchmarks are used. They are multimedia system (mms), MPEG4 decoder (mpeg), picture-in-picture application (pip), video object plane decoder (vopd) and dual video object plane decoder (dvopd) [10,4,18]. The benchmarks are characterized by the corresponding communication task graphs. Similar to [4,10,16], we first use the mapping algorithm described in [10] to map the tasks onto the PEs in the mesh NoC. Based on the mapping results, the communication volumes among the cores are determined from the communication task graph. Each PE will then generate the corresponding traffic with the desired packet injection rate (pir). Of note, the pir is computed based on the communication data volumes as well as the packet injection factor as done in [16].

The synthetic traffic and real benchmarks can be further classified based on whether the traffic pattern is even or not. Table 1 summarizes this property for different benchmarks. Most of the traffic patterns are mixtures of even and uneven traffic load. The traffic property for the real benchmarks depends not only on the traffic attribute, but also on the mapping algorithm used for the task graph. From the mapping results, mpeg is the only real benchmark that has even traffic loads.

To compare the area and power overhead, we implemented all the router architectures in Verilog and synthesized them using Synopsys Design Compiler with TSMC 65 nm technology.

5.2. Performance comparisons

Figs. 12-16 show the cycle accurate simulation results of the packet latency. The simulations explore the impact of different packet lengths (8-flit and 16-flit), different network sizes (8 × 8 mesh and 16 × 16 mesh), and different router architectures (2-cycle router and 4-cycle router) on the packet latency performance. The latency is calculated as the equivalent time period, where $latency_{eqv} = \text{clock cycle} \times T_{clk}$, instead of just the clock cycle count. It is a fairer comparison since the critical path is different for the conventional router and BiLink router as shown in the next sub-section. We compare the latency of the network under different synthetic traffic patterns for 6 architectures including a typical router with double data width (64-bit per flit), represented as Typical (64). As shown in Figs. 12-16, a significant improvement of performance can be observed for all synthetic traffic patterns. Here, we define the performance as the injection rate near the saturation point of the network. More specifically, the network is assumed to reach the saturation point when the latency is approaching 100 equivalent clock cycles. For evenly distributed

traffic patterns such as random, the BiLink will mostly be configured in the normal mode, as the traffic from both ends of the link is uniform. Thus, not much difference in performance is expected between the A-BiLink and normal BiLink architectures. Figs. 12–16 show that A-BiLink and PA-BiLink achieve approximately 80% performance gain over the BiNoC and typical NoC architectures when the packet length is 8, and 90% when the packet length is 16. For uneven traffic patterns like bit reversal, the gains for 8-flit and 16-flit packets are still quite large, over 200%. The proposed architecture therefore works well for a wide range of packet lengths, because the speedup mainly depends on the amount of contention. In addition, Figs. 12–16 show that the performance gain is similar for both 2-cycle and 4-cycle routers, as discussed in Section 4.7. The typical architecture with double link width has a 100% performance gain when the traffic pattern is purely even. However, it cannot perform as well as BiLink when the traffic pattern becomes uneven. In addition, it has a large hardware cost in terms of area and power.

Fig. 12. Random traffic pattern.

Fig. 13. Bit reversal traffic pattern.

When we increase the size of the mesh NoC network, the chance of contention between each pair of routers increases as well. Therefore, from Figs. 12–16 we can observe that, compared with the 8 × 8 mesh topology, the performance gain of the PA-BiLink over the typical NoC router (in terms of the saturation point) under the random traffic pattern increases from 78% to 86% for the 16 × 16 mesh topology. Similar performance gains can be observed for other traffic patterns.

For those patterns which exhibit a strongly uneven traffic distribution (e.g., bit-reversal), some of the links in A-BiLink and PA-BiLink will be configured in the aggressive transmission mode most of the time. From the simulation results, we can observe a 210% performance gain over the typical architecture when the packet length is 8 flits and a 250% performance gain when the packet length is 16 flits. In addition, compared with the BiNoC, which can adapt to the traffic load as well, our proposed structure still has a further performance gain of 57% and 60% when the packet length is 8 and 16 flits, respectively. We also observe that the typical architecture with double data width performs only as well as BiLink and BiNoC. This is mainly due to the fixed channel direction of the typical router. For other traffic patterns, such as butterfly or shuffle, our proposed BiLink architecture also outperforms the typical router of single and double data width as well as the BiNoC, because these patterns contain both even and uneven traffic. In addition, as discussed previously, PA-BiLink has only a small performance degradation in latency compared to A-BiLink.

In Fig. 17, the simulation results using real application benchmarks are presented. Each architecture is simulated under three different injection factors as defined in [16], which correspond to low, medium and high traffic loads, respectively. Specifically, the low injection factor refers to the injection rate at which all 5 architectures work in the low-congestion region (i.e., close to the zero-load latency of the network). The medium workload means that the typical router enters the saturation region (i.e., the delay is larger than hundreds of cycles) while the other architectures still operate in the low-congestion region. The high injection factor refers to a workload at which even BiNoC operates in the saturation region. Under this workload, all the existing NoC architectures become saturated while the three variants of our proposed BiLink still operate in the low-latency region. We first compare the BiLink architecture with conventional NoC and BiNoC by employing a low injection factor. Then, to demonstrate the superiority of the BiLink architecture over BiNoC, we use a medium injection factor. Under this injection

factor, conventional NoC becomes saturated and its latency becomes very high, so we do not include conventional NoC in Fig. 17(b). Finally, in order to further compare the three different BiLink variants, a high injection factor is used, which makes both conventional NoC and BiNoC fall into saturation. Fig. 17 shows the normalized latency for different architectures. From Fig. 17(a), we can observe that the 3 variants of BiLink achieve approximately 20–90% performance gain over the BiNoC and 100–300% gain over the conventional NoC, depending on the traffic distribution of the applications. For mpeg, which is a completely even traffic pattern as listed in Table 1, there is some performance degradation in BiNoC compared with the conventional NoC. This is due to the overhead caused by frequent mode transitions in BiNoC. However, the BiLink architectures mitigate the problem because they can transmit more flits in each cycle. In Fig. 17(b), the BiLink architectures always outperform their BiNoC counterpart by at least 100%. From Fig. 17(c), we can see that in general A-BiLink performs better than normal BiLink at high injection rate. For benchmarks such as mms, pip, vopd and dvopd, the latency of A-BiLink is reduced by 40–70% compared with that of BiLink. No performance gain is obtained for A-BiLink over BiLink under an even traffic pattern such as mpeg. Furthermore, under different injection rates, the latency of PA-BiLink is always higher than that of A-BiLink because of the additional pipeline stage.

Finally, to show the throughput gain of the PA-BiLink architecture, the throughputs of different router architectures at the saturation injection rate under different traffic patterns were also estimated, and the results are summarized in Table 2.

Fig. 14. Butterfly traffic pattern.

Fig. 15. Shuffle traffic pattern.

5.3. Area and power overhead

We synthesize 5 different architectures, i.e., typical, BiNoC, BiLink, PA-BiLink and typical with double data width, using the same TCL script. The basic parameters for each router are:

1) VC depth is 16 flits for each direction.
2) Flit data width is 32 bits.
3) 4 VCs per direction.
4) 5 directions for the router, i.e., north, east, south, west and local.
5) Credit-based flow control scheme.

The detailed area breakdown for each router is shown in Table 3. From Table 3, we can see that PA-BiLink has a 45% area overhead compared with the typical router. However, it still shows a large area reduction compared with the typical architecture with double data width. More importantly, PA-BiLink outperforms the typical router with double data width for most of the traffic patterns. Table 3 also shows that:

1) The main contributor to the area is the input VC buffers. PA-BiLink only adds some additional control logic and DFFs as shown in Fig. 5(b), which is much more scalable than the double data width architecture.
2) The VA stages are the same for each architecture as shown in Fig. 6(a). Thus they do not cause any hardware overhead.
3) The XBAR size in PA-BiLink is doubled compared with BiNoC, since two 10 × 10 XBARs instead of one 10 × 10 XBAR are used.
4) The area of the SA stage increases linearly from BiNoC to PA-BiLink, because the number of allocations changes from 2 to 4 as shown in Fig. 6(b), which duplicates the output stage of the arbiters.

5) The size of the output registers is doubled for PA-BiLink compared with BiNoC, since additional negative-edge-triggered DFFs are needed to transmit the data at each phase of the clock.

Fig. 16. Transpose traffic pattern.

The timing report of different architectures is shown in Table 4. For all the architectures, the critical path of the intra-router stage is SA, which is also reported in [15]. It should be noted that although some works, such as [12], can achieve a much higher frequency, they use full custom design techniques to shorten their critical path. BiNoC and PA-BiLink have the same critical path delay, which is 9% higher than that of the typical router. When this is considered together with the reduction in latency cycles, the gain in overall latency is still very significant even with a 9% increase in cycle time. The saturation point for PA-BiLink surpasses the typical one by at least 80% as shown in Figs. 12–16.

Finally, we need to take the intermediate control logic and the link stage into account. The final equivalent area and power for the BiLink structure are calculated as:

$$\text{Total} = \text{Router} + 5 \times \text{Link Stage} + 2.5 \times \text{Mode Control} \tag{3}$$

Since there are 2 link stages and 1 mode controller between each pair of neighboring routers and each router has 5 neighbors, the equivalent numbers of them associated with each router are 5 and 2.5, respectively. The equivalent area and power are summarized in Tables 5 and 6.

From Table 5, the area overhead of PA-BiLink is 28% compared with BiNoC. This tradeoff of area cost for performance is acceptable because the performance gain of PA-BiLink over BiNoC is large (80–90% under even and uneven traffic patterns). In addition, the area of the router is typically small compared with that of the processing element in a tile (about 6% as reported in [21]). Therefore, a 28% area increase in the router incurs less than 2% area overhead for the tile.

The power consumption of the routers depends on the switching activity and hence the traffic workload of the routers. To obtain an accurate power analysis of the different routers, we construct a testbench using a pair of routers to form a small network. Power consumption is evaluated under three different traffic scenarios: high injection rate, medium injection rate, and low injection rate. For high and medium injection rates, 4 and 2 packets are injected from each router going to the other, respectively. On the other hand, only 1 packet is transmitted from one router to the other under the low injection rate. The switching activities of the modules of the routers are first extracted from the post-synthesis simulation, and then back-annotated for the final power evaluation. Table 6 summarizes the power consumption of different router architectures under different injection rates. From Table 6, an overhead in power consumption of approximately 40% and 31% is observed for PA-BiLink under high and medium injection rates, respectively, when compared with the typical one. However, the overall energy consumption is reduced by 17% and 14%, respectively, because the throughput is twice that of the typical router at high traffic load. We also simulate the power consumption at a lower traffic load. From the power simulation result, with proper clock gating, the power overhead over the typical router is about 18%. This overhead is smaller than that at high and medium traffic loads, since most of the time the extra XBAR and the D2S stage are not needed and clock gating is used to reduce the dynamic power consumption. In conclusion, a better energy efficiency is obtained for the proposed BiLink architecture, since the higher throughput compensates for the power overhead. For instance, the calculated energy-delay product (EDP) of PA-BiLink is reduced by 47.45% compared with the typical router under high packet injection rate. For low injection rate, since the throughput improvement is lower, PA-BiLink has a higher EDP. For the target applications which require high throughput, the injection rate is high and the energy efficiency of PA-BiLink is higher.

6. Related work

Reconfigurable channel link direction: Conventionally, the transfer mode for the interconnect between a pair of routers is classified into two types, unidirectional and bidirectional, as shown in Fig. 18. The direction of each link is fixed in a unidirectional NoC, one for transmitting data and the other for receiving data, as shown in Fig. 18(a). However, the link capacity is not fully utilized if the distribution of traffic from both ends is not uniform and even. Thus in [13,20], a bi-directional router, BiNoC, was introduced, in which the direction of each link can be reconfigured to maximize the bandwidth utilization as shown in Fig. 18(b). A dedicated Channel Direction Control (CDC) algorithm is used in each router to control the direction of the link [13]. An alternative approach to controlling the direction of the link is to use a centralized bandwidth arbiter [6]. A flit-level speedup scheme is introduced to further increase the throughput of BiNoC by allowing 2 flits within 1 packet to be transmitted simultaneously [20]. An application mapping algorithm based on the Quality-of-Service (QoS) of reconfigurable NoC routers is discussed in [2]. However, routers in these works can only transfer as many as 2 flits in each clock cycle. In addition, the performance gain is small compared with the unidirectional architecture under random and even traffic patterns.

Fine-grained reconfigurable interconnect: In [9], the granularity of the interconnection between a pair of routers is subdivided from a flit into phits, the direction of each of which can be reconfigured independently. Due to the uneven distribution of traffic in NoC, the reconfigurable channel direction

Table 2
Comparison of throughput (flits/cycle) for different router architectures.

Traffic/Arch.   Typical   BiNoC   PA-BiLink
Random          0.20      0.21    0.39
Bitreversal     0.07      0.14    0.21
Transpose       0.075     0.138   0.24
Butterfly       0.123     0.243   0.35
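The throughput gains quoted in the text can be cross-checked against Table 2 with a few lines of arithmetic. The sketch below is only an illustration: the dictionary layout and the function names `relative_gain_percent` and `gains_over` are hypothetical, and the only data used are the values listed in Table 2.

```python
# Throughput at saturation in flits/cycle, copied from Table 2.
THROUGHPUT = {
    "Random":      {"Typical": 0.20,  "BiNoC": 0.21,  "PA-BiLink": 0.39},
    "Bitreversal": {"Typical": 0.07,  "BiNoC": 0.14,  "PA-BiLink": 0.21},
    "Transpose":   {"Typical": 0.075, "BiNoC": 0.138, "PA-BiLink": 0.24},
    "Butterfly":   {"Typical": 0.123, "BiNoC": 0.243, "PA-BiLink": 0.35},
}


def relative_gain_percent(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0


def gains_over(baseline_arch: str) -> dict:
    """Gain of PA-BiLink over the given baseline for every traffic pattern."""
    return {
        pattern: relative_gain_percent(row["PA-BiLink"], row[baseline_arch])
        for pattern, row in THROUGHPUT.items()
    }


if __name__ == "__main__":
    for arch in ("Typical", "BiNoC"):
        for pattern, gain in gains_over(arch).items():
            print(f"PA-BiLink vs {arch:8s} {pattern:12s} {gain:6.1f}%")
```

With the rounded table entries, the gain over BiNoC comes out at roughly 44% to 74% for the uneven patterns and the gain over the typical router at about 95% for random traffic, which is in line with the ranges stated in the text.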

Table 3
Area breakdown of different NoC architectures.

Unit: μm²          Typical    BiNoC      BiLink     PA-BiLink   Typical (64)
RC                 180        242        180        292         180
Input VC buffer    145,868    152,997    153,406    167,018     282,912
VA                 20,360     18,535     19,095     18,097      20,057
SA                 5796       15,209     13,797     30,267      5693
XBAR               2987       11,951     11,951     23,902      6083
Output register    4706       6942       6677       12,280      5885
Other              10,188     17,656     14,364     24,170      9471
Total area         190,085    223,532    219,470    276,026     330,281
Normalized value   1.00       1.18       1.15       1.45        1.74
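As a consistency check on Table 3, the per-stage areas can be summed and normalized to the typical router. This is a minimal sketch using only the numbers in the table; the names `AREA`, `total_area`, `normalized_area` and `buffer_share` are illustrative choices, not part of the original design flow.

```python
# Per-stage router area copied from Table 3 (unit as given in the table).
ARCHS = ["Typical", "BiNoC", "BiLink", "PA-BiLink", "Typical (64)"]
AREA = {
    "RC":              [180, 242, 180, 292, 180],
    "Input VC buffer": [145868, 152997, 153406, 167018, 282912],
    "VA":              [20360, 18535, 19095, 18097, 20057],
    "SA":              [5796, 15209, 13797, 30267, 5693],
    "XBAR":            [2987, 11951, 11951, 23902, 6083],
    "Output register": [4706, 6942, 6677, 12280, 5885],
    "Other":           [10188, 17656, 14364, 24170, 9471],
}


def total_area() -> dict:
    """Sum all stages for each architecture."""
    return {
        arch: sum(column[i] for column in AREA.values())
        for i, arch in enumerate(ARCHS)
    }


def normalized_area(baseline: str = "Typical") -> dict:
    """Total area relative to the baseline, rounded to two decimals."""
    totals = total_area()
    return {arch: round(t / totals[baseline], 2) for arch, t in totals.items()}


def buffer_share(arch: str) -> float:
    """Fraction of the total area occupied by the input VC buffers."""
    return AREA["Input VC buffer"][ARCHS.index(arch)] / total_area()[arch]


if __name__ == "__main__":
    print(total_area())
    print(normalized_area())
    print({arch: round(buffer_share(arch), 2) for arch in ARCHS})
```

The sums reproduce the "Total area" and "Normalized value" rows, and the buffer share (about 77% for the typical router and about 61% for PA-BiLink) illustrates why the input VC buffers dominate the area.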
Table 4
Timing report of different NoC architectures.

Unit: ns                   Typical   BiNoC   BiLink   PA-BiLink   Typical (64)
Critical path (SA stage)   0.95      1.04    1.04     1.04        0.94
Normalized value           1.00      1.09    1.09     1.09        0.99


Table 5
Equivalent area of different NoC architectures.

Unit: μm²          Typical    BiNoC      BiLink     PA-BiLink   Typical (64)
Router             190,085    223,532    219,470    276,026     330,281
Link stage         0          0          923        1906        0
Mode controller    0          0          0          115         0
Equivalent area    190,085    223,532    224,085    285,843.5   330,281
Normalized value   1.00       1.18       1.18       1.50        1.74
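The equivalent-area rule in Eq. (3), i.e. router area plus 5 link stages plus 2.5 mode controllers, can be applied directly to the rows of Table 5. The following is a sketch under that reading of the equation; `equivalent_area`, `TABLE5` and `ROUTER_SHARE_OF_TILE` are hypothetical names, and the 6% router share of a tile is the figure the text attributes to [21].

```python
# (router, link stage, mode controller) areas copied from Table 5.
TABLE5 = {
    "Typical":      (190085, 0, 0),
    "BiNoC":        (223532, 0, 0),
    "BiLink":       (219470, 923, 0),
    "PA-BiLink":    (276026, 1906, 115),
    "Typical (64)": (330281, 0, 0),
}

LINK_STAGES_PER_ROUTER = 5.0       # 2 stages per link, shared by 2 routers, 5 links
MODE_CONTROLLERS_PER_ROUTER = 2.5  # 1 controller per link, shared by 2 routers
ROUTER_SHARE_OF_TILE = 0.06        # router area as a fraction of a tile, per [21]


def equivalent_area(router: float, link_stage: float, mode_control: float) -> float:
    """Eq. (3): area attributed to one router including shared link logic."""
    return (router
            + LINK_STAGES_PER_ROUTER * link_stage
            + MODE_CONTROLLERS_PER_ROUTER * mode_control)


if __name__ == "__main__":
    pa = equivalent_area(*TABLE5["PA-BiLink"])
    binoc = equivalent_area(*TABLE5["BiNoC"])
    overhead = pa / binoc - 1.0
    print(f"PA-BiLink equivalent area: {pa}")
    print(f"Overhead over BiNoC: {overhead:.1%}")
    print(f"Tile-level overhead: {overhead * ROUTER_SHARE_OF_TILE:.2%}")
```

This reproduces the 285,843.5 entry for PA-BiLink, an overhead of about 28% over BiNoC, and a tile-level overhead below 2%.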

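Table 6
Equivalent power and energy-delay-product (EDP) of different NoC architectures.

Power (mW)/Normalized EDP   Typical   BiNoC       PA-BiLink   Typical (64)
Low injection rate          17.8/1    20.8/1.27   21.1/1.29   34.5/1.91
Medium injection rate       19.5/1    22.7/1.26   25.7/0.60   37.3/0.68
High injection rate         20.4/1    23.7/1.26   28.9/0.52   39.7/0.59

Fig. 17. Simulation for real applications. Each panel plots the normalized latency for the benchmarks mms, mpeg, pip, vopd and dvopd: (a) low packet injection rate factor (Typical, BiNoC, BiLink, A-BiLink, PA-BiLink); (b) medium packet injection rate factor (BiNoC, BiLink, A-BiLink, PA-BiLink); (c) high packet injection rate factor (BiLink, A-BiLink, PA-BiLink).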
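The power overheads and EDP changes discussed in the text can be read off Table 6 as follows. This is a rough sketch that uses only the rounded table entries; the function names `power_overhead_percent` and `edp_change_percent` are hypothetical, and the results may differ slightly from the percentages in the text, presumably because the table values are rounded.

```python
# Power in mW and normalized EDP, copied from Table 6.
POWER_MW = {
    "Low":    {"Typical": 17.8, "BiNoC": 20.8, "PA-BiLink": 21.1, "Typical (64)": 34.5},
    "Medium": {"Typical": 19.5, "BiNoC": 22.7, "PA-BiLink": 25.7, "Typical (64)": 37.3},
    "High":   {"Typical": 20.4, "BiNoC": 23.7, "PA-BiLink": 28.9, "Typical (64)": 39.7},
}
NORMALIZED_EDP = {
    "Low":    {"Typical": 1.0, "BiNoC": 1.27, "PA-BiLink": 1.29, "Typical (64)": 1.91},
    "Medium": {"Typical": 1.0, "BiNoC": 1.26, "PA-BiLink": 0.60, "Typical (64)": 0.68},
    "High":   {"Typical": 1.0, "BiNoC": 1.26, "PA-BiLink": 0.52, "Typical (64)": 0.59},
}


def power_overhead_percent(rate: str, arch: str, baseline: str = "Typical") -> float:
    """Extra power of `arch` relative to `baseline` at the given injection rate."""
    row = POWER_MW[rate]
    return (row[arch] / row[baseline] - 1.0) * 100.0


def edp_change_percent(rate: str, arch: str) -> float:
    """Change in EDP relative to the typical router; negative means improvement."""
    return (NORMALIZED_EDP[rate][arch] - 1.0) * 100.0


if __name__ == "__main__":
    for rate in ("Low", "Medium", "High"):
        print(f"{rate:6s} power overhead {power_overhead_percent(rate, 'PA-BiLink'):5.1f}%  "
              f"EDP change {edp_change_percent(rate, 'PA-BiLink'):6.1f}%")
```

For PA-BiLink this gives power overheads of roughly 19%, 32% and 42% at low, medium and high injection rates, and an EDP that is about 48% lower than the typical router at high injection rate but about 29% higher at low injection rate.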

together with this fine granularity can achieve a more power and area efficient approach without degrading the performance of NoC. However, the additional serializer as well as the deserializer will cause an area overhead. Also, the communication latency is slightly increased to save the channel resources.

Link utilization improvement using network coding: Network coding has been used in communication systems that employ

intermediate relay stage to improve the effective bandwidth. A similar idea was borrowed and applied in the domain of NoC. In [5], a novel design of the link stage based on network coding has been proposed. The pattern of data transmission, which mimics that of network coding [1], is shown in Fig. 19. More specifically, in Fig. 19, during the transmitting phase, R1 and R2 send the data p and q to the intermediate coding unit, respectively. Then the coding unit encodes the two received data words into a single packet (i.e., it performs the p XOR q operation). Finally, in the receiving phase, R1 and R2 receive the encoded packet (i.e., p XOR q). To decode the data p and q, R1 and R2 XOR the received data with the original data that they sent out to obtain the result. A coding unit is inserted in the middle of the link to act as a relay station similar to that in conventional network coding. However, unlike network coding, the two incoming signals going into the coding unit actually do not have to be coded, because they do not need to be broadcast to the two sides. Moreover, this architecture cannot adapt well to the uneven traffic patterns of real applications.

Fig. 18. Data transfer mode for conventional NoC routers.

Fig. 19. Network Coding in Network-on-Chip.

Compared with the existing works, the proposed PA-BiLink architecture has the highest throughput among all routers, as shown in Table 2. More specifically, under an even traffic pattern such as random traffic, the throughput of PA-BiLink exceeds those of the traditional router and BiNoC by approximately 100%. Furthermore, the throughput of PA-BiLink surpasses that of the BiNoC by 45% to 73% under uneven traffic patterns such as bitreversal and transpose, due to the data transfer on both clock edges.

7. Conclusion

In this paper, we have proposed a new NoC router architecture using a bidirectional link with double data rate. We proposed inserting an intermediate link stage and using phase pipelining to double the data rate. In addition, a reconfigurable structure has been designed to improve the latency as well as the throughput under different traffic conditions by changing the direction of the link at run time. We explored three different BiLink architectures for the tradeoff between performance and hardware overhead. Simulation results show that the proposed architecture can achieve a 250% performance gain over the typical one, and 60% over the BiNoC, under the uneven traffic pattern. For the even traffic pattern, it also outperforms the typical and BiNoC routers with a performance gain higher than 90%. The BiLink works well under different packet lengths, and scales well with larger network sizes as well as different pipelined router architectures. The area overhead of PA-BiLink over BiNoC is around 28%, with a 40% overhead in power under high injection rate. By utilizing clock gating, the power overhead is reduced to 18% under low injection rate. Despite the overhead, the EDP of the BiLink architecture is improved by 47.45% under the high injection rate owing to the high throughput. In summary, BiLink can provide a good performance/area/power tradeoff for high throughput router design.

Acknowledgment

This work is supported by Hong Kong Research Grant Council (RGC) under Grant 619813.

References

[1] R. Ahlswede, Ning Cai, S.-Y.R. Li, R.W. Yeung, Network information flow, IEEE Trans. Inf. Theory 46 (4) (2000) 1204–1216.
[2] M.A. Al Faruque, T. Ebi, J. Henkel, Configurable links for runtime adaptive on-chip communication, in: 2009 Design, Automation Test in Europe Conference Exhibition, DATE '09, April 2009, pp. 256–261.
[3] G. Ascia, V. Catania, M. Palesi, D. Patti, Neighbors-on-path: a new selection strategy for on-chip networks, in: Proceedings of the 2006 IEEE/ACM/IFIP Workshop on Embedded Systems for Real Time Multimedia, 2006, pp. 79–84.
[4] D. Bertozzi, A. Jalabert, S. Murali, R. Tamhankar, S. Stergiou, L. Benini, G. De Micheli, Noc synthesis flow for customized domain specific multiprocessor systems-on-chip, IEEE Trans. Parallel Distrib. Syst. 16 (February (2)) (2005) 113–129.
[5] K.C. Bollapalli, R. Garg, K. Gulati, S.P. Khatri, On-chip bidirectional wiring for heavily pipelined systems using network coding, in: 2009 IEEE International Conference on Computer Design, ICCD 2009, 2009, pp. 131–136.
[6] Myong Hyon Cho, M. Lis, Keun Sup Shim, M. Kinsy, T. Wen, S. Devadas, Oblivious routing in on-chip bandwidth-adaptive networks, in: 2009 18th International Conference on Parallel Architectures and Compilation Techniques, PACT '09, 2009, pp. 181–190.

[7] William James Dally, Brian Patrick Towles, Principles and Practices of Interconnection Networks, Access Online via Elsevier, 2004.
[8] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al., Large scale distributed deep networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1223–1231.
[9] R. Hesse, J. Nicholls, N.E. Jerger, Fine-grained bandwidth adaptivity in networks-on-chip using bidirectional channels, in: 2012 Sixth IEEE/ACM International Symposium on Networks on Chip (NoCS), May 2012, pp. 132–141.
[10] Jingcao Hu, R. Marculescu, Energy- and performance-aware mapping for regular noc architectures, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 24 (April (4)) (2005) 551–562.
[11] Kai-Yuan Jheng, Chih-Hao Chao, Hao-Yu Wang, An-Yeu Wu, Traffic-thermal mutual-coupling co-simulation platform for three-dimensional network-on-chip, in: 2010 International Symposium on VLSI Design Automation and Test (VLSI-DAT), IEEE, Hsin Chu, 2010, pp. 135–138.
[12] Amit Kumar, Partha Kundu, Arvind P. Singh, Li shiuan Peh, Niraj K. Jha, A 4.6 Tbits/s 3.6 GHz single-cycle noc router with a novel switch allocator in 65 nm CMOS, in: ICCD-2007, 2007.
[13] Ying-Cherng Lan, Hsiao-An Lin, Shih-Hsin Lo, Yu Hen Hu, Sao-Jie Chen, A bidirectional noc (binoc) architecture with dynamic self-reconfigurable channel, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 30 (3) (2011) 427–440.
[14] Quoc V Le, Building high-level features using large scale unsupervised learning, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Vancouver, 2013, pp. 8595–8598.
[15] Chrysostomos Nicopoulos, Vijaykrishnan Narayanan, Chita R Das, Network-on-Chip Architectures: A Holistic Design Exploration, vol. 45, Springer, 2009, http://www.springer.com/us/book/9789048130306.
[16] M. Palesi, S. Kumar, V. Catania, Bandwidth-aware routing algorithms for networks-on-chip platforms, IET Comput. Digit. Tech. 3 (September (5)) (2009) 413–429.
[17] M. Pedram, Qing Wu, Xunwei Wu, A new design of double edge triggered flip-flops, in: Proceedings of the 1998 Asia and South Pacific Design Automation Conference, ASP-DAC '98, February 1998, pp. 417–421.
[18] A. Pullini, F. Angiolini, P. Meloni, D. Atienza, S. Murali, L. Raffo, G. De Micheli, L. Benini, Noc design and implementation in 65 nm technology, in: 2007 First International Symposium on Networks-on-Chip, NOCS 2007, May 2007, pp. 273–282.
[19] A. Pullini, F. Angiolini, S. Murali, D. Atienza, G. De Micheli, L. Benini, Bringing nocs to 65 nm, IEEE Micro 27 (September (5)) (2007) 75–85.
[20] Zhiliang Qian, Ying-Fei Teh, Chi-Ying Tsui, A flit-level speedup scheme for network-on-chips using self-reconfigurable bi-directional channels, in: 2012 Design, Automation Test in Europe Conference Exhibition (DATE), March 2012, pp. 1295–1300.
[21] Praveen Salihundam, Shailendra Jain, Tiju Jacob, Shasi Kumar, Vasantha Erraguntla, Yatin Hoskote, Sriram Vangal, Gregory Ruhl, Nitin Borkar, A 2 Tb/s 6 × 4 mesh network for a single-chip cloud computer with dvfs in 45 nm cmos, IEEE J. Solid-State Circuits 46 (4) (2011) 757–766.

Jingyang Zhu received his B.S. degree in School of Microelectronics from the Shanghai Jiao Tong University, Shanghai, China in 2013. Currently, he is working towards his Ph.D. degree in the Hong Kong University of Science and Technology, Hong Kong. His current research interests include high performance Network-on-Chip design, low-power VLSI implementation, and machine learning specific hardware accelerator.

Zhiliang Qian received his B.S. degree in Microelectronics from the Fudan University, Shanghai, China in 2008 and Ph.D. degree in Electronic and Computer Engineering from the Hong Kong University of Science and Technology, Hong Kong in 2014. He is now with the Department of Micro- and Nano- Electronics, Shanghai Jiao Tong University. His research interests include high performance Network-on-Chip design, low power VLSI implementation and embedded system design.

Chi-Ying Tsui received the B.S. degree in electrical engineering from the University of Hong Kong, Hong Kong, and the Ph.D. degree in computer engineering from the University of Southern California, Los Angeles, CA, USA, in 1994. He joined the Department of Electronic and Computer Engineering with the Hong Kong University of Science and Technology, Hong Kong, in 1994, where he is currently a Full Professor. He has authored over 180 refereed publications, and holds 10 U.S. patents on power management, VLSI, and multimedia systems. His current research interests include designing VLSI architectures for low-power multimedia and wireless applications, developing power management circuits and techniques for embedded portable devices, and ultralow-power systems. Dr. Tsui was a recipient of the Best Paper Awards from the IEEE TRANSACTIONS ON VLSI SYSTEMS in 1995, the IEEE International Symposium on Circuits and Systems in 1999, the IEEE/ACM International Symposium on Low Power Electronics and Design in 2007, the IEEE International Symposium on Electronic Design, Test and Application in 2008, and CODES in 2012. He was also a recipient of the Design Awards in the IEEE Asia and South Pacific Design Automation Conference University Design Contest in 2004 and 2006.
