Article info

Article history:
Received 8 January 2015
Received in revised form 22 February 2016
Accepted 22 February 2016
Available online 2 March 2016

Keywords:
Network-on-Chip (NoC)
Bi-directional link
Double data rate

Abstract

This paper presents a novel high performance Network-on-Chip (NoC) router architecture design using a bi-directional link with double data rate (BiLink). Ideally, it can provide as high as 2 times speed-up compared with the conventional NoC router. BiLink utilizes an extra link stage between routers and transmits two flits in one link per cycle using phase pipelining if both routers need to use the current link. To further increase the effective bandwidth, the direction of each link can be configured in every clock cycle to cater for different traffic loads from each side. Therefore, the data rate can be as high as 4 times that of conventional NoC routers under uneven traffic. A centralized mode control scheme is implemented using a finite state machine (FSM) approach. Cycle-accurate simulations are carried out on both synthetic traffic patterns and real application benchmarks. Simulation results show that BiLink can provide as high as 90% and 250% speedup compared with conventional NoC routers for even and uneven traffic, respectively. 2X and 3X gains in throughput are obtained under even and uneven traffic, respectively, when compared with the conventional NoC router for the virtual channel flow control. The BiLink router architecture is synthesized using TSMC 65 nm process technology and it is shown that an area overhead of 28% over state-of-the-art bi-directional NoC is introduced while the critical path is about 9% higher than that of the conventional routers. Despite the overhead in critical path and power consumption, a 47.45% improvement of Energy-Delay-Product (EDP) is achieved by BiLink under high injection rate traffic.

© 2016 Elsevier B.V. All rights reserved.
1. Introduction

Network-on-Chip (NoC) has become a promising approach to solve the communication bottleneck in the modern many-core system-on-chip. With the potential deployment of many-core systems on new applications such as big data, artificial intelligence and deep machine learning, the NoC router is required to transfer a larger amount of communication data among processors. For example, the Google Brain project [14,8] uses 1000 machines to train a deep neural network. Each machine contains 16 cores, and a subset of the neural network is mapped onto each of them [8]. The data bandwidth requirement is high and uneven due to the interleaving of the feed-forward and back-propagation training phases. To address the intensive bandwidth requirement of these applications, a higher throughput NoC router architecture is essential for the next generation of many-core systems.

As the traffic pattern is usually unevenly distributed across the network [13], self-reconfigurable router architectures have been proposed [13,6,20,2] to improve the NoC performance by adapting the direction of the links to the run time traffic conditions. A bi-directional NoC (BiNoC) router architecture was introduced in [13,6] to cater for the uneven traffic patterns. However, most of the existing work on reconfigurable NoC architectures has focused on optimizing the design of the router itself. The optimization of the interconnection between two neighboring routers is rarely touched. On the other hand, in the domain of communications, the introduction of network coding [1] provides an optimized way to use the channel bandwidth and achieves a significant improvement in the system throughput. Borrowing the concept of network coding, in [5], an extra coding unit was inserted between each pair of routers to enable the data transmission from both ends over a single physical channel simultaneously.

* Corresponding author.
E-mail addresses: jzhuak@ust.hk, eetsui@ust.hk (J. Zhu), qianzl@sjtu.edu.cn (Z. Qian).
http://dx.doi.org/10.1016/j.vlsi.2016.02.006
0167-9260/© 2016 Elsevier B.V. All rights reserved.
J. Zhu et al. / INTEGRATION, the VLSI journal 55 (2016) 30–42
In this work, to address the high bandwidth requirement for the next generation NoC architecture, we propose BiLink, a new NoC router architecture using bidirectional double data rate links. More specifically, in BiLink, a customized link stage is designed to transmit two flits over one physical channel in each cycle in a phase pipelined fashion. To further increase the effective bandwidth, the direction of each link can be configured to cater for the uneven distribution of the traffic loads. A centralized controller is implemented using an FSM to dynamically determine the operation mode to support BiLink transmission. In this way, data are transmitted on both clock edges to maximize the potential throughput of the NoC router, leading to a better solution for future data-intensive applications.

Cycle accurate simulations were executed to verify the performance improvement of the proposed BiLink architecture. Simulation results show that the proposed BiLink architecture can achieve 90% and 60% improvements in the saturation injection rate compared to Bi-directional (BiNoC) router architectures [13] for even and uneven traffic distributions, respectively. Furthermore, BiLink also has a 250% improvement over the conventional NoC router for the uneven traffic distribution. In summary, this work makes the following contributions:

• We combine the idea of self-reconfigurable router structures with a double data rate link for NoC and achieve a significant performance improvement through this joint optimization.
• We implement the proposed BiLink structure to verify the performance as well as the hardware overhead.
• We propose three variants of BiLink architecture and perform a thorough analysis on the performance and implementation tradeoff of these structures.

The remainder of the paper is organized as follows. In Section 2, we discuss the basic idea of the normal double data rate bidirectional link (BiLink) and analyze its timing issue. In Section 3, a self-reconfigurable direction control scheme, namely aggressive bidirectional link (A-BiLink), is proposed. In Section 4, the detailed hardware implementation of BiLink and A-BiLink is addressed. In addition, a new variant of A-BiLink which is more suitable for hardware implementation is presented in this section. Simulation and hardware synthesis results are shown in Section 5 and the related work is discussed in Section 6. Finally, Section 7 concludes this work.

2. Bidirectional link stage

To understand the basic principle behind the bidirectional link (BiLink), we will first discuss the data flow in BiLink. Then the related timing issues will be analyzed to show that BiLink can work properly under different timing constraints.

2.1. Motivation for exploring BiLink

In both uni-directional and bi-directional NoCs, the data transfer occupies the entire clock cycle. In this work, we propose to further improve the throughput by allowing the transmission of two flits over the channel in every cycle. More specifically, we use both phases of the clock to transmit two different flits. In the first phase of the clock cycle, the routers at both ends of the link send the flits to the link module in the middle of the link (shown in Fig. 1(a)). Then, in the next phase, the link module sends the two flits to the corresponding destination routers (shown in Fig. 1(b)). Compared to the conventional transfer mode, the transfer data rate is doubled using the proposed BiLink scheme and it can transfer up to four flits between routers R1 and R2 in every clock cycle.

The main function of the intermediate link module is to isolate the flits from both routers at the two ends. For the link stage, two D Flip-Flops (DFFs) are required to store the data received from each side during the first half cycle. Moreover, two switches are used to control the direction of the data flow, in order to avoid overwriting the data originally stored in the registers. Fig. 2(a) shows the hardware implementation of the link stage. When the clock phase is high, the switches S1 and S2 are open and the flits transmitted from both sides will be stored into these two DFFs in the link stage, respectively. Then, at the second phase of the clock cycle, S1 and S2 will be closed. The two DFFs will transmit the stored flits to the corresponding destination. For the router side, the output stage of each router has a similar structure to synchronize with the link stage. It sends flits in the first half clock cycle and receives flits in the second half as shown in Fig. 2(b).

[Fig. 2. (a) Link stage; (b) router output stage. Labels recoverable from the figure: From XBAR, To Input VC, D/Q flip-flops, switches S1, S2 and S3, IN/OUT ports.]

2.2. Analysis of the timing constraints for BiLink

With the insertion of the link module, we need to analyze the impact on the timing of the overall system under reasonable clock skew and jitter assumptions.

First we investigate whether the insertion of the link stage will affect the clock frequency of the system. The datapath of a router consists of 2 parts, the inner pipeline stage and the link transfer stage. As will be shown in the simulation results in Section 5, the critical path of the inner pipeline stage of the router for the BiLink architecture is similar to that of the BiNoC. For the link transfer delay, the insertion of the link module will not cause extra delay. If the long wire delay of the link transfer is the critical path of the design, adding a link module in the middle breaks the long wire into two halves. Therefore the total delay of driving the long wire will be decreased instead and the overall critical path, which includes the clock to Q delay and the setup time of the DFF inserted in the link module, will be shortened.

We designed and laid out the link stage and the router's output stage in TSMC 65 nm process, and used it to drive different lengths of wires. We simulated the performance of the overall link transfer using HSPICE under a clock skew of 10% of the clock period [19]. The results show that the wire with a link stage is always better in terms of critical path performance than that without a link stage.

The hold time constraint of the link stage also has to be satisfied. The hold time of the DFF in the link module due to the datapath through the wire is easily satisfied because of the large delay of the long wire even under 10% positive clock skew. For the hold time requirement due to the inner loop within the link module, the following condition must be met:

$$t_{clkQ} + t_{delay} > t_{hold} \qquad (1)$$

where $t_{clkQ}$, $t_{hold}$ and $t_{delay}$ are the delay time (clock to Q) and hold time of the DFF, and the delay of the switch, respectively. Since the two DFFs are placed close to each other, we can assume the clock skew is negligible. From the cell library information of the TSMC 65 nm technology, the intrinsic delay of a DFF together with the delay of a switch is already much larger than the hold time of a DFF. Therefore Eq. (1) is easily met. The same result can be obtained for the inner loop within the router's output as shown in Fig. 2(b).
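
To make the two-phase data flow of Section 2.1 concrete, the following is a minimal behavioural sketch in Python. It is not the authors' RTL: the names `LinkStage`, `phase1_capture`, `phase2_deliver` and `run` are hypothetical, and the sketch assumes the normal BiLink mode in which each router offers at most one flit per cycle.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LinkStage:
    """Hypothetical behavioural model of the link module between R1 and R2."""
    reg_from_r1: Optional[str] = None  # DFF holding the flit sent by R1
    reg_from_r2: Optional[str] = None  # DFF holding the flit sent by R2

    def phase1_capture(self, flit_r1: Optional[str], flit_r2: Optional[str]) -> None:
        # First clock phase: both routers drive the link and the link stage
        # latches one flit from each side into its own register.
        self.reg_from_r1 = flit_r1
        self.reg_from_r2 = flit_r2

    def phase2_deliver(self) -> Tuple[Optional[str], Optional[str]]:
        # Second clock phase: each stored flit is forwarded to the opposite router.
        to_r1, to_r2 = self.reg_from_r2, self.reg_from_r1
        self.reg_from_r1 = self.reg_from_r2 = None
        return to_r1, to_r2


def run(r1_flits: List[str], r2_flits: List[str]):
    """Send two flit streams over one link; return (R1 received, R2 received, cycles)."""
    link = LinkStage()
    r1_rx: List[str] = []
    r2_rx: List[str] = []
    cycles = max(len(r1_flits), len(r2_flits))
    for c in range(cycles):
        f1 = r1_flits[c] if c < len(r1_flits) else None
        f2 = r2_flits[c] if c < len(r2_flits) else None
        link.phase1_capture(f1, f2)
        to_r1, to_r2 = link.phase2_deliver()
        if to_r1 is not None:
            r1_rx.append(to_r1)
        if to_r2 is not None:
            r2_rx.append(to_r2)
    return r1_rx, r2_rx, cycles


if __name__ == "__main__":
    print(run(["a0", "a1"], ["b0", "b1"]))
```

In this sketch four flits cross one physical link in two cycles, which corresponds to the two-flits-per-cycle rate of the normal mode. The aggressive modes of Section 3 are not modelled.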
3. Self-reconfigurable BiLink
4.4. Crossbar
Fig. 11. Comparisons of the 4-cycle and 2-cycle router under the low packet injection rate.
traffic patterns such as random, the BiLink will mostly be configured in the normal mode as the traffic from both ends of the link is uniform. Thus it is expected that there is not much difference in the performance between the A-BiLink and normal BiLink architecture. In Figs. 12–16, it is shown that A-BiLink and PA-BiLink can achieve approximately 80% performance gain against the BiNoC and typical NoC architectures when the packet length is 8, and 90% performance gain when the packet length is 16. For the uneven traffic patterns like bit reversal, the gains for 8-flit and 16-flit packets are still quite large, over 200%. As a result, the proposed architecture works well for a wide range of packet lengths because the speedup mainly depends on the amount of contention. In addition, it can be seen in Figs. 12–16 that the performance gain is similar for both 2-cycle and 4-cycle routers as discussed in Section 4.7. The typical architecture with double link width will have a 100% performance gain when the traffic pattern is purely even. However, we can see that it cannot perform as well as BiLink when the traffic pattern becomes uneven. In addition, it will have a large hardware cost in terms of area and power.

When we increase the size of the mesh NoC network, the chance of contention between each pair of routers will increase as well. Therefore, from Figs. 12–16 we can observe that compared with the 8 × 8 mesh topology, the performance gain of the PA-BiLink over the typical NoC router (in terms of the saturation point) under the random traffic pattern increases from 78% to 86% for the 16 × 16 mesh topology. A similar performance gain can be observed for other traffic patterns.

For those patterns which exhibit strongly uneven traffic distribution (e.g., bit-reversal), some of the links in A-BiLink and PA-BiLink will be configured in the aggressive transmission mode most of the time. From the simulation results, we can observe that a 210% performance gain over the typical architecture when the packet length is 8 flits and a 250% performance gain when the packet length is 16 flits are obtained. In addition, compared with the BiNoC, which can adapt to the traffic load as well, our proposed structure can still have a further performance gain of 57% and 60% when the packet length is 8 and 16 flits, respectively. We also observe that the typical architecture of double data width can only perform as well as BiLink and BiNoC. This is mainly due to the fixed channel direction of the typical router. For other traffic patterns, such as butterfly or shuffle, our proposed BiLink architecture will also outperform the typical router of single and double data width as well as the BiNoC, because they have both even and uneven traffic. In addition, as we discussed previously, PA-BiLink only has a small performance degradation in latency compared to A-BiLink.

In Fig. 17, the simulation results using real application benchmarks are presented. In the simulations, each architecture is simulated under three different injection factors as defined in [16], which correspond to low, medium and high traffic loads, respectively. Specifically, the low injection factor refers to the injection rate that makes all 5 architectures work in the low-congestion region (i.e., close to the zero load latency of the network). The medium workload means that the typical router will enter the saturation region (i.e., the delay is larger than hundreds of cycles) while the other architectures still operate in the low-congestion region. The high injection factor refers to the workload under which even BiNoC is operating in the saturation region. Under this workload, all the existing NoC architectures will become saturated while the three variants of our proposed BiLink are still operating in the low-latency region. We first compare the BiLink architecture with conventional NoC and BiNoC by employing a low injection factor. Then, to demonstrate the superiority of BiLink architecture over BiNoC, we use a medium injection factor. Under this injection
factor, conventional NoC becomes saturated and the latency becomes very high, so we do not include conventional NoC in Fig. 17(b). Finally, in order to further compare the three different BiLink variants, a high injection factor is used which will make both conventional NoC and BiNoC fall into saturation. Fig. 17 shows the normalized latency for different architectures. From Fig. 17(a), we can observe that the 3 variants of BiLink achieve approximately 20–90% performance gain over the BiNoC and 100–300% gain over the conventional NoC depending on the traffic distribution of the applications. For mpeg, which is a completely even traffic pattern as listed in Table 1, we can see that there is some performance degradation in BiNoC compared with the conventional NoC. It is due to the overhead caused by frequent mode transitions in BiNoC. However, BiLink architectures mitigate the problem because they can transmit more flits in each cycle. In Fig. 17(b), the BiLink architectures always outperform their BiNoC counterpart by at least 100%. From Fig. 17(c), we can see that in general A-BiLink performs better than normal BiLink for high injection rate. For benchmarks such as mms, pip, vopd and dvopd, the latency of A-BiLink is reduced by 40–70% compared with that of BiLink. No performance gain is obtained in the even traffic pattern such as mpeg for A-BiLink compared with BiLink. Furthermore, under different injection rates, the latency of PA-BiLink is always higher than that of A-BiLink because of the additional pipeline stage.

Finally, to show the throughput gain of the PA-BiLink architecture, the throughputs of different router architectures at the saturation injection rate under different traffic patterns were also estimated and the results are summarized in Table 2.

5.3. Area and power overhead

We synthesize 5 different architectures, i.e., typical, BiNoC, BiLink, PA-BiLink and typical with double data width, using the same TCL script. The basic parameters for each router are:

1) VC depth is 16 flits for each direction.
2) Flit data width is 32 bits.
3) 4 VCs per direction.
4) 5 directions for the router, i.e., north, east, south, west and local.
5) Credit based flow control scheme.

The detailed area breakdown for each router is shown in Table 3. From Table 3, we can see that PA-BiLink has a 45% area overhead compared with the typical router. However, it still shows a large area reduction compared with the typical architecture with double data width. More importantly, PA-BiLink outperforms the typical router with double data width for most of the traffic patterns. From Table 3, it is also shown that:

1) The main contribution to the area breakdown is the area of the input VC buffers. PA-BiLink only adds some additional control logic and DFFs as shown in Fig. 5(b), which is much more scalable than the double data width architecture.
2) VA stages are the same for each architecture as shown in Fig. 6(a). Thus they will not cause any hardware overhead.
3) XBAR size in PA-BiLink has been doubled compared with BiNoC, since two 10 × 10 XBARs instead of one 10 × 10 XBAR are utilized here.
4) The area of the SA stage has increased linearly from BiNoC to PA-BiLink, because the number of allocations has been changed from 2 to 4 as shown in Fig. 6(b), which duplicates the output stage of the arbiters.
5) The size of the output registers has been doubled for PA-BiLink compared with BiNoC since we need additional negative edge triggered DFFs to transmit the data at each phase of the clock.

The timing report of different architectures is shown in Table 4. For all the architectures, the critical path of the intra-router stage is SA, which is also reported in [15]. It should be noted that although some works, such as [12], can achieve a much higher frequency, they use full custom design techniques to shorten their critical path. It can be seen that BiNoC and PA-BiLink have the same critical path delay, which is 9% higher than that of the typical router. When this is considered together with the reduction in latency cycles, even with a 9% increase in cycle time, the gain in overall latency is still very significant. The saturation point for PA-BiLink surpasses the typical one by at least 80% as shown in Figs. 12–16.

Finally, we need to take the intermediate control logic and the link stage into account. The final equivalent area and power for the BiLink structure is calculated as:

$$\text{Total} = \text{Router} + 5 \times \text{Link Stage} + 2.5 \times \text{Mode Control} \qquad (3)$$

Since there are 2 link stages and 1 mode controller between each pair of neighboring routers and each router has 5 neighbors, the equivalent number of them associated with each router will then be 5 and 2.5, respectively. The equivalent area and power are summarized in Tables 5 and 6.

From Table 5, the area overhead of PA-BiLink is 28% compared with BiNoC. This tradeoff of area cost for performance is acceptable because the performance gain for PA-BiLink over BiNoC is large (80–90% under even and uneven traffic patterns). In addition, the area of the router is typically small compared with that of the processing element in a tile (about 6% as reported in [21]). Therefore, a 28% area increase in the router only incurs less than 2% area overhead for the tile.

The power consumption of the routers depends on the switching activity and hence the traffic workload of the routers. To have an accurate power analysis on different routers, we construct a testbench using a pair of routers to form a small network. Power consumption is evaluated under three different traffic scenarios: high injection rate, medium injection rate, and low injection rate. For high and medium injection rates, 4 and 2 packets are injected from each router going to the other, respectively. On the other hand, only 1 packet will be transmitted from one router to the other under the low injection rate. The switching activities of the modules of the routers are first extracted from the post-synthesis simulation, and then back-annotated for the final power evaluation. Table 6 summarizes the power consumption of different router architectures under different injection rates. From Table 6, a 40% and 31% overhead of power consumption for PA-BiLink is observed under high and medium injection rate, respectively, when compared with the typical one. However, the overall energy consumption has been reduced by 17% and 14%, respectively, because the throughput gain is twice that of the typical router at high traffic load. We also simulate the power consumption at a lower traffic load. From the power simulation result, with proper clock gating, the overhead of the power consumption over the typical router is about 18%. The overhead is smaller than that of the high and medium traffic load since most of the time the extra XBAR and the D2S stage are not needed and clock-gating is used to reduce the dynamic power consumption. In conclusion, a better energy efficiency is obtained for the proposed BiLink architecture since a higher throughput compensates for the power overhead. For instance, the calculated energy-delay-product (EDP) of PA-BiLink is reduced by 47.45% when compared with the typical router under high packet injection rate. For low injection rate, since the throughput improvement is lower, PA-BiLink has a higher EDP. For target applications which require high throughput, the injection rate is high and the energy efficiency of PA-BiLink is higher.

6. Related work

Reconfigurable channel link direction: Conventionally, the transfer mode for the interconnect between a pair of routers is classified into two different types: unidirectional and bidirectional as shown in Fig. 18. The direction of each link is fixed in unidirectional NoC, one for transmitting data and the other for receiving data as shown in Fig. 18(a). However, the link capacity is not fully utilized if the distributions of traffic from both ends are not uniform and even. Thus in [13,20], a bi-directional router, BiNoC, was introduced where the direction of each link can be reconfigured to maximize the bandwidth utilization as shown in Fig. 18(b). A dedicated Channel Direction Control (CDC) algorithm is used in each router to control the direction of the link [13]. An alternative approach to control the direction of the link is using a centralized bandwidth arbiter [6]. The flit-level speedup scheme is introduced to further increase the throughput of BiNoC by allowing 2 flits within 1 packet to be transmitted simultaneously [20]. An application mapping algorithm based on Quality-of-Service (QoS) of reconfigurable NoC routers is discussed in [2]. However, routers in these works can only transfer as many as 2 flits in each clock cycle. In addition, the performance gain will be small compared with the unidirectional architecture under the random and even traffic patterns.

Fine-grained reconfigurable interconnect: In [9], the granularity of the interconnection between a pair of routers can be sub-divided from the dimension of a flit into a phit of which the direction can be reconfigured independently. Due to the uneven distribution of traffic in NoC, the reconfigurable channel direction
[Fig. 17. Latency for real application (low injection rate). Y-axis: normalized latency; series: Typical, BiNoC, BiLink, A-BiLink, PA-BiLink; benchmarks: mms, mpeg, pip, vopd, dvopd.]

Table 2
Comparison of throughput (flits/cycle) for different router architectures.

Traffic/Arch.    Typical    BiNoC    PA-BiLink
Random           0.20       0.21     0.39
Bitreversal      0.07       0.14     0.21
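
The relative throughput gains implied by the Random and Bitreversal rows of Table 2 can be recomputed with a few lines of Python. This is only an arithmetic sketch: the dictionary `TABLE2` and the helper `gain_percent` are hypothetical names, and only the two rows that survive in this copy of the table are included.

```python
# Saturation throughput in flits/cycle, copied from the Random and
# Bitreversal rows of Table 2.
TABLE2 = {
    "Random":      {"Typical": 0.20, "BiNoC": 0.21, "PA-BiLink": 0.39},
    "Bitreversal": {"Typical": 0.07, "BiNoC": 0.14, "PA-BiLink": 0.21},
}


def gain_percent(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


for traffic, row in TABLE2.items():
    vs_typical = gain_percent(row["PA-BiLink"], row["Typical"])
    vs_binoc = gain_percent(row["PA-BiLink"], row["BiNoC"])
    print(f"{traffic}: PA-BiLink vs Typical {vs_typical:.0f}%, vs BiNoC {vs_binoc:.0f}%")
```

For the random pattern this gives gains of roughly 95% over the typical router and 86% over BiNoC. For bitreversal it gives 200% over the typical router (a 3X throughput) and 50% over BiNoC.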
Table 3
Area breakdown of different NoC architectures.
Table 4
Timing report of different NoC architectures.

Unit: ns                    Typical    BiNoC    BiLink    PA-BiLink    Typical (64)
Critical path (SA stage)    0.95       1.04     1.04      1.04         0.94
Normalized value            1.00       1.09     1.09      1.09         0.99
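
As a quick check of the normalized row of Table 4, the sketch below divides each critical-path delay by that of the typical router. The names `CRITICAL_PATH_NS`, `normalize` and `max_frequency_ghz` are hypothetical, and the frequency figure assumes that the SA-stage critical path alone sets the clock period, which is a simplification.

```python
# Critical-path delays in ns, copied from Table 4.
CRITICAL_PATH_NS = {
    "Typical": 0.95,
    "BiNoC": 1.04,
    "BiLink": 1.04,
    "PA-BiLink": 1.04,
    "Typical (64)": 0.94,
}


def normalize(delays: dict, baseline: str = "Typical") -> dict:
    """Divide every delay by the baseline architecture's delay."""
    base = delays[baseline]
    return {arch: d / base for arch, d in delays.items()}


def max_frequency_ghz(delay_ns: float) -> float:
    """Clock frequency if the critical path equals the clock period (assumption)."""
    return 1.0 / delay_ns


normalized = normalize(CRITICAL_PATH_NS)
for arch, delay in CRITICAL_PATH_NS.items():
    print(f"{arch:13s} {delay:.2f} ns  x{normalized[arch]:.2f}  "
          f"{max_frequency_ghz(delay):.3f} GHz")
```

The ratio 1.04/0.95 rounds to 1.09, which matches the 9% critical-path increase quoted for BiNoC, BiLink and PA-BiLink.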
Table 6
Equivalent power and energy-delay-product (EDP) of different NoC architectures.
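
The equivalent-cost bookkeeping behind the equivalent area and power figures (Total = Router + 5 × Link Stage + 2.5 × Mode Control) and the tile-level overhead estimate can be sketched as follows. The function names are hypothetical and the component values in the example call are placeholders chosen for illustration, not numbers from the paper; only the factors 5 and 2.5, the 28% router overhead and the 6% router share of a tile come from the text.

```python
def equivalent_total(router: float, link_stage: float, mode_control: float,
                     neighbors: int = 5) -> float:
    """Per-router cost including its share of link stages and mode controllers.

    Each link carries 2 link stages and 1 mode controller and is shared by
    the two routers at its ends, so a router with `neighbors` links is
    charged 2 * neighbors / 2 link stages and neighbors / 2 mode controllers.
    """
    link_stages_per_router = 2 * neighbors / 2      # 5 for neighbors = 5
    mode_controls_per_router = 1 * neighbors / 2    # 2.5 for neighbors = 5
    return (router
            + link_stages_per_router * link_stage
            + mode_controls_per_router * mode_control)


def tile_overhead(router_increase: float, router_share_of_tile: float) -> float:
    """Fractional tile area increase caused by a fractional router area increase."""
    return router_increase * router_share_of_tile


if __name__ == "__main__":
    # Placeholder component costs in arbitrary units (assumed, not from the paper).
    print(equivalent_total(router=100.0, link_stage=2.0, mode_control=4.0))
    # 28% router overhead with the router taking about 6% of the tile.
    print(f"{100 * tile_overhead(0.28, 0.06):.2f}% of tile area")
```

With a 28% router increase and a 6% router share, the tile-level overhead comes out at about 1.7%, consistent with the "less than 2%" figure in the text.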
together with this fine granularity can achieve a more power and area efficient approach without degrading the performance of NoC. However, the additional serializer as well as the deserializer will cause an area overhead. Also, the communication latency is slightly increased to save the channel resources.

Link utilization improvement using network coding: Network coding has been used in communication systems that employ
intermediate relay stage to improve the effective bandwidth. A similar idea was borrowed and applied in the domain of NoC. In [5], a novel design of the link stage based on network coding has been proposed. The pattern of data transmission that mimics the way in network coding [1] is shown in Fig. 19. More specifically, in Fig. 19, during the transmitting phase, R1 and R2 will send the data p and q to the intermediate coding unit, respectively. Then the coding unit will encode these two received data into a single packet (i.e., performing the p XOR q operation). Finally, at the receiving phase, R1 and R2 will receive the encoded packet (i.e., p XOR q). To decode the data p and q, R1 and R2 XOR the received data with the original data that they sent out to obtain the result. A coding unit is inserted in the middle of the link to act as a relay station similar to that in the conventional network coding. However, unlike network coding, the two incoming signals going into the coding unit actually do not have to be coded because they do not need to be broadcast to the two sides. Moreover, this architecture cannot adapt well to the uneven traffic patterns of real applications.

Compared with the existing works, the proposed PA-BiLink architecture has the highest throughput performance among all routers as shown in Table 2. More specifically, under the even traffic pattern such as the random traffic, the throughput for PA-BiLink outperforms those of the traditional router and BiNoC by approximately 100%. Furthermore, the throughput of PA-BiLink surpasses that of the BiNoC by 45% to 73% under the uneven traffic patterns such as bitreversal and transpose due to the data transfer on both clock edges.

7. Conclusion

In this paper, we have proposed a new NoC router architecture using bidirectional link with double data rate. We proposed to insert an intermediate link stage and used phase pipelining to double the data rate. In addition, a reconfigurable structure has been designed to improve the latency as well as the throughput under different traffic conditions by changing the direction of the link at run time. We explored three different BiLink architectures for performance and hardware overhead tradeoff. Simulation results show that the proposed architecture can achieve a 250% performance gain over the typical one, and 60% over the BiNoC under the uneven traffic pattern. For the even traffic pattern, it also outperforms the typical and BiNoC routers with performance gain higher than 90%. The BiLink works well under different packet lengths, and scales well with the larger network size as well as different pipelined router architectures. The area overhead of PA-BiLink over BiNoC is around 28% with a 40% overhead in power under high injection rate. By utilizing clock gating, the power overhead is reduced to 18% under low injection rate. Despite the overhead, the EDP of BiLink architecture is improved by 47.45% under the high injection rate owing to the high throughput. In summary, BiLink can provide a good performance/area/power tradeoff for high throughput router design.

Acknowledgment

This work is supported by Hong Kong Research Grant Council (RGC) under Grant 619813.

References

[1] R. Ahlswede, Ning Cai, S.-Y.R. Li, R.W. Yeung, Network information flow, IEEE Trans. Inf. Theory 46 (4) (2000) 1204–1216.
[2] M.A. Al Faruque, T. Ebi, J. Henkel, Configurable links for runtime adaptive on-chip communication, in: 2009 Design, Automation Test in Europe Conference Exhibition, DATE '09, April 2009, pp. 256–261.
[3] G. Ascia, V. Catania, M. Palesi, D. Patti, Neighbors-on-path: a new selection strategy for on-chip networks, in: Proceedings of the 2006 IEEE/ACM/IFIP Workshop on Embedded Systems for Real Time Multimedia, 2006, pp. 79–84.
[4] D. Bertozzi, A. Jalabert, S. Murali, R. Tamhankar, S. Stergiou, L. Benini, G. De Micheli, Noc synthesis flow for customized domain specific multiprocessor systems-on-chip, IEEE Trans. Parallel Distrib. Syst. 16 (February (2)) (2005) 113–129.
[5] K.C. Bollapalli, R. Garg, K. Gulati, S.P. Khatri, On-chip bidirectional wiring for heavily pipelined systems using network coding, in: 2009 IEEE International Conference on Computer Design, ICCD 2009, 2009, pp. 131–136.
[6] Myong Hyon Cho, M. Lis, Keun Sup Shim, M. Kinsy, T. Wen, S. Devadas, Oblivious routing in on-chip bandwidth-adaptive networks, in: 2009 18th International Conference on Parallel Architectures and Compilation Techniques, PACT '09, 2009, pp. 181–190.
[7] William James Dally, Brian Patrick Towles, Principles and Practices of Interconnection Networks, Access Online via Elsevier, 2004.
[8] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al., Large scale distributed deep networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1223–1231.
[9] R. Hesse, J. Nicholls, N.E. Jerger, Fine-grained bandwidth adaptivity in networks-on-chip using bidirectional channels, in: 2012 Sixth IEEE/ACM International Symposium on Networks on Chip (NoCS), May 2012, pp. 132–141.
[10] Hu. Jingcao, R. Marculescu, Energy- and performance-aware mapping for regular noc architectures, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 24 (April (4)) (2005) 551–562.
[11] Kai-Yuan Jheng, Chih-Hao Chao, Hao-Yu Wang, An-Yeu Wu, Traffic-thermal mutual-coupling co-simulation platform for three-dimensional network-on-chip, in: 2010 International Symposium on VLSI Design Automation and Test (VLSI-DAT), IEEE, Hsin Chu, 2010, pp. 135–138.
[12] Amit Kumar, Partha Kundu, Arvind P. Singh, Li shiuan Peh, Niraj K. Jha, A 4.6 Tbits/s 3.6 GHz single-cycle noc router with a novel switch allocator in 65 nm CMOS, in: ICCD-2007, 2007.
[13] Ying-Cherng Lan, Hsiao-An Lin, Shih-Hsin Lo, Yu Hen Hu, Sao-Jie Chen, A bidirectional noc (binoc) architecture with dynamic self-reconfigurable channel, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 30 (3) (2011) 427–440.
[14] Quoc V Le, Building high-level features using large scale unsupervised learning, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Vancouver, 2013, pp. 8595–8598.
[15] Chrysostomos Nicopoulos, Vijaykrishnan Narayanan, Chita R Das, Network-on-Chip Architectures: A Holistic Design Exploration, vol. 45, Springer, 2009, http://www.springer.com/us/book/9789048130306.
[16] M. Palesi, S. Kumar, V. Catania, Bandwidth-aware routing algorithms for networks-on-chip platforms, IET Comput. Digit. Tech. 3 (September (5)) (2009) 413–429.
[17] M. Pedram, Qing Wu, Xunwei Wu, A new design of double edge triggered flip-flops, in: Proceedings of the 1998 Asia and South Pacific Design Automation Conference, ASP-DAC '98, February 1998, pp. 417–421.
[18] A. Pullini, F. Angiolini, P. Meloni, D. Atienza, S. Murali, L. Raffo, G. De Micheli, L. Benini, Noc design and implementation in 65 nm technology, in: 2007 First International Symposium on Networks-on-Chip, NOCS 2007, May 2007, pp. 273–282.
[19] A. Pullini, F. Angiolini, S. Murali, D. Atienza, G. De Micheli, L. Benini, Bringing nocs to 65 nm, IEEE Micro 27 (September (5)) (2007) 75–85.
[20] Zhiliang Qian, Ying-Fei Teh, Chi-Ying Tsui, A flit-level speedup scheme for network-on-chips using self-reconfigurable bi-directional channels, in: 2012 Design, Automation Test in Europe Conference Exhibition (DATE), March 2012, pp. 1295–1300.
[21] Praveen Salihundam, Shailendra Jain, Tiju Jacob, Shasi Kumar, Vasantha Erraguntla, Yatin Hoskote, Sriram Vangal, Gregory Ruhl, Nitin Borkar, A 2 Tb/s 6 × 4 mesh network for a single-chip cloud computer with dvfs in 45 nm cmos, IEEE J. Solid-State Circuits 46 (4) (2011) 757–766.

Zhiliang Qian received his B.S. degree in Microelectronics from the Fudan University, Shanghai, China in 2008 and Ph.D. degree in Electronic and Computer Engineering from the Hong Kong University of Science and Technology, Hong Kong in 2014. He is now with the Department of Micro- and Nano- Electronics, Shanghai Jiao Tong University. His research interests include high performance Network-on-Chip design, low power VLSI implementation and embedded system design.

Chi-Ying Tsui received the B.S. degree in electrical engineering from the University of Hong Kong, Hong Kong, and the Ph.D. degree in computer engineering from the University of Southern California, Los Angeles, CA, USA, in 1994. He joined the Department of Electronic and Computer Engineering with the Hong Kong University of Science and Technology, Hong Kong, in 1994, where he is currently a Full Professor. He has authored over 180 refereed publications, and holds 10 U.S. patents on power management, VLSI, and multimedia systems. His current research interests include designing VLSI architectures for low-power multimedia and wireless applications, developing power management circuits and techniques for embedded portable devices, and ultralow-power systems. Dr. Tsui was a recipient of the Best Paper Awards from the IEEE TRANSACTIONS ON VLSI SYSTEMS in 1995, the IEEE International Symposium on Circuits and Systems in 1999, the IEEE/ACM International Symposium on Low Power Electronics and Design in 2007, the IEEE International Symposium on Electronic Design, Test and Application in 2008, and CODES in 2012. He was also a recipient of the Design Awards in the IEEE Asia and South Pacific Design Automation Conference University Design Contest in 2004 and 2006.