IEEE Micro
May/June 2014, Volume 34, Number 3
http://www.computer.org/micro

Contents
Features

Disciplined Nondeterminism
Hyojin Sung, Rakesh Komuravelli, and Sarita V. Adve

Departments

Awards: Reflections from the 2013 Eckert-Mauchly Award Recipient
IEEE Micro (ISSN 0272-1732) is published bimonthly by the IEEE Computer Society.
IEEE Headquarters, Three Park Ave., 17th Floor, New York, NY 10016-5997; IEEE
Computer Society Headquarters, 2001 L St., Ste. 700, Washington, DC 20036; IEEE
Computer Society Publications Office, 10662 Los Vaqueros Circle, PO Box 3014,
Los Alamitos, CA 90720. Annual subscription rates: IEEE Computer Society members
get the lowest rates, US$45 (print and electronic). Go to http://www.computer.org/
subscribe to order and for more information on other subscription prices. Back issues:
members, $20; nonmembers, $148. This magazine is also available on the Web.
Postmaster: Send address changes and undelivered copies to IEEE, Membership
Processing Dept., 445 Hoes Ln., Piscataway, NJ 08855. Periodicals postage is paid
at New York, NY, and at additional mailing offices. Canadian GST #125634188.
Canada Post Corp. (Canadian distribution) Publications Mail Agreement #40013885.
Return undeliverable Canadian addresses to 4960-2 Walker Road; Windsor, ON N9A
6J3. Printed in USA.
Reuse rights and reprint permissions: Educational or personal use of this material is
permitted without fee, provided such use: 1) is not made for profit; 2) includes this notice
and a full citation to the original work on the first page of the copy; and 3) does not imply
IEEE endorsement of any third-party products or services. Authors and their companies
are permitted to post the accepted version of IEEE-copyrighted material on their own
web servers without permission, provided that the IEEE copyright notice and a full
citation to the original work appear on the first screen of the posted copy. An accepted
manuscript is a version which has been revised by the author to incorporate review
suggestions, but not the published version with copy-editing, proofreading, and formatting added by IEEE. For more information, please go to http://www.ieee.org/
publications_standards/publications/rights/paperversionpolicy.html.
Permission to reprint/republish this material for commercial, advertising, or promotional purposes or for creating new collective works for resale or redistribution must be
obtained from IEEE by writing to the IEEE Intellectual Property Rights Office,
445 Hoes Lane, Piscataway, NJ 08854-4141 or pubs-permissions@ieee.org.
Copyright © 2014 IEEE. All rights reserved.
Abstracting and library use: Abstracting is permitted with credit to the source.
Libraries are permitted to photocopy for private use of patrons, provided the
per-copy fee indicated in the code at the bottom of the first page is paid through
the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
Editorial: Unless otherwise stated, bylined articles, as well as product and service descriptions, reflect the author's or firm's opinion. Inclusion in IEEE Micro does not necessarily
constitute an endorsement by IEEE or the Computer Society. All submissions are subject to
editing for style, clarity, and space. IEEE prohibits discrimination, harassment, and bullying.
For more information, visit http://www.ieee.org/web/aboutus/whatis/policies/p9-26.html.
EDITOR IN CHIEF
Erik R. Altman
Thomas J. Watson Research Center
ealtman@us.ibm.com
ASSOCIATE EDITOR IN CHIEF
Lieven Eeckhout
Ghent University
lieven.eeckhout@ugent.be
ADVISORY BOARD
David H. Albonesi, Pradip Bose, Kemal Ebcioglu,
Michael Flynn, Ruby B. Lee, Yale Patt, James E.
Smith, and Marc Tremblay
EDITORIAL BOARD
Alper Buyuktosunoglu
IBM
Pradeep Dubey
Intel Corp.
Sandhya Dwarkadas
University of Rochester
Babak Falsafi
École Polytechnique Fédérale de Lausanne
Krisztian Flautner
ARM
R. Govindarajan
Indian Institute of Science
Shane Greenstein
Northwestern University
Lizy Kurian John
University of Texas at Austin
Stephen W. Keckler
University of Texas at Austin
Margaret Martonosi
Princeton University
Richard Mateosian
Shubu Mukherjee
Cavium Networks
Toshio Nakatani
IBM
Vojin G. Oklobdzija
New Mexico State University
Ronny Ronen
Intel Corp.
Kevin W. Rudd
US Naval Academy
André Seznec
INRIA Rennes
Richard H. Stern
Olivier Temam
INRIA
Mateo Valero
Technical University of Catalonia
Tilman Wolf
University of Massachusetts, Amherst
Xiaodong Zhang
Ohio State University
EDITORIAL STAFF
Editorial Management
Molly Gamborg
Contributing Editors
Amber Ankerholz, Thomas Centrella,
Kristine Kelly, Keri Schreiner,
Dale Strok, and Joan Taylor
Submissions:
https://mc.manuscriptcentral.com/micro-cs
Author guidelines:
http://www.computer.org/micro
IEEE COMPUTER SOCIETY
PUBLICATIONS BOARD
Vice President
Jean-Luc Gaudiot
Magazine Operations Chair
Paolo Montuschi
Transactions Operations Committee
Laxmi N. Bhuyan
Digital Library Operations Committee
Frank Ferrante
Plagiarism Chair
David S. Ebert
Executive Director
Angela R. Burgess
Members-at-Large
Alain April, Greg Byrd, Robert Dupuis,
Linda I. Shafer, H.J. Siegel, and Per Stenstrom
ERIK R. ALTMAN
Thomas J. Watson Research Center
Offer Suggestions: http://scholaroneideas.force.com/ideaListCustom
Rate Ideas of Others: http://mchelp.manuscriptcentral.com/ScholarOneIdeas/howto.html
[Table 1. Mapping 2013 Top Picks articles to 2003 Top Picks categories. Columns: category; no. of 2003 articles in category; no. of 2013 articles in category. Categories: unconventional architectures; power- and temperature-aware design; reliability; cache, memory, and multiprocessor optimizations; building on conventional microarchitectures (N/A); performance analysis (N/A). Listed article titles include "Energy Optimization*" and "A Configurable and Strong RAS Solution for Die-Stacked DRAM Caches*"; the per-category counts are not legible in this copy.]
Erik R. Altman
Editor in Chief
IEEE Micro

Erik R. Altman is the manager of the Dynamic Optimization Group at the Thomas J. Watson Research Center.
GUEST EDITORS' INTRODUCTION
It gives us great pleasure to introduce the special issue of the top picks from the computer architecture conferences of 2013. The special issue presents a selection of 12 papers that describe novel, exciting research directions in areas as diverse as design of datacenters, processors and accelerators, networks on chip, programmability-enhancing frameworks, and emerging large caches.

Mithuna S. Thottethodi
Purdue University

Shubu Mukherjee
Cavium
Íñigo Goiri
William Katsak
Kien Le
Thu D. Nguyen
Ricardo Bianchini
Rutgers University
Figure 1. Parasol: outside view showing the solar panels, container, and air conditioning unit (a); power distribution and monitoring infrastructure (b). The cooling system can be powered solely by the grid, or by the main electrical panel that receives power from all sources. Meters (M) are available for measuring the power flowing into and out of every component.
Parasol

Figure 1a shows Parasol, a solar-powered datacenter that we built as a research platform to study colocation and self-generation. Parasol comprises a steel structure, a small custom container housing two racks of servers and networking equipment, an air-side economizer free-cooling unit and a direct-expansion air conditioner, 16 solar panels (producing up to 3.2 kW AC), two DC/AC inverters, 16 lead-acid batteries (storing up to 32 kWh), two charge controllers, and an extensive set of meters, either built into the equipment (for example, the inverters) or added on externally (for example, the cooling-system meter), for measuring the power flowing into and out of every component. Parasol also includes a switch that allows for powering the cooling system from the main electrical panel or only from the grid. This enables experimentation with or without the cooling system loading the solar system and batteries.

We describe our rationale for the Parasol design and the mistakes we made while building it over 16 months (at a total cost of $300,000) in our ASPLOS paper.3 In this article, we report on data gathered from operating Parasol over 22 months. Specifically, solar generation and the IT equipment became operational in April 2012, and Parasol became fully operational in June 2012.

Figure 2. Energy consumption, net metering, and temperatures from April 2012 to January 2014. The figure shows the seasonal patterns for both renewable energy generation and temperature.
Cooling

Figure 3 shows the operation of the cooling system in Parasol during the second half of August 2012. In this time period, the setpoint for internal temperature was 30°C; the dashed line shows the actual internal temperature, whereas the solid line shows the outside temperature. The light gray area shows the operation of the free-cooling unit, whereas the dark gray area shows the operation of the air conditioner.
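The article doesn't spell out the controller policy, but a minimal free-cooling-first sketch consistent with the description above might look as follows (the function name and thresholds are illustrative assumptions, not Parasol's actual controller):

    # Minimal sketch of a free-cooling-first policy; names and thresholds
    # are illustrative assumptions, not Parasol's actual controller.
    def select_cooling_mode(inside_c: float, outside_c: float,
                            setpoint_c: float = 30.0) -> str:
        """Pick which cooling unit to run for the next control interval."""
        if inside_c <= setpoint_c:
            return "off"            # already at or below the setpoint
        if outside_c < inside_c:
            return "free_cooling"   # outside air can still remove heat cheaply
        return "air_conditioner"    # fall back to the direct-expansion AC

    # Example: a hot afternoon where outside air cannot help.
    print(select_cooling_mode(inside_c=31.0, outside_c=33.0))  # air_conditioner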
[Figure 3: inside and outside temperature (°C), free-cooling and air-conditioner activity, and fan speed (percent), 15-30 Aug. 2012.]

[Figure 4: IT load, battery charge level, battery charge and discharge, grid use, and solar use (power in kW; charge level in percent), Sunday 28 Oct. through Wednesday 31 Oct. 2012.]

GreenSwitch

…which source of energy to use (renewable, battery, and/or grid), and choosing the renewable energy storage medium (battery or grid) at each point in time. GreenSwitch seeks to minimize the overall cost of grid electricity (including both grid energy and peak grid power), while respecting the characteristics of…
[Figure 5 components: battery charge level; workload prediction; energy availability prediction; Predictor; Solver; workload schedule; energy source schedule; Configurer; Parasol.]

Figure 5. GreenSwitch architecture. Rectangles with round edges are data structures. Rectangles with square borders are processes.
Architecture
Figure 5 illustrates the GreenSwitch architecture. The predictor forecasts the workload and the renewable energy production one day into the future at the granularity of one hour. The solver takes these predictions and the current battery charge level as input, and outputs a workload schedule and an energy source and storage schedule. To compute these schedules, the solver uses analytical models of workload behavior, battery use, and grid electricity cost. The configurer effects the changes prescribed by the solver. The changes may involve transitioning some servers between power states and/or changing the configuration of the energy sources. (We have identified configuration parameters to the inverters and charge controllers that give us nearly full dynamic control of every source of energy available to Parasol.)

A full iteration of GreenSwitch occurs every 15 minutes, which enables it to properly control peak grid power use. (Utilities typically compute peak grid power use in windows of 15 minutes.) However, GreenSwitch…
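As a concrete illustration of the predictor/solver/configurer pipeline, the following Python sketch walks through one hypothetical control iteration; the naive forecast and greedy solver are deliberately trivial stand-ins for GreenSwitch's analytical models:

    # Hypothetical sketch of one GreenSwitch iteration; the forecast and
    # solver below are trivial stand-ins, not GreenSwitch's actual models.

    def predict_day(history_by_hour):
        """Forecast the next 24 hours at one-hour granularity by reusing
        the most recent 24 observations (a naive persistence forecast)."""
        return list(history_by_hour[-24:])

    def solve(load_kw, solar_kw, battery_kwh):
        """Greedy stand-in for the analytical solver: prefer solar, then
        battery, then grid, hour by hour."""
        plan = []
        for load, solar in zip(load_kw, solar_kw):
            deficit = max(load - solar, 0.0)
            from_battery = min(deficit, battery_kwh)
            battery_kwh -= from_battery
            plan.append({"solar": min(load, solar),
                         "battery": from_battery,
                         "grid": deficit - from_battery})
        return plan

    def iteration(load_history, solar_history, battery_kwh):
        # Rerun every 15 minutes, matching the windows in which utilities
        # typically compute peak grid power.
        plan = solve(predict_day(load_history), predict_day(solar_history),
                     battery_kwh)
        return plan  # the configurer would now apply server/inverter settings

    solar = [0.0] * 6 + [2.5] * 8 + [0.0] * 10          # one sunny day, hourly
    plan = iteration([1.5] * 24, solar, battery_kwh=16.0)
    print(plan[0], plan[8])  # night hour draws on battery; midday runs on solar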
Figure 6. GreenSwitch on deferrable Facebook workload. Most of the load during the night was delayed until renewable energy became available. Batteries were used when no renewable energy was available.
Renewable energy

As we mentioned earlier, several companies are starting to invest in datacenter colocation and self-generation. Regardless of whether they're making these investments for market positioning, public relations, cost, or environmental reasons, the fact is that they are expecting bottom-line benefits from them. Moreover, despite their decreasing but still-high capital costs, exploiting renewables in datacenters could reduce overall energy costs, peak grid power costs, or both, as our ASPLOS paper explains. We expect that an increasing number of companies will see benefits in exploiting renewables.
Some research groups have also started
studying colocated and self-generating datacenters.4,5,7,11,12 These studies have been
attracting the attention of a growing community, with publications in venues such as
the International Symposium on Computer
Architecture (ISCA) and the International
Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS). We expect that our design
and experience with Parasol will accelerate
this growth, as researchers realize that they
can build nontrivial prototypes at relatively
low cost. Moreover, our analysis of solar and
wind energy cost and space requirements suggests that green datacenters will become
increasingly attractive.3
More broadly than datacenters, our experience will likely encourage more researchers
to consider the implications of external signals (such as variable-electricity pricing and
availability) on computing and communication in general.
Research avenues

Parasol and GreenSwitch create many new research avenues. For example, Parasol enables the study of the interplay between solar energy and free cooling; interestingly, solar energy is most abundant when the outside temperature is hottest (that is, when standard chiller-based cooling might be necessary in warm climates). As another example, GreenSwitch demonstrates the benefits of aggressive and coordinated management of energy sources and stores and workload execution, as well as the interplay between using batteries for powering the workload and for storing renewable energy. Prior work on aggressive use of batteries did not consider renewables.13
Acknowledgments

We thank Abhishek Bhattacharjee, David Meisner, Santosh Nagarakatte, Anand Sivasubramaniam, and Thomas F. Wenisch for comments.

References

1. J. Koomey, Growth in Data Center Electricity Use 2005 to 2010, Analytic Press, 2011.
2. J. Mankoff, R. Kravets, and E. Blevis, "Some Computer Science Issues in Creating a Sustainable World," Computer, vol. 41, no. 8, 2008, pp. 102-105.
3. I. Goiri et al., "Parasol and GreenSwitch: Managing Datacenters Powered by Renewable Energy," Proc. 18th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 13), 2013, pp. 51-64.
4. B. Aksanli et al., "Utilizing Green Energy Prediction to Schedule Mixed Batch and Service Jobs in Data Centers," Proc. 4th Workshop Power-Aware Computing and Systems (HotPower 11), 2011, article no. 5.
5. I. Goiri et al., "GreenSlot: Scheduling Energy Consumption in Green Datacenters," Proc. Int'l Conf. High Performance Computing, Networking, Storage and Analysis (SC 11), 2011, article no. 20.
6. I. Goiri et al., "GreenHadoop: Leveraging Green Energy in Data-Processing Frameworks," Proc. 7th ACM European Conf. Computer Systems (EuroSys 12), 2012, pp. 57-70.
7. A. Krioukov et al., "Integrating Renewable Energy Using Data Analytics Systems: Challenges and Opportunities," Data Eng. Bulletin, vol. 34, no. 1, 2011, pp. 3-11.
8. Z. Liu et al., "Renewable and Cooling Aware Workload Management for Sustainable Data Centers," Proc. 12th ACM SIGMETRICS/PERFORMANCE Joint Int'l Conf. Measurement and Modeling of Computer Systems, 2012.
9. … 399.
10. EPFL, "CloudSuite," 2012; http://parsa.epfl.ch/cloudsuite/cloudsuite.html.
11. C. Li, A. Qouneh, and T. Li, "iSwitch: Coordinating and Optimizing Renewable Energy Powered Server Clusters," Proc. 39th Ann. Int'l Symp. Computer Architecture (ISCA 12), 2012, pp. 512-523.
12. … Proc. 17th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 12), 2012, pp. 75-86.
Íñigo Goiri is a research associate in the Department of Computer Science at Rutgers University. His research interests include energy-efficient datacenter design and virtualization.
QUALITY-OF-SERVICE-AWARE SCHEDULING IN HETEROGENEOUS DATACENTERS WITH PARAGON
Efficiency is a first-class requirement and the main source of scalability concerns both for small and large systems.1,2 Achieving high efficiency is not only a matter of sensible design, but also a function of how the system is managed, which becomes essential as the hardware grows progressively heterogeneous and parallel and applications get dynamic and diverse. Architecture has traditionally been about efficient system design. As efficiency increases in importance, architecture should be about both design and management for systems of any scale.

In this article, we focus on improving efficiency while guaranteeing high performance in large-scale systems. Although an increasing amount of computing now happens in public and private clouds, such as Amazon Elastic Compute Cloud (EC2; see http://aws.amazon.com/ec2) or vSphere (www.vmware.com/products/vsphere)…

Christina Delimitrou
Christos Kozyrakis
Stanford University
Interference occurs as coscheduled applications contend in shared resources. Coscheduled applications may interfere negatively even if they run on different processor cores because they share caches, memory channels, storage, and networking devices.1,2 If unmanaged, interference can result in performance degradations of integer factors,2 especially when the application must meet tail latency guarantees apart from average performance.3 Figure A shows that an interference-oblivious scheduler will slow workloads down by 34 percent on average, with some running more than two times slower. This is undesirable for both users and operators.

Heterogeneity is the natural result of the infrastructure's evolution, as servers are gradually provisioned and replaced over the typical 15-year lifetime of a datacenter.4-7 At any point in time, a datacenter may host three to five server generations with a few hardware configurations per generation, in terms of the processor speed, memory, storage, and networking subsystems. Managing the different hardware incorrectly not only causes significant performance degradations to applications sensitive to server configuration, but also wastes resources as workloads occupy servers for significantly longer, and gives a low-quality signal to hardware vendors for the design of future platforms. Figure A shows that a heterogeneity-oblivious scheduler will slow applications down by 22 percent on average, with some running nearly 2 times slower (see the Methodology section in the main article).

Finally, a baseline scheduler that is oblivious to both interference and heterogeneity and which schedules applications to least-loaded servers is even worse (48 percent average slowdown), causing some workloads to crash due to resource exhaustion on the server. Unless interference and heterogeneity are managed in a coordinated fashion, the system loses both its efficiency and predictability guarantees. Previous research has identified the issues of heterogeneity6 and interference,2 but while most cloud management systems, such as Mesos8 or vSphere (www.vmware.com/products/vsphere), have some notion of contention or interference awareness, they either use empirical rules for interference management or assume long-running workloads (for example, online services), whose repeated behavior can be progressively modeled. In this article, we target both heterogeneity and interference and assume no a priori analysis of the application. Instead, we leverage information the system already has about the large number of applications it has previously seen.
[Figure A: performance normalized to isolated execution (0.0-1.0) for up to 5,000 workloads; legend fragments: no interference, least loaded.]
…datacenter scheduler that accounts for heterogeneity and interference. The key feature of Paragon is its ability to quickly and accurately classify an unknown application with respect to heterogeneity (which server configurations it will perform best on) and interference (how much interference it will cause to coscheduled applications and how much interference it can tolerate itself in multiple shared resources). Unlike previous techniques that require detailed profiling of each incoming application, Paragon's classification engine exploits existing data from previously scheduled workloads and requires only a minimal signal about a new workload. Specifically, it is organized as a low-overhead recommendation system similar to the one deployed for the Netflix Challenge,6 but instead of discovering similarities in users' movie preferences, it finds similarities in applications' preferences with respect to heterogeneity and interference. It uses singular value decomposition (SVD) to perform collaborative filtering and identify similarities between incoming and previously scheduled workloads.

Once an incoming application is classified, a greedy scheduler assigns it to the server that is the best possible match in terms of platform and minimum negative interference between all coscheduled workloads. Even though the final step is greedy, the high accuracy of classification leads to schedules that achieve both fast execution time and efficient resource usage. Paragon scales to systems with tens of thousands of servers and tens of configurations, running large numbers of previously unknown workloads. We implemented Paragon and showed that it significantly improves cluster utilization, while preserving per-application quality-of-service (QoS) guarantees both for small- and large-scale systems. For more information on related work, see the "Research Related to Paragon" sidebar.
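As an illustration of this greedy matching step, the sketch below scores each server by predicted platform fit minus predicted interference; the data layout and scoring function are assumptions for illustration, not Paragon's exact objective:

    # Illustrative sketch of greedy, classification-driven server selection.
    # Field names and the score are assumptions, not Paragon's exact design.
    from dataclasses import dataclass

    @dataclass
    class App:
        platform_score: dict    # server configuration -> predicted performance
        caused: float           # interference the app causes (0..1)
        tolerated: float        # interference the app can tolerate (0..1)

    @dataclass
    class Server:
        config: str
        caused_load: float      # cumulative interference caused by residents
        min_tolerated: float    # weakest resident's interference tolerance

    def pick_server(app, servers):
        """Greedy match: best platform fit among servers where neither the
        new app nor the resident apps exceed their interference budgets."""
        feasible = [s for s in servers
                    if s.caused_load <= app.tolerated      # app tolerates server
                    and app.caused <= s.min_tolerated]     # residents tolerate app
        return max(feasible,
                   key=lambda s: app.platform_score[s.config] - s.caused_load,
                   default=None)

    app = App({"sc1": 0.9, "sc2": 0.6}, caused=0.3, tolerated=0.5)
    servers = [Server("sc1", 0.6, 0.9), Server("sc2", 0.2, 0.4)]
    print(pick_server(app, servers).config)  # "sc2": sc1's load exceeds tolerance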
Datacenter scheduling
Recent work on datacenter scheduling has highlighted the importance of platform heterogeneity and workload interference. Mars
et al. showed that the performance of Google workloads can vary by
up to 40 percent because of heterogeneity, even when considering
only two server configurations, and by up to 2 times because of interference, even when considering only two colocated applications.1,2
Govindan et al. also present a scheme to quantify the effects of cache
interference between consolidated workloads.3 In Paragon, we extend
the concepts of heterogeneity- and interference-aware scheduling by
providing an online, scalable, and low-overhead methodology that
accurately classifies applications for both heterogeneity and interference across multiple resources.
References

1. J. Mars, L. Tang, and R. Hundt, "Heterogeneity in 'Homogeneous' Warehouse-Scale Computers: A Performance Opportunity," IEEE Computer Architecture Letters, vol. 10, no. 2, 2011, pp. 29-32.

VM management
…matrix

$$A = \begin{pmatrix} a_{1,1} & \cdots & a_{1,n} \\ a_{2,1} & \cdots & a_{2,n} \\ \vdots & & \vdots \\ a_{m,1} & \cdots & a_{m,n} \end{pmatrix},$$

which SVD factors as $A = U\,\Sigma\,V^{T}$, where

$$U = \begin{pmatrix} u_{1,1} & \cdots & u_{1,r} \\ \vdots & & \vdots \\ u_{m,1} & \cdots & u_{m,r} \end{pmatrix}_{m\times r},\qquad V = \begin{pmatrix} v_{1,1} & \cdots & v_{1,r} \\ \vdots & & \vdots \\ v_{n,1} & \cdots & v_{n,r} \end{pmatrix}_{n\times r},\qquad \Sigma = \begin{pmatrix} \sigma_1 & & 0 \\ & \ddots & \\ 0 & & \sigma_r \end{pmatrix}_{r\times r}.$$
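The article gives no code for this step, but a minimal collaborative-filtering sketch in this spirit, using numpy's SVD over a mean-filled score matrix (a simplification of the SVD-plus-PQ-reconstruction pipeline; the rank and fill strategy are illustrative assumptions), might look like:

    # Minimal sketch of SVD-based collaborative filtering for missing scores,
    # assuming a dense mean-fill instead of Paragon's PQ reconstruction.
    import numpy as np

    def fill_missing_scores(A, rank=2):
        """A: apps x configs score matrix with np.nan for unknown entries.
        Returns a low-rank estimate that fills in the missing scores."""
        filled = np.where(np.isnan(A), np.nanmean(A), A)   # crude initial fill
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        # keep only the top-'rank' similarity concepts
        return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

    A = np.array([[5.0, 4.0, np.nan],
                  [4.0, np.nan, 2.0],
                  [1.0, 2.0, 5.0]])
    print(np.round(fill_missing_scores(A), 1))  # estimated full score matrix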
[Table: classification accuracy (percent) for multithreaded, multiprogrammed, and I/O-bound workloads; individual cell values are not attributable in this copy.]

[Table 2: summary statistics (percent) on the quality of interference classification.]
…against two randomly chosen microbenchmarks for one minute, and its sensitivity scores are added in a new row in each of the matrices. Then, we use SVD and PQ reconstruction to derive the missing entries and the confidence in each similarity concept.

Validation. We evaluated the accuracy of interference classification using the same workloads and systems as before. Table 2 summarizes key statistics on the classification quality. The average error in estimating both tolerated and caused interference across SoIs is 5.3 percent. For high values of sensitivity (that is, applications that tolerate and cause a lot of interference), the error is even lower (3.4 percent).
[Figure 1 panels: heterogeneity scores and interference scores derived via SVD and PQ reconstruction (U, Σ, V matrices); selection of colocation candidates (state: N×16 bytes); Step 2: server selection across DC servers.]

Figure 1. The components of Paragon and the state maintained by each component. Overall, the state requirements are marginal and scale linearly or logarithmically with the number of applications (N), servers (M), and configurations. (PQ: PQ reconstruction; SVD: singular value decomposition; DC: datacenter.)
…profiling overheads. In our full paper,5 we discuss the issue of workload phases (that is, transient effects that do not appear in the 1-minute profiling period). Next, we use collaborative filtering to classify the application in terms of heterogeneity and interference. This requires a few milliseconds even when considering thousands of applications and several tens of SCs or SoIs. Classification for heterogeneity and interference is performed in parallel. For the applications we considered, the overall profiling and classification overheads are 1.2 and 0.09 percent on average.

Using analytical methods for classification has two benefits. First, we have strong analytical guarantees on the quality of the information used for scheduling, instead of relying mainly on empirical observation. The analytical framework provides low and tight error bounds on the accuracy of classification, statistical guarantees on the quality of colocation candidates, and detailed characterization of system behavior. Moreover, the scheduler design is workload independent, which means that the properties the scheme provides hold for any workload. Second, these methods are computationally efficient, scale well with the number of applications and SCs, and do not introduce significant scheduling overheads.
Paragon
Scheduler design

Figure 1 presents an overview of Paragon's components and operation. The scheduler maintains per-application and per-server state. The per-application state includes the classification information; for a datacenter with 10 SCs and 10 SoIs, it is 64 bytes per application. The per-server state records the IDs of applications running on a server and the cumulative sensitivity to interference (roughly 64 bytes per server). The per-server state is updated as applications are scheduled and, later on, completed. Overall, state overheads are marginal and scale logarithmically or linearly with the number of applications (N) and servers (M). In our experiments with thousands of applications and servers, a single server could handle all processing and storage requirements of scheduling, although additional servers can be used for fault tolerance.
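To make the bookkeeping concrete, here is a hypothetical packing of the per-application and per-server records described above, sized to match the rough 64-byte figures; the exact encoding is an assumption, since the article gives only the totals:

    # Hypothetical layout of Paragon's scheduler state; the field choices
    # are assumptions sized to match the rough 64-byte figures quoted above.
    import struct

    # Per-application: 10 server-configuration scores (floats), 10 pairs of
    # caused/tolerated interference scores (bytes), and an application id.
    APP_STATE = struct.Struct("10f20BI")      # 64 bytes on typical platforms

    # Per-server: config id, resident count, per-SoI cumulative sensitivity,
    # and a small fixed array of resident application ids.
    SERVER_STATE = struct.Struct("HB10B12I")  # 64 bytes on typical platforms

    print(APP_STATE.size, SERVER_STATE.size)  # -> 64 64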
Evaluation methodology

In the following paragraphs, we describe the server systems, alternative schedulers, applications, and workload scenarios used in our evaluation.

We evaluated Paragon on a 1,000-server cluster on Amazon EC2 with 14 instance types from small to extra large.12 All instances were exclusive (reserved); that is, no other users had access to the servers. There were no external scheduling decisions or actions such as auto-scaling or workload migration during the course of the experiments.

We compared Paragon to three schedulers. The first is a baseline scheduler that assigns applications to least-loaded (LL) machines, accounting for their core and memory requirements but ignoring their heterogeneity and interference profiles. The second is a heterogeneity-oblivious (NH) scheme that uses the interference classification in Paragon to assign applications to servers without visibility into their SCs. The third is an interference-oblivious (NI) scheme that uses the heterogeneity classification but has no insight on workload interference.

We used 400 single-threaded (ST), multithreaded (MT), and multiprogrammed (MP) applications from SPEC CPU2006, several multithreaded benchmark suites,5 and SPECjbb. For multiprogrammed workloads, we created 350 mixes of four SPEC applications. We also used 26 I/O-bound workloads in Hadoop and Matlab running on a single node. Workload durations range from minutes to hours. For workload scenarios with more than 426 applications, we replicated these workloads with equal likelihoods (1/4 ST, 1/4 MT, 1/4 MP, and 1/4 I/O) and randomized their interleaving.
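The article doesn't give the generation code; a minimal sketch of this replication-and-interleaving step, under the stated 1/4 class mix and randomized arrival order (pool contents and names are illustrative assumptions), could be:

    # Minimal sketch of scenario generation: replicate the base workloads
    # with equal class likelihoods and randomize their interleaving.
    # Class names and pool contents are illustrative assumptions.
    import random

    def make_scenario(pools, total, seed=0):
        """pools: dict class -> list of base workloads (ST, MT, MP, IO).
        Returns 'total' workloads drawn 1/4 from each class, shuffled;
        submission then proceeds at one workload per second."""
        rng = random.Random(seed)
        scenario = [rng.choice(pools[c])
                    for c in pools for _ in range(total // 4)]
        rng.shuffle(scenario)
        return scenario

    pools = {"ST": ["mcf", "milc"], "MT": ["canneal"],
             "MP": ["mix-1"], "IO": ["hadoop-wc"]}
    scenario = make_scenario(pools, total=5000)
    print(len(scenario), scenario[:3])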
We used the applications listed in this section to examine the following scenarios: a low-load scenario with 2,500 randomly chosen applications submitted with 1-second intervals, a high-load scenario with 5,000 applications submitted with 1-second intervals, and an oversubscribed scenario where 7,500 workloads are submitted with 1-second intervals and an additional 1,000 applications arrive in…
[Figure 2 panels: low load (a), high load (b), and oversubscribed (c); y-axis: speedup over running alone on the best platform (0.0-1.2); x-axis: workloads. Legend: alone on best platform; no heterogeneity (NH); Paragon (P); no interference (NI).]

Figure 2. Performance comparison between the four schedulers for three workload scenarios on 1,000 Amazon Elastic Compute Cloud (EC2) servers. Performance is normalized to optimal performance in isolation, and applications are ordered from worst to best performing.
Evaluation

We evaluated the Paragon scheduler against the LL, NH, and NI schedulers, with respect to performance, decision quality, resource allocation, and cluster utilization.

Performance impact

…EC2 cluster. The low-load scenario, in general, does not create significant performance challenges. Nevertheless, Paragon outperforms the other three schemes; it preserves QoS for 91 percent of workloads and achieves on average 96 percent of the performance of a workload running in isolation in the best SC. When moving to the high-load scenario, the difference between schedulers becomes more obvious. Although the heterogeneity- and interference-oblivious schemes degrade performance by an average…
[Figure 3 panels: percentage of applications with no degradation, less than 20 percent degradation, and more than 20 percent degradation under the LL, NH, NI, and P schedulers for the low-load, high-load, and oversubscribed scenarios.]

Figure 3. Breakdown of decision quality for the four schedulers across the three EC2 scenarios. Different colors correspond to different impacts on application performance in terms of heterogeneity (left) and interference (right).
Decision quality

Figure 3 shows a breakdown of the decision quality of the different schedulers for heterogeneity (left) and interference (right) across the three scenarios. LL induces more…
Resource allocation

Figure 4 shows why this deviation exists. The solid black line in each graph represents the required core count based on the applications running at a snapshot of the system, while the other lines show the allocated cores by each of the schedulers. Because Paragon optimizes for increased utilization within QoS constraints, it follows the application requirements closely. It only deviates when the required core count exceeds the resources available in the system (oversubscribed case). NH has mediocre accuracy, whereas NI and LL either significantly overprovision the number of allocated cores, or oversubscribe certain servers. There are two important points in…
[Figure 4 panels: required vs. allocated core count over time (minutes) for the low-load (a), high-load (b), and oversubscribed (c) scenarios; legend: required; no interference (NI); no heterogeneity (NH); Paragon (P).]

Figure 4. Resource allocation for the three workload scenarios. Each line corresponds to the number of allocated computing cores at each point during the execution of the scenario. Although the heterogeneity-oblivious (NH), interference-oblivious (NI), and least-loaded (LL) schedulers under- or overestimate the required resources, Paragon closely follows the application resource requirements.
[Figure 5 panels: least loaded (a) and Paragon (b); y-axis: servers (up to 1,000); x-axis: time (minutes); shading: CPU utilization (0-100 percent).]

Figure 5. CPU utilization heat maps for the high-load scenario for the least-loaded system and Paragon. Utilization is averaged across the cores of a server and is sampled every 5 seconds. Darker colors correspond to higher CPU utilization in the heat maps.
Cluster utilization

Figure 5 shows the cluster utilization in the high-load scenario for LL and Paragon in the form of heat maps. Utilization is shown…
Acknowledgments

We sincerely thank John Ousterhout, Mendel Rosenblum, Byung-Gon Chun, Daniel Sanchez, Jacob Leverich, David Lo, and the anonymous reviewers for their feedback on earlier versions of this manuscript. This work was partially supported by a Google-directed research grant on energy-proportional computing. Christina Delimitrou was supported by a Stanford Graduate Fellowship.
…resource efficiency, and datacenter application analysis and modeling. Delimitrou has an MS in electrical engineering from Stanford University. She is a student member of IEEE and the ACM.

Christos Kozyrakis is an associate professor in the Departments of Electrical Engineering and Computer Science at Stanford University, where he investigates hardware architectures, system software, and programming models for systems ranging from cell phones to warehouse-scale datacenters. His research focuses on resource-efficient cloud computing, energy-efficient multicore systems, and architectural support for security. Kozyrakis has a PhD in computer science from the University of California, Berkeley. He is a senior member of IEEE and the ACM.

Direct questions and comments about this article to Christina Delimitrou, Gates Hall, 353 Serra Mall, Room 316, Stanford, CA 94305; cdel@stanford.edu.
…for servers and for the general-purpose market, leading to extreme inefficiency in today's datacenters. Moreover, both general-purpose and traditional server processor designs follow a trajectory that benefits scale-up workloads, a trend that was established for desktop processors long before the emergence of scale-out workloads.

In this article, based on our paper for the 17th International Conference on Architectural Support for Programming Languages and Operating Systems,3 we observe that scale-out workloads share many inherent characteristics that place them into a workload class distinct from desktop, parallel, and traditional server workloads. We perform a detailed microarchitectural study of a range of scale-out workloads, finding a large mismatch between the demands of the scale-out workloads and today's predominant processor microarchitecture. We observe significant overprovisioning of the memory hierarchy and core microarchitectural resources for the scale-out workloads.

We use performance counters to study the behavior of scale-out workloads running on…

Michael Ferdman
Stony Brook University

Almutaz Adileh
Ghent University

Onur Kocberber
Stavros Volos
Mohammad Alisafaee
Djordje Jevdjic
Cansu Kaynak
Adrian Daniel Popescu
Anastasia Ailamaki
Babak Falsafi
École Polytechnique Fédérale de Lausanne
Today's datacenters are built around conventional desktop processors whose architecture was designed for a broad market. The dominant processor architecture has closely followed the technology trends, improving single-thread performance with each processor generation by using the increased clock speeds and free (in area and power) transistors provided by progress in semiconductor manufacturing. Although Dennard scaling has stopped,1,2,6,7 with both clock frequency and transistor counts becoming limited by power, processor architects have continued to spend resources on improving single-thread performance for a broad range of applications at the expense of area and power efficiency.

In this article, we study a set of applications that dominate today's cloud infrastructure. We examined a selection of Internet services on the basis of their popularity. For each popular service, we analyzed the class of application software used by major providers to offer these services, either on their own cloud infrastructure or on a cloud infrastructure leased from a third party. Overall, we found that scale-out workloads have similar characteristics. All applications we examined

- operate on large data sets that are distributed across a large number of machines, typically into memory-resident shards;
- serve large numbers of completely independent requests that do not share any state;
- have application software designed specifically for the cloud infrastructure, where unreliable machines may come and go; and
- use connectivity only for high-level task management and coordination.
Specifically, we identified and studied the following workloads: an in-memory object cache (Data Caching); a NoSQL persistent data store (Data Serving); data filtering, transformation, and analysis (MapReduce); a video-streaming service (Media Streaming); a large-scale irregular engineering computation (SAT Solver); a dynamic Web 2.0 service (Web Frontend); and an online search engine node (Web Search). To highlight the differences between scale-out workloads and traditional workloads, we evaluated cloud workloads alongside the following traditional benchmark suites: Parsec 2.1 parallel workloads, SPEC CPU2006 desktop and engineering workloads, SPECweb09 traditional web services, the TPC-C traditional transaction processing workload, the TPC-E modern transaction…
Table 1. Key architectural parameters of each blade (component: details).

- Processor: …
- Reorder buffer: 128 entries
- Load-store queue: 48/32 entries
- Reservation stations: 36 entries
- Level-1 caches: …
- Level-2 cache: 256 Kbytes per core, six-cycle access latency
- Memory: …
Methodology

We conducted our study on a PowerEdge M1000e enclosure with two Intel X5670 processors and 24 Gbytes of RAM in each blade, using Intel VTune to analyze the system's microarchitectural behavior. Each Intel X5670 processor includes six aggressive out-of-order processor cores with a three-level cache hierarchy: the L1 and L2 caches are private to each core; the last-level cache (LLC), the L3 cache, is shared among all cores. Each core includes several simple stride and stream prefetchers, labeled as "adjacent-line," "HW prefetcher," and "DCU streamer" in the processor documentation and system BIOS settings. The blades use high-performance Broadcom server network interface controllers (NICs) with drivers that support multiple transmit queues and receive-side scaling. The NICs are connected by a built-in M6220 switch. For bandwidth-intensive benchmarks, 2-Gbit NICs are used in each blade.

Table 1 summarizes the blades' key architectural parameters. We limited all workload configurations to four cores, tuning the workloads to achieve high utilization of the cores (or hardware threads, in the case of the SMT experiments), while maintaining the workload quality-of-service requirements. To ensure that all application and operating…
Results

We explore the microarchitectural behavior of scale-out workloads by examining the commit-time execution breakdown in Figure 1. We classify each cycle of execution as Committing if at least one instruction was committed during that cycle, or as Stalled otherwise. We note that computing a breakdown of the execution-time stall components of superscalar out-of-order processors cannot be performed precisely because of overlapped work in the pipeline. We therefore present execution-time breakdown results based on the performance counters that have no overlap. Alongside the breakdown, we show the Memory cycles, which approximate time spent on long-latency memory accesses, but potentially partially overlap with instruction commits.
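As a rough illustration of this counter-based breakdown, the following sketch classifies sampled cycles; the counter names and counts are placeholders, not the exact performance events we used:

    # Rough sketch of the commit-time execution breakdown described above.
    # Counter names/values are placeholders, not the exact events used.

    def breakdown(cycles_total, cycles_with_commit, memory_stall_cycles):
        """Classify cycles as Committing vs. Stalled, and report Memory
        cycles separately (they may overlap with commits)."""
        committing = cycles_with_commit / cycles_total
        stalled = 1.0 - committing
        memory = memory_stall_cycles / cycles_total  # approximate, may overlap
        return {"committing": committing, "stalled": stalled, "memory": memory}

    # Example with made-up counts for one workload sample.
    print(breakdown(cycles_total=1_000_000,
                    cycles_with_commit=350_000,
                    memory_stall_cycles=520_000))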
The execution-time breakdown of scale-out workloads is dominated by stalls in both the application code and operating system. Notably, most of the stalls in scale-out workloads arise because of long-latency memory accesses. This behavior is in contrast to the CPU-intensive desktop (SPEC2006) and parallel (Parsec) benchmarks, which stall execution significantly less than 50 percent of the cycles and experience only a fraction of the stalls due to memory accesses.
[Figure 1: percentage of execution cycles (0-100 percent) per workload; bars broken into Stalled (OS), Stalled (Application), Committing (Application), and Committing (OS), with Memory cycles shown alongside.]

Figure 1. Execution-time breakdown and memory cycles of scale-out workloads (left) and traditional benchmarks (right). Execution time is further broken down into its application and operating system components.
Furthermore, although the execution-time breakdown of some scale-out workloads (such as MapReduce and SAT Solver) appears similar to that of the memory-intensive Parsec and SPEC2006 benchmarks, the nature of these workloads' stalls is different. Unlike the scale-out workloads, many Parsec and SPEC2006 applications frequently stall because of pipeline flushes after wrong-path instructions, with much of the memory access time not on the critical path of execution.
Scale-out workloads show memory system behavior that more closely matches traditional online transaction processing workloads (TPC-C, TPC-E, and Web Backend). However, we observe that scale-out workloads differ considerably from traditional online transaction processing (TPC-C), which spends more than 80 percent of the time stalled owing to dependent memory accesses. We find that scale-out workloads are most similar to the more recent transaction processing benchmarks (TPC-E) that use more complex data schemas or perform more complex queries than traditional transaction processing. We also observe that a traditional enterprise web workload (SPECweb09) behaves differently than the Web Frontend workload, which is representative of modern scale-out configurations. Whereas the traditional web workload is dominated by serving static files and a few dynamic scripts, modern scalable web workloads handle a far larger fraction of dynamic requests.
Front-end inefficiencies
There are three major front-end inefficiencies.
Figure 2. L1 and L2 instruction miss rates for scale-out workloads (left) and traditional benchmarks (right). The miss rate is
broken down into its application and operating system components.
Front-end stalls serve as a fundamental source of inefficiency for both area and power, because the core real estate and power consumption are entirely wasted for the cycles that the front end spends fetching instructions.

Figure 2 presents the instruction miss rates of the L1 instruction cache and the L2 cache. In contrast to desktop and parallel benchmarks, the instruction working sets of many scale-out workloads considerably exceed the capacity of the L1 instruction cache, resembling the instruction-cache behavior of traditional server workloads. Moreover, the instruction working sets of most scale-out workloads also exceed the L2 cache capacity, where even relatively infrequent instruction misses incur considerable performance penalties. We find that modern processor architectures can't tolerate the latency of L1 instruction-cache misses, avoiding front-end stalls only for applications whose entire instruction working set fits into the L1 cache. Furthermore, the high L2 instruction miss rates indicate that the L1 instruction cache's capacity falls significantly short and can't be remedied by the addition of a modestly sized L2 cache.

The disparity between the needs of the scale-out workloads and the processor architecture is apparent in the instruction-fetch path.
Figure 3. The instructions per cycle (IPC) and memory-level parallelism (MLP) of a simultaneous multithreading (SMT)-enabled core. Application IPC for systems with and without SMT, out of a maximum IPC of 4 (a). MLP for systems with and without SMT (b). Range bars indicate the minimum and maximum of the corresponding group.
Core inefficiencies
There are two major core inefficiencies.
Data-access inefficiencies
There are two major data-access inefficiencies.
[Figure: user IPC normalized to the baseline versus LLC capacity, for the server workloads and SPEC2006 (mcf).]
[Figure 5: L2 hit ratios across scale-out workloads and traditional benchmarks, with all prefetchers enabled (Baseline) and with the prefetchers disabled.]
the polluter threads achieve nearly a 100 percent hit ratio in the LLC, effectively reducing the cache capacity available for the workload running on the remaining cores of the same processor.

We plot the average system performance of scale-out workloads as a function of the LLC capacity, normalized to a baseline system with a 12-Mbyte LLC. Unlike in the memory-intensive desktop applications (such as SPEC2006 mcf), we find minimal performance sensitivity to LLC size above 4 to 6 Mbytes in scale-out and traditional server workloads. The LLC captures the instruction working sets of scale-out workloads, which are less than 2 Mbytes. Beyond this point, small shared supporting structures may consume another 1 to 2 Mbytes. Because scale-out workloads operate on massive datasets and service a large number of concurrent requests, both the dataset and the per-client data are orders of magnitude larger than the available on-chip cache capacity. As a result, an LLC that captures the instruction working set and minor supporting data structures achieves nearly the same performance as an LLC with double or triple the capacity.

In addition to leveraging MLP to overlap demand requests from the processor core, modern processors use prefetching to speculatively increase MLP. Prefetching has been shown effective at reducing cache miss rates by predicting block addresses that will be referenced in the future and bringing these blocks into the cache prior to the processor's demand, thereby hiding the access latency. In Figure 5, we present the hit ratios of the L2 cache when all available prefetchers are enabled (Baseline), as well as the hit ratios after disabling the prefetchers. We observe a noticeable degradation of the L2 hit ratios of many desktop and parallel applications when the adjacent-line prefetcher and L2 hardware prefetcher are disabled. In contrast, only one of the scale-out workloads (MapReduce) significantly benefits from these prefetchers, with the majority of the workloads experiencing negligible changes in the cache hit rate. Moreover, similar to traditional server workloads (TPC-C), disabling the prefetchers results in an increase in the hit ratio for some scale-out workloads (Data Caching, Media Streaming, and SAT Solver).
[Figure 6: fraction of L2 misses that access data most recently written by a thread on a remote core, broken into Application and OS components (y-axis 0 to 10 percent; one value reaches 23 percent).]
Bandwidth inefficiencies
There are several major bandwidth inefficiencies.
Increasing core counts have brought parallel programming into the mainstream, highlighting the need for fast and high-bandwidth inter-core communication. Multithreaded applications comprise a collection of threads that work in tandem to scale up the application performance. To enable effective scale-up, each subsequent generation of processors offers a larger core count and improves the on-chip connectivity to support faster and higher-bandwidth core-to-core communication.

We investigate the utility of the on-chip interconnect for scale-out workloads in Figure 6. To measure the frequency of read-write sharing, we execute the workloads on cores split across two physical processors in separate sockets. When a recently modified block is read, this configuration forces accesses to actively shared read-write blocks to appear as off-chip accesses to a remote processor cache. We plot the fraction of L2 misses that access data most recently written by another thread running on a remote core, breaking down each bar into Application and OS components.
[Figure: off-chip memory bandwidth utilization (0 to 20 percent), broken into Application and OS components, across scale-out workloads and traditional benchmarks.]
Acknowledgments
We thank the reviewers and readers for their feedback and suggestions on all earlier versions of this work.
3D-die stacking, with an emphasis on locality and energy efficiency. Jevdjic has an MSc in electrical and computer engineering from the University of Belgrade.

Cansu Kaynak is a PhD candidate in the School of Computer and Communication Sciences at École Polytechnique Fédérale de Lausanne. Her research focuses on server systems, especially memory system design. Kaynak has a BSc in computer engineering from TOBB University of Economics and Technology.
Adrian Daniel Popescu is a PhD candidate in the School of Computer and Communication Sciences at École Polytechnique Fédérale de Lausanne. His research focuses on the intersection of database management systems with distributed systems, specifically query performance prediction. Popescu has an MSc in electrical and computer engineering from the University of Toronto.
Anastasia Ailamaki is a professor in the School of Computer and Communication Sciences at École Polytechnique Fédérale de Lausanne. Her research interests include optimizing database software for emerging hardware and I/O devices and automating database management to support scientific applications. Ailamaki has a PhD in computer science from the University of Wisconsin-Madison.
Babak Falsafi is a professor in the School of Computer and Communication Sciences at École Polytechnique Fédérale de Lausanne and the founding director of EcoCloud, an interdisciplinary research center targeting robust, economic, and environmentally friendly cloud technologies. Falsafi has a PhD in computer science from the University of Wisconsin-Madison.
Direct questions and comments about this article to Michael Ferdman, Stony Brook University, 1419 Computer Science, Stony Brook, NY 11794; mferdman@cs.stonybrook.edu.
Tushar Krishna
Chia-Hsin Owen Chen
Woo-Cheol Kwon
Li-Shiuan Peh
Massachusetts Institute of
Technology
Background
..............................................................................................................................................................................................
with multiple clock domains across a die,6 where each hop can incur
significant synchronization delay. They aim to remove this synchronization delay. This leads them to propose sending a clock signal with
the data so that the data can be latched correctly at the destination
router. However, unlike Smart, bypass and buffer modes cannot be
switched cycle by cycle, and flits must be speculatively latched at
every hop.
References
1. W. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann, 2003.
2. J. Kim et al., "Flattened Butterfly Topology for On-Chip Networks," Proc. 40th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2007, pp. 172-182.
3. Y.-H. Kao et al., "CNoC: High-Radix Clos Network-on-Chip," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 30, no. 12, 2011, pp. 1897-1910.
4. J. Kim et al., "Microarchitecture of a High-Radix Router," Proc. 32nd Ann. Int'l Symp. Computer Architecture (ISCA 05), 2005, pp. 420-431.
5. T. Bjerregaard and J. Sparso, "A Router Architecture for Connection-Oriented Service Guarantees in the MANGO Clockless Network-on-Chip," Proc. Conf. Design, Automation and Test in Europe (DATE 05), 2005, pp. 1226-1231.
6. T.N.K. Jain et al., "Asynchronous Bypass Channels: Improving Performance for Multisynchronous NoCs," Proc. 4th ACM/IEEE Int'l Symp. Networks-on-Chip (NOCS 10), 2010, pp. 51-58.
[Figure: router datapath with bypass paths, and the flit pipeline (RC*, SA, VS*, ST+LT) across consecutive routers.]
[Figure: interconnect energy (fJ/bit/mm) versus length/period (mm/ns) for a clocked driver, 45-nm place-and-route results, and projected 45-, 32-, and 22-nm designs.]
Figure 3. Changes to the router and pipeline to support single-cycle multihop traversals. Smart router microarchitecture (a) and pipeline (b). BWena, BMsel, and XBsel are set up during the control path (SSR + SA-G). During the datapath (ST + LT), the flit can cross multiple routers in a cycle if BWena is 0 and BMsel is set to bypass, and it gets latched at the router where BWena is 1.
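To make the bypass decision concrete, here is a minimal sketch of the datapath behavior the caption describes, assuming the control signals were already set up by SA-G in the previous cycle; the dictionary-based router model is ours, not the authors' hardware.

```python
# Minimal sketch: a flit crosses every router whose buffer-write enable
# (BWena) is 0 and whose bypass mux (BMsel) selects the incoming link,
# and is latched at the first router whose BWena is 1.
def traverse(routers, flit):
    path = []
    for r in routers:
        path.append(r["name"])
        if r["BWena"]:                    # stop here: latch into input buffer
            r["buffer"] = flit
            break
        assert r["BMsel"] == "bypass"     # otherwise SA-G set up a bypass
    return path

routers = [{"name": "R%d" % i, "BWena": i == 3, "BMsel": "bypass",
            "buffer": None} for i in range(6)]
print(traverse(routers, "FlitD"))         # ['R0', 'R1', 'R2', 'R3']
```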
Figure 4. k-ary 1-mesh with dedicated Smart-hop setup request (SSR) links going up to HPCmax (3 in this example) hops in each direction. The switch allocation local (SA-L) grant sends SSRs. The switch allocation global (SA-G) unit sets BWena, BMsel, and XBsel based on the SSRs from 0-hop, 1-hop, 2-hop, and 3-hop neighbors.
Figure 5. Smart example: no SSR conflict. Cycle 1: FlitD (assumed to have won SA-L in Cycle 0) sends SSRD = 2, that is, a request to bypass R3 and stop at R4. The SA-G units at R2, R3, and R4 set up a single-cycle multihop bypass path. Cycle 2: FlitD starts at R2 and gets latched at R4.
Figure 6. Smart example: SSR conflict with PrioLocal. The PrioLocal scheme gives highest priority to the local (buffered)
flit, then the flit from the neighboring router, followed by the flit from the router two hops away, and so on. FlitE from R0 is
prematurely stopped at R2, before its intended destination R3, to allow R2 to send its own local FlitD on its East output link.
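The arbitration that produces this outcome can be sketched in a few lines; representing SSRs as (hops away, flit) pairs is our simplification of the SA-G logic, not the actual circuit.

```python
# Minimal sketch of SA-G arbitration for one output port under PrioLocal:
# the SSR from the fewest hops away wins (0 = the router's own local flit);
# every losing flit is prematurely stopped (buffered) at this router.
def sa_g_priolocal(ssrs):
    winner_hops, winner = min(ssrs, key=lambda s: s[0])
    stopped = [f for h, f in ssrs if f != winner]
    return winner, stopped

# At R2 in Figure 6: local FlitD (0 hops) competes with FlitE's SSR from R0.
winner, stopped = sa_g_priolocal([(0, "FlitD"), (2, "FlitE")])
print(winner, stopped)  # FlitD ['FlitE'] -> FlitE is stopped at R2
```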
is thus used in addition to BWena when deciding whether to buffer). Unlike false positives, this is not a correctness issue but rather a performance (throughput) issue, because some links go idle when they could have been used by other flits if more global information were available.
Ordering
In Smart, any flit can be prematurely stopped on the basis of the interaction of SSRs that cycle. We must ensure that this does not result in reordering between flits of the same packet, or between flits from the same source (if point-to-point ordering is required in the coherence protocol).

The first constraint is in routing (relevant to 2D topologies). Multiflit packets and point-to-point ordered virtual networks should use only deterministic routes, to ensure that prematurely buffered flits do not end up choosing alternate routes while bypassing flits continue on the old route.

The second constraint is in SA-G priority. Every input port has a bit to track whether there is a prematurely stopped flit among its buffered flits. When an SSR is received at an input port, and there is either a prematurely buffered head/body flit or a prematurely buffered flit within a point-to-point ordered virtual network, the incoming flit is stopped.
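A minimal sketch of that check follows; the flit fields are our names for the state the text describes, not the actual hardware signals.

```python
# Minimal sketch of the SA-G ordering check at an input port: an arriving
# flit is forced to stop if the port holds a prematurely stopped flit that
# it must not pass.
from collections import namedtuple

Flit = namedtuple("Flit", "kind premature ordered_vnet")

def incoming_flit_stops(buffered_flits):
    return any(f.premature and (f.kind in ("head", "body") or f.ordered_vnet)
               for f in buffered_flits)

port = [Flit(kind="head", premature=True, ordered_vnet=False)]
print(incoming_flit_stops(port))  # True: the arriving flit is buffered here
```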
In a conventional network, a router's output port tracks the IDs of all free VCs at the neighbor's input port. A buffered head flit chooses a free VC ID for its next router (neighbor) before it leaves the router. The neighbor signals back when that VC ID becomes free. In a Smart network, the challenge is that the next router could be any router that can be reached within a cycle. A flit at a start router choosing the VC ID before it leaves will not work, because it is not guaranteed to reach its presumed next router, and multiple flits at different start routers might end up choosing the same VC ID. Instead, we let the VC selection occur at the stop router. Every Smart router receives 1 bit from each neighbor to signal whether at least one VC is free. (If the router has multiple virtual networks, or vnets, for the coherence protocol, we need a 1-bit free VC signal per vnet.)
Additional optimizations
We optimize Smart further to push it toward the ideal (TN = 1) NoC.

Bypassing the destination router. So far, we have assumed that a flit starting at an injection router traverses one (or more) Smart-hops until it reaches the destination router, where it gets buffered and requests the Cout port. We add an extra ejection bit in the SSR to indicate whether the requested stop router corresponds to the destination router for the packet, and not to some intermediate router on the route. If a router receives an SSR from H hops away with value H (that is, a request to stop there), H < HPCmax, and the ejection bit is high, it arbitrates for the Cout port during SA-G. If it loses, BWena is made high.
Bypassing SA-L at low load. We add no-load bypassing1 to the Smart router. If a flit comes into a router with an empty input port and no SA-L winner for its output port that cycle, it sends SSRs directly, in parallel with getting buffered, without having to go through SA-L. This reduces tr at lightly loaded start routers to two cycles instead of three, as shown in Figure 3b for Routern+i.

With both ejection and no-load bypass enabled, if HPCmax is larger than the maximum hops in any route, a flit will only spend two cycles in the entire Smart network in the best case (one cycle for SSR and one for ST + LT all the way to the destination NIC). SA-G remains identical to our earlier description. Each output port has multidrop SSR wires spanning up to HPCmax routers along that dimension. Each input port of a router receives HPCmax sets of SSR wires, one from each router. The SSR requests a stop or a bypass along that dimension. Flits with turning routes perform their traversal one dimension at a time, trying to bypass as many routers as possible and stopping at the turn routers.
Figure 7. Smart_2D: SSRs and their SA-G priorities. k-ary 2-mesh with SSR wires from the shaded start router (a). Conflict between two SSRs for the Nout port (b). Fixed priority at the Nout port of an intermediate router (c). Fixed priority at the Sin port of an intermediate router (d).
Smart implementation
Evaluation
We use the GEMS12 and Garnet13 infrastructure for all our evaluations.
Synthetic traffic
We start by running Smart with synthetic traffic patterns. We inject one-flit packets to first understand the benefits of Smart without secondary effects due to flit serialization and VC allocation across multiple routers. For the same reason, we also provide enough VCs (12, derived empirically) to allow both the baseline and Smart to be limited by links, rather than by VCs, for throughput.

Smart across different traffic patterns. Figure 8 compares the performance of three Smart designs against the baseline and ideal: Smart-8_1D and Smart-8_2D, which are both achievable designs, and Smart-15_2D, which reflects the best that Smart can do in an 8 × 8 mesh (with maximum possible hops = 15). The striking feature about Smart is that it pushes low-load latency down to four and two cycles for Smart_1D and Smart_2D, respectively, across all traffic patterns, unlike the baseline, for which low-load latency is a function of the average hop count. Thus, Smart truly breaks the locality barrier. Smart-8_2D achieves most of the benefit of Smart-15_2D for all patterns except Bit Complement (BC), since average hop counts are 8 for an 8 × 8 mesh.

Impact of HPCmax. Next, we study the impact of HPCmax on performance. We plot the average flit latency for BC traffic (which has high across-chip communication) for HPCmax from 1 to 12, across 1D and 2D, in Figure 9. Smart-1_1D is identical to the baseline (tr = 1) network (as it does not need SA-G). We make two key observations. First, at an HPCmax of 8, Smart shows a 5.4 times reduction in latency; in other words, a 1-GHz Smart NoC can be beaten by an NoC with one-cycle routers only if the latter runs at 5.4 GHz.
[Figure 8: average flit latency (cycles) versus injection rate for Baseline (tr = 1), Smart-8_1D, Smart-8_2D, Smart-15_2D, and Ideal (TN = 1) under three traffic patterns.]
[Figure 9: average flit latency versus injection rate for Smart-2_1D, Smart-4_1D, Smart-8_1D, Smart-4_2D, Smart-8_2D, and Smart-12_2D.]
Full-system traffic
Figure 10. Full-system application runtime with Smart, normalized to the runtime with the baseline (tr = 1). Private L2 cache per tile (a). Shared L2 cache slice per tile (b). In Shared L2, L1 and L2 misses traverse the network to a remote node, making on-chip network latency more critical than in Private L2.
Acknowledgments
We thank Sunghyun Park from the Massachusetts Institute of Technology and Michael
Pellauer from Intel for useful insights on the
interconnect and pipeline. We acknowledge
the support of DARPA UHPC, SMART
LEES, and MARCO C-FAR.
References
2. A. Kumar et al., "A 4.6Tbits/s 3.6GHz Single-Cycle NoC Router with a Novel Switch Allocator in 65nm CMOS," Proc. 25th Int'l Conf. Computer Design, 2007, pp. 63-70.
3. R. Mullins et al., "Low-Latency Virtual-Channel Routers for On-Chip Networks," Proc. 31st Ann. Int'l Symp. Computer Architecture, 2004, pp. 188-197.
4. H. Matsutani et al., "Prediction Router: Yet Another Low Latency On-Chip Router Architecture," Proc. IEEE 15th Int'l Symp. High Performance Computer Architecture (HPCA 09), 2009, pp. 367-378.
7. Y. Hoskote et al., "A 5-GHz Mesh Interconnect for a Teraflops Processor," IEEE Micro, vol. 27, no. 5, 2007, pp. 51-61.
8. J. Howard et al., "A 48-core IA-32 Message-Passing Processor with DVFS in 45nm CMOS," Proc. IEEE Int'l Solid-State Circuits Conf., 2010, pp. 108-109.
9. B. Kim and V. Stojanovic, "Equalized Interconnects for On-Chip Networks: Modeling and Optimization Framework," Proc. IEEE/ACM Int'l Conf. Computer-Aided Design.
Chia-Hsin Owen Chen is a doctoral candidate in the Department of Electrical Engineering and Computer Science at the
Massachusetts Institute of Technology. His
research interests include system-level power
and performance modeling and analysis,
and on-chip networks. Chen has an SM
in electrical engineering and computer
science from the Massachusetts Institute of
Technology.
11. …, "… Networks-on-Chip Modeling," Proc. IEEE/ACM 6th Int'l Symp. Networks-on-Chip, 2012, pp. 201-210.
12. M.M.K. Martin et al., "Multifacet's General Execution-Driven Multiprocessor Simulator (GEMS) Toolset," ACM SIGARCH Computer Architecture News, vol. 33, no. 4, 2005, pp. 92-99.
Li-Shiuan Peh is a professor in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute
of Technology. Her research focuses on networked computing in many-core chips and
mobile wireless systems. Peh has a PhD in
computer science from Stanford University.
She is a member of IEEE and the ACM.
As multicore processors find increasing adoption in domains such as aerospace and medical devices, where failures have the potential to be catastrophic, strong performance isolation and security become first-class design constraints. When cores are used to run separate pieces of the system, strong time and space partitioning can help provide such guarantees. However, as the number of partitions or the asymmetry in partition bandwidth allocations grows, the additional latency incurred by time multiplexing the network can significantly impact performance.

The difficulty in designing such strong separation functionality into typical networks on chip (NoCs) is that they have many internal resources that are shared between packets from different domains, which we would otherwise wish to keep separate. These resources include the buffers holding the packets, the crossbar switches, and the individual ports and channels. Such resource contention introduces interference between these different domains, which can create a performance impact on some flows, pose a security threat by creating an opportunity for timing channels,1 and generally complicate the final verification and certification process.
is created. As these waves traverse the network, they provide an opportunity for packets of the corresponding domain to travel unimpeded along with these waves (thus avoiding excessive latency), while still requiring no dynamic scheduling between domains (thus preventing timing corruption or information leakage). Channels in the same dimension and direction propagate the different domains such that, after passing through the pipeline of a router, the channel can forward a packet coming from the same dimension and domain without any additional wait (unless there is contention from packets of the same domain). In this way, packets surf the waves in each dimension. We identify the many potential challenges of achieving noninterference in a modern NoC router microarchitecture using gate-level analysis, discuss the details and ramifications of our surf-scheduling methodology, and demonstrate that our approach truly does not allow even cycle-level cross-domain interference. (For information on previous research, see the "Related Work in Noninterference" sidebar.)
SurfNoC scheduling
The straightforward way to support time-division multiplexing (TDM) is to operate the whole network in time slices that are divided between application domains. That is, a packet waits at each hop until the network begins forwarding packets from its domain. This approach leads to a zero-load latency T0 that grows with the number of application domains D, the pipeline depth P, and the number of hops H, as shown in Equation 1:

T0 = HP + H(D - 1)    (1)
With surf scheduling, different routers (in fact, different ports of the same router) can forward packets from different domains in the same cycle. In this schedule, a packet waits until it can be forwarded in one dimension (that is, until its output channel is forwarding packets from its domain in this cycle) and then does not experience any wait at any downstream router in this dimension (assuming there is no contention from packets of the same domain). After finishing the first dimension, the packet might experience another wait until it can be forwarded in the next dimension. We call this schedule surf scheduling, because a packet is like a surfer who waits to ride a wave to some location and then waits to ride another wave. Equation 2 gives the maximum zero-load latency and shows that the overhead is additive, not multiplicative as in the straightforward approach:

T0max = HP + (n - 1 + 2)(D - 1)    (2)

The term (n - 1 + 2) comes from the (n - 1) transitions between dimensions, where n is the number of dimensions, plus the two waits during injection and ejection. Note that this is the maximum wait, not the typical one, because the schedule might require less waiting.
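As a worked check of the two equations, consider an illustrative network with H = 8 hops, P = 3 pipeline stages, D = 4 domains, and n = 2 dimensions; the numbers are ours, chosen only for the example.

```python
# Zero-load latency under straightforward TDM (Equation 1) versus the
# worst case under surf scheduling (Equation 2).
def t0_tdma(H, P, D):
    return H * P + H * (D - 1)            # wait up to D-1 slots at every hop

def t0_surf_max(H, P, D, n):
    # waits only at the (n - 1) dimension turns plus injection and ejection
    return H * P + (n - 1 + 2) * (D - 1)

H, P, D, n = 8, 3, 4, 2
print(t0_tdma(H, P, D))         # 48 cycles: the domain wait scales with hops
print(t0_surf_max(H, P, D, n))  # 33 cycles: the wait term is additive
```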
The way to implement these different waves is by scheduling different directions in a router independently, an idea inspired by the dimension-slicing used in dimension-ordered routing in meshes and tori. We use what we call direction-slicing of the pipelines, such that each direction has its own pipeline. This pipeline is a virtual one going through different routers (not within the same router). We describe this idea in the case of a 2D mesh or torus.
In a 2D mesh or torus, each dimension has two directions (east and west for the x-dimension; north and south for the y-dimension). The pipelines of the two directions of the same dimension (north and south, or east and west) run in opposite directions, as Figure 1 shows. In this technique, each port of a router is scheduled independently of all other ports in a pipelined way, such that the downstream router in the same direction will forward packets from the same domain after P cycles, where P is the router's pipeline depth. These schedules are imposed on each router's ports.
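The wave itself can be captured by a one-line schedule: the port at hop h along a direction forwards domain (t - hP) mod D at cycle t, so a packet that catches its domain's slot at hop 0 arrives at every downstream port exactly when its turn comes up again. The following is a sketch under uniform slot allocation (the Figure 1 schedule instead gives one domain two slots); the function is ours.

```python
# Minimal sketch of a surf wave along one direction of a dimension.
def domain_at(hop, t, P, D):
    return (t - hop * P) % D   # domain forwarded by the port at this hop

P, D = 2, 3                    # pipeline depth, number of domains
for t in range(6):
    print(t, [domain_at(h, t, P, D) for h in range(4)])
# A domain-0 packet forwarded at hop 0 at t = 0 reaches hop h at t = h*P,
# exactly when that port forwards domain 0 again: it rides the wave.
```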
Noninterference in NoCs
Noninterference in NoCs has been studied in the system-on-chip domain to provide composability and fault containment, as well as predictability of latency for real-time performance guarantees.15,16 Composability means that the system can be analyzed as a set of independent components, which allows for easier verification of the complete system.
References
1. O. Aciicmez, "Yet Another MicroArchitectural Attack: Exploiting I-Cache," Proc. 2007 ACM Workshop Computer Security Architecture (CSAW 07), 2007, pp. 11-18.
…, "… Networks," Proc. 35th Ann. Int'l Symp. Computer Architecture (ISCA 08), 2008, pp. 89-100.
5. O. Aciicmez, C.K. Koc, and J.-P. Seifert, "On the Power of …"
…, "Network on Chip: Concepts, Architectures, and Implementations," IEEE Design & Test of Computers, vol. 22, no. 5, pp. 493-504.
…, "… On-Chip Networks," Proc. 6th IEEE/ACM Int'l Symp. Networks on Chip (NoCS 12), 2012, pp. 142-151.
Router microarchitecture
- Ensuring timing-channel-free contention between packets; that is, contention can occur between packets from the same domain but not between packets from different domains.
- Scheduling the output channels of each router in a way that maintains …
Figure 1. Surf scheduling in a 16-node 2D mesh with three application domains (denoted by white, gray, and black), assuming single-cycle routers for illustration purposes. The schedule runs as white, white, gray, and black and repeats, giving the white domain half the bandwidth. A packet (the white box under the node S) belongs to the white domain and is sent from the node marked S to the node marked R. The figure contains six consecutive cycles. At T = 1, the packet is forwarded on the S port in the y-dimension (which is scheduled to forward white packets). It keeps moving in the y-dimension until T = 3, when it needs to move in the x-dimension on the W port. The packet waits two cycles (T = 4 and T = 5) until it is the white domain's turn on the W port, and finally it is forwarded to its destination at T = 6. Another wait may happen in the destination router (R) to forward the packet on the ejection port, waiting for the white domain's turn.
Noninterference verification
[Table 1: simulation parameters for the evaluated schemes (Baseline-small, Baseline-fast, TDMA, Surf), including buffers per VC, input speedup, a four-cycle router delay, separable (input-first) switch and VC allocators, and dimension-ordered routing; VC counts vary per Table 2. Table 2: configurations used for different numbers of domains.]
Evaluation
We evaluate the performance of our SurfNoC scheme and compare the area and power overhead to a mesh network without noninterference support. A more detailed evaluation can be found in the original paper.10

Experimental setup
We implemented a model of the SurfNoC router in BookSim 2.0,11 a cycle-level interconnection network simulator. The simulator is warmed up until steady state is reached and statistics are reset; then a sample of packets is measured from the time each packet enters the source queue until it is received. For latency measurements, the simulation runs until all packets under measurement leave the network. Table 1 lists the simulation parameters used for the different schemes. We evaluated four schemes: two that do not provide separation guarantees, and two that support strong separation. The nonseparation baselines are an input-queued router with minimal resources, which achieves almost 40 percent saturation throughput (baseline-small), and a similar router that has many more resources (buffers and input speedup in the crossbar switch), which we call baseline-fast. We used two baselines because the separation-supporting routers include more resources and would achieve more throughput than a baseline with minimal area, which would hide the throughput lost to static scheduling. The noninterference-supporting schemes are a straightforward time-division multiple access (TDMA) scheme, where the whole network forwards packets from the same domain, and an input-queued router that enforces the surf schedule (Surf). Table 2 shows the different configurations used for different numbers of domains for Surf and TDMA.
[Figure 2: zero-load latency (cycles) versus number of domains (2 to 32) for Baseline-small, Baseline-fast, TDMA, and Surf, on a 64-node network (a) and a 256-node network (b).]
Impact on latency
We first examine the impact of our noninterference support on latency for different numbers of domains and nodes under the uniform random traffic pattern. To understand the effect of TDM on the channels, we measure zero-load latency (latency at an offered load of 0.1 percent of capacity for only one domain) and plot it for different numbers of domains in Figure 2. In this figure, we plot latency in cycles (y-axis) versus the number of domains (x-axis) for network sizes of 64 nodes (Figure 2a) and 256 nodes (Figure 2b).
[Figures 3 and 4: latency versus number of nodes (50 to 250), and latency versus offered load (0 to 0.6), for Baseline-small, Baseline-fast, TDMA, and Surf.]
In Figure 4, although we can see that saturation throughput is reduced by about 11.7 percent, the aggregate throughput loss is limited to 4.9 percent for two domains. The noninterference configurations have a higher saturation throughput than the small baseline because they use more resources, and a lower one than the fast baseline, which includes the same resources, because schedule enforcement leaves some time slots unused.

To verify the benefits of assigning bandwidth nonuniformly, we performed an experiment on a 2D mesh network with 64 nodes and three domains. Bandwidth (VCs and time slots in the schedule) is assigned as follows: a quarter of the bandwidth to domain 0 and domain 1 each, and half of the bandwidth to domain 2. This nonuniform allocation is done by devising a schedule with four slots and assigning domain 3's time slots to domain 2. Saturation throughput, as expected, is 0.09 for both domains 0 and 1, and 0.21 for domain 2. Latency at a 5 percent injection rate is 36 (53) cycles for domain 2 and 39 (53) cycles for domains 0 and 1 using surf scheduling (straightforward TDMA). This shows that our scheme can deliver both latency and throughput benefits through a nonuniform surf schedule.
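The nonuniform schedule above amounts to a four-slot table in which domain 3's slot is handed to domain 2; a minimal sketch (the slot table is ours) shows the resulting bandwidth shares.

```python
# Four TDM slots; domain 3's slot is reassigned to domain 2.
schedule = [0, 1, 2, 2]

def bandwidth_share(schedule):
    shares = {}
    for d in schedule:
        shares[d] = shares.get(d, 0) + 1.0 / len(schedule)
    return shares

print(bandwidth_share(schedule))  # {0: 0.25, 1: 0.25, 2: 0.5}
```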
References
…, "Strict and Provable Information Flow …," Proc. Int'l Conf. Computer …; digitalcommons.usu.edu/smallsat/2012/all2012/10/.
…, "… Model for Resource-Constrained and Dynamic Space-Based Computing Environments," Proc. 16th IEEE Int'l Symp. Object/Component/Service-Oriented Real-Time Distributed Computing, 2013; www.isis.vanderbilt.edu/node/4552.
Inderpreet Singh
Qualcomm
Arrvindh Shriraman
Simon Fraser University
Wilson W.L. Fung
University of British Columbia
Mike O'Connor
AMD Research
Tor M. Aamodt
University of British Columbia
[Figure: baseline GPU architecture. Each GPU core contains reconvergence stacks, wavefront schedulers, a register file, ALUs, a local memory store, a coalescing unit, MSHRs, and an L1 data cache; an interconnection network links the cores to memory partitions, each with an L2 cache bank, an atomic operation unit, a memory controller, and off-chip GDDR.]
GPU architecture
GPU coherence
In recent GPUs, such as Nvidia's Fermi series and the Advanced Micro Devices (AMD) Southern Islands series, the noncoherent private L1 caches can exploit memory access locality in GPU applications without requiring the programmer to explicitly manage data transfer to and from software-managed scratchpad memory. However, these noncoherent caches might contain stale versions of global data; this can introduce errors in applications that expect threads to communicate updated data via a coherent memory system. Disabling these L1 caches provides trivial coherence for these applications, but at a cost in performance and energy efficiency. Figure 2a shows the significant performance drawback of this naive solution. The performance of our trivial cache-disabled GPU (NO-L1) is 88 percent worse than a GPU with an idealized coherence protocol (IDEAL-COH). In other words, by enabling private caching for this set of GPU applications, hardware coherence lets us improve performance by 88 percent.

Implementation challenges
Because coherent L1 caches give GPUs a clear performance benefit, the next question is: Can existing CPU cache-coherence protocols be equally effective on GPUs? We now discuss three significant overheads such protocols face on GPUs: coherence traffic, storage requirements, and protocol complexity.
Coherence traffic

Storage requirements
The massively parallel GPU memory system can also introduce significant storage overhead for buffering in-flight coherence requests.
[Figure 2: performance for applications that require coherence (a) and interconnect traffic for applications that don't require coherence (b), for the NO-L1, NO-COH, IDEAL-COH, GPU-VI, and MESI configurations; one MESI bar reaches 2.27.]
Protocol complexity
Coherence protocols are highly complex, requiring numerous states and transitions.4 Many of these states are transient states added to the protocols to guard against potential protocol races between accesses to the same cache block. For example, the MESI protocol must handle two cores requesting exclusive state for the same cache block.
[Table: numbers of stable, transient cache, and transient coherent states in the L1 and L2 caches for the Noncoherent, GPU-VI, GPU-VIni, MESI, and TC-Weak protocols.]
Temporal coherence
Temporal coherence (TC) exploits the insight that single-chip systems can implement synchronized counters to enable low-cost transfer of coherence information. Specifically, if the lifetime of a memory address's current epoch can be predicted and shared among all readers when the location is read, then these counters allow the readers to self-invalidate synchronously, eliminating the need for end-of-epoch invalidation messages. Compared to traditional CPU cache coherence, TC requires fewer modifications to GPU hardware and enables greater memory-level parallelism.
Figure 3 compares the invalidation handling of the GPU-VI directory protocol and TC. The figure shows a read by processors C1 and C2, followed by a store from C1, all to the same memory location. Figure 3a shows the sequence of events for the write-through GPU-VI directory protocol. Processor C1 issues a load request to the directory (1) and receives data. Processor C2 issues a load request (2) and receives the data as well. C1 then issues a store request (3). The directory, which stores an exact list of sharers, sees that C2 must be invalidated before the write can complete and sends an invalidation request to C2 (4). C2 receives the invalidation request, invalidates the block in its private cache, and sends an acknowledgment back (5). The directory receives the invalidation acknowledgment from C2 (6), completes C1's store request, and sends C1 an acknowledgment (7).
Figure 3b shows how TC handles the invalidation for this example. When C1 issues a load request to the L2 cache, it predicts that the read-only epoch for this address will end at time T = 15 (1'). The L2 receives C1's load request and epoch lifetime prediction, records the timestamp, and responds with the data (2'). C2's later load extends the timestamp to T = 20 (3', 4'). When their timestamps expire, C1 and C2 self-invalidate their copies (5' through 7') without any invalidation messages, so C1's subsequent store (8') completes at the L2 without contacting other cores.
Figure 3. Coherence invalidation mechanisms. The sequence of events for the write-through GPU-VI directory protocol (a). Temporal Coherence (TC) invalidation handling for the same example (b). (R = read, D = data, W = write, and Inv = invalidation.)
Figure 5. Hardware extensions for the TC-Weak implementation. GPU cores and memory partitions with synchronized counters; a Global Write Completion Time (GWCT) table is added to each GPU core (a). L1 and L2 cache lines are extended with timestamp fields (b).
TC-Strong coherence
TC-Strong implements release consistency with write atomicity.6 With TC-Strong, each GPU core has a private, write-through L1 cache, and the cores share a write-back L2 cache. It requires the synchronized timestamp counters at the GPU cores and L2 controllers shown in Figure 5a to provide the components with the current system time. A small timestamp field is added to each cache line in the L1 and L2 caches, as Figure 5b shows. The local timestamp value in the L1 cache line indicates the time until which the particular cache line is valid. An L1 cache line with a local timestamp less than the current system time is invalid. The global timestamp value in the L2 indicates a time by which all L1 caches will have self-invalidated this cache line.
Every load request checks both the tag and the local timestamp of the L1 line. It treats a valid tag match that has an expired local timestamp as a miss; self-invalidating an L1 block doesn't require explicit events. A load miss at the L1 generates a request to the L2 with a lifetime prediction. The L2 controller updates the global timestamp to the maximum of the current global timestamp and the requested local timestamp to accommodate the time period requested. The L2 responds to the L1 with the data and the global timestamp. The L1 updates its data and local timestamp with the values in the response message before completing the load. A store request writes through to the L2, where its completion is delayed until the global timestamp has expired.
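The following minimal sketch summarizes this timestamp handling. It is an illustration of the mechanism just described, not the authors' hardware; the structures, names, and the caller-supplied lifetime prediction are assumptions.

# Minimal sketch of TC-Strong timestamp handling (illustrative structures;
# the lifetime prediction is supplied by the caller).

class L1Line:
    def __init__(self):
        self.tag, self.data, self.local_ts = None, None, -1

class L2Line:
    def __init__(self, data):
        self.data, self.global_ts = data, -1

def l1_load(line, tag, l2_line, now, predicted_lifetime):
    # A valid tag match with an expired local timestamp counts as a miss,
    # so self-invalidation needs no explicit event.
    if line.tag == tag and line.local_ts >= now:
        return line.data                                  # hit
    # Miss: the L2 extends the global timestamp to cover the requested
    # period and responds with the data and the global timestamp.
    l2_line.global_ts = max(l2_line.global_ts, now + predicted_lifetime)
    line.tag, line.data, line.local_ts = tag, l2_line.data, l2_line.global_ts
    return line.data

def l2_store(l2_line, value, now):
    # TC-Strong: a write-through store completes only after the global
    # timestamp expires, that is, once all L1 copies have self-invalidated.
    completion_time = max(now, l2_line.global_ts)
    l2_line.data = value
    return completion_time

# Example: a load predicting a 15-cycle lifetime, then a store at time 5
# that cannot complete until the read-only epoch ends at time 15.
l1, l2 = L1Line(), L2Line(data=42)
l1_load(l1, tag=0xA, l2_line=l2, now=0, predicted_lifetime=15)
print(l2_store(l2, value=43, now=5))  # prints 15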
Core C1:              Core C2:
S1: data = NEW        L1: r1 = flag
F1: FENCE             B1: if (r1 != SET) goto L1
S2: flag = SET        L2: r2 = data
[Figure 6. TC-Strong and TC-Weak operation for the code in (a): C1's memory requests and the state of flag and data in C2's L1 under TC-Strong (b) and TC-Weak (c). Shading marks write stalling at the L2 (TC-Strong), the fence waiting for pending requests (both), and the fence waiting for the GWCT (TC-Weak).]
Figure 6b shows how TC-Strong maintains coherence. The example code snippet4 in Figure 6a represents a common programming idiom used to implement nonblocking queues in pipeline parallel applications.7 Figure 6b shows the memory requests generated by core C1 on the left, and the state of the two memory locations, flag and data, in C2's L1 on the right. Initially, C2 holds flag and data in its L1 with local timestamps of 60 and 30, respectively.
TC-Weak coherence
Evaluation
We modeled a cache-coherent GPU architecture by extending GPGPU-Sim version 3.1.2 with the Ruby memory system model from the General Execution-Driven Multiprocessor Simulator (GEMS).10 The baseline noncoherent memory system and all
Figure 7. Performance of GPU memory systems. Coherent protocols compared with a baseline GPU with L1 caches disabled (NO-L1) (a). The same protocols compared against a noncoherent baseline with L1 caches enabled (NO-COH) (b). (HM = harmonic mean.)
communication that run correctly with noncoherent private L1 caches. Our other paper offers details on our simulation configuration and benchmarks.2
Figure 7a compares the performance of coherence protocols against a baseline GPU with L1 caches disabled (NO-L1) for applications with inter-workgroup communication. Figure 7b compares them against the noncoherent baseline protocol with L1 caches enabled (NO-COH) for applications with intra-workgroup communication. Figures 8a and 8b show the breakdown of interconnect traffic between the different coherence protocols.
Figure 8. Breakdown of interconnect traffic for coherent and noncoherent GPU memory systems. Inter-workgroup
communication (a); intra-workgroup communication (b).
Acknowledgments
We thank Mark Hill, Hadi Jooybar,
Timothy Rogers, and the anonymous
reviewers for their invaluable comments.
This work was partly supported by funding
from the Natural Sciences and Engineering
Research Council of Canada and Advanced
Micro Devices.
Inderpreet Singh is an engineer at Qualcomm. His research interests include memory models and GPU computing. Singh has
... DRAM cache architectures. The proposed approach can provide varying levels of protection, from fine-grained single-bit upsets to coarser-grained faults, within the constraints of commodity non-error-correcting code DRAM stacks.
Jaewoong Sim
Georgia Institute of Technology

Gabriel H. Loh
Vilas Sridharan
Mike O'Connor
Advanced Micro Devices
Figure 1. A DRAM bank with 2-Kbyte row size. When used as a cache, the row can be organized as a single 29-way set-associative set (3 × 64 bytes holding 29 tag entries, plus 29 × 64-byte data blocks, top) or as 28 individual direct-mapped tag-plus-data sets (bottom).
Figure 2. Providing simple error correction for a DRAM cache. Contents of one tag entry and one 64-byte data block,
along with single-error correction, double-error detection (SECDED) error-correcting codes (ECC) for each, respectively (a).
Contents of a 2-Kbyte DRAM row, with eight tag entries packed into a 64-byte block and the corresponding eight data
blocks following (b). Timing diagram for reading a 64-byte cache block (c).
Figure 3. Providing strong, multibit error detection for a DRAM cache. Contents of one tag entry and one 64-byte data block (treated as two 32-byte chunks for protection purposes), along with SEC ECC and cyclic redundancy check (CRC) codes (a). Contents of a 2-Kbyte DRAM row, with four tag entries (tag + SEC + CRC) packed into a 64-byte block and followed by the corresponding four data blocks (b). Timing diagram for reading a 64-byte cache block (c).
Although the CRCs do not extend the DRAM cache's error-correction capabilities, they greatly increase its error-detection capability, thereby drastically reducing silent data corruption (SDC) rates.
Figure 3a shows the layout of the tag blocks including CRCs. Here, we use only SEC ECCs (not SECDED); the CRCs provide multibit error detection, so the parity bit for double-error detection is not needed. We divide the 64-byte data into two 32-byte protection regions, each covered by its own SEC and CRC codes, which allows up to two errors to be corrected if each error occurs in a separate 32-byte region.
The storage requirement for the original tag plus the SEC and CRC codes is 112 bits. Therefore, tag information for four cache blocks can be placed in a 64-byte block (4 × 112 bits = 448 bits, or 56 bytes). Figure 3b shows the overall layout for a 2-Kbyte DRAM row, with each 64-byte tag block containing four tags (including SEC/CRC) followed by the corresponding four data blocks.
Figure 4. Handling row decoder faults. A row decoder error that selects the incorrect row is undetectable using within-row ECC (a). Process for folding in the row index (row 110000₂) (b), and usage of the folded row index to detect a row-decoder error (c).
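A short sketch makes the folding mechanics concrete. It is an illustration under assumed word and index widths, with the ECC/CRC check left as a caller-supplied function; it is not the paper's implementation.

# Sketch of row-index folding (Figure 4). Widths are illustrative, and
# ecc_check stands in for the SEC/CRC verification logic.

def replicate(row_index, index_width, total_width):
    # Tile copies of the row index across a total_width-bit word.
    word = 0
    for i in range((total_width + index_width - 1) // index_width):
        word |= row_index << (i * index_width)
    return word & ((1 << total_width) - 1)

def fold(codeword, row_index, index_width, width):
    # Write path: XOR n copies of the row index into the data + ECC bits.
    return codeword ^ replicate(row_index, index_width, width)

def read_and_check(stored, requested_row, index_width, width, ecc_check):
    # Read path: XOR with the requested row index. If the decoder selected
    # the correct row, the indices cancel and the ECC check sees the
    # original codeword; if it selected a different row, the residual index
    # bits appear as a gross multibit error instead of silently passing.
    return ecc_check(stored ^ replicate(requested_row, index_width, width))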
Duplicate on write
Experimental results
The inclusion of additional error-detecting and -correcting codes, and of duplicates of modified blocks in the DRAM cache, reduces the effective capacity of the DRAM cache. Checking the ECC, computing new codes,
[Figure: execution time of ECC and ECC+CRC configurations normalized to a DRAM cache with no RAS support, across workloads.]

Methodology
Figure 5. Example DRAM cache contents in which clean data are backed up by main memory, but dirty data are duplicated into other banks. If corrupted, blocks A through D can be refetched from main memory, whereas modified blocks X and Y rely on in-cache duplicates X′ and Y′ for error recovery.
[Figure: performance of ECC+CRC, ECC+CRC+RAID-1, and ECC+CRC+DOW normalized to no RAS, across workloads WL1 through WL12 and their average.]
[Table: percentage of faults corrected and detected under No RAS, ECC-only, ECC+CRC, and DOW for single-bit, column, row, and bank faults; ECC+CRC and DOW detect 99.9993 percent of the larger-scale faults that ECC-only handles poorly, and the table also reports the resulting ranges of silent-data-corruption rates.]
scheme retains much of the overall performance benefit of having a DRAM cache (on average, costing only 2.5 percent and 0.8 percent relative to no RAS and ECC+CRC, respectively) while providing substantial RAS improvements.
Discussion
As we discussed earlier, providing resilience for stacked DRAM will become an important problem because DRAM stacks are not directly serviceable, making the tolerance of soft and hard errors critical for large-scale systems. Our proposal will be valuable for two primary reasons.
First, our proposal provides the benefit for memory vendors of having to support only a single (non-ECC) DRAM chip design. A key to conventional ECC DIMMs is that the same silicon design can be deployed for both ECC and non-ECC DIMMs. Forcing memory vendors to support two distinct silicon designs (one ECC, one non-ECC) greatly increases their engineering efforts and costs and complicates their inventory management.
Second, at the implementation level, the DRAM cache is just a commodity memory component with no knowledge of how the data storage is being used. Our proposal enables system designers (OEMs) to take a single processor (with stacked DRAM) design but configure different levels of protection for different deployments of the design. In a commodity consumer system, a designer might choose to turn off ECC entirely and
Mike O'Connor is a senior research scientist at NVIDIA. He performed the work for this article at Advanced Micro Devices. His research interests include GPUs, heterogeneous processors, and memory systems. O'Connor has an MS in electrical and computer engineering from the University of Texas at Austin. He is a senior member of IEEE and a member of the ACM.
Somayeh Sardashti
David A. Wood
University of
Wisconsin-Madison
[Figure 1. Normalized effective LLC capacity across workloads (apache through m8, with geometric mean) for an ideal compressed cache, BytePack, VSC-Inf, and VSC designs with progressively fewer tags.]

This article makes the following contributions:
Although some data (and most instructions) are difficult to compress, most workloads are highly compressible. In this article, we use C-PACK+Z, a dictionary-based algorithm4 with nine-cycle decompression latency. C-PACK+Z achieves an average compression ratio (that is, the original size over compressed size) of 3.9. Thus, compression has the potential to nearly quadruple cache size (shown as Ideal in Figure 1).
Previous compressed cache designs fail to achieve this potential for three main reasons. First, caches must compact compressed blocks into sets, which introduces an internal fragmentation problem. In Figure 1, BytePack represents an idealized compressed cache with infinite tags, which compacts compressed blocks on arbitrary byte boundaries. BytePack degrades normalized effective capacity to 3.1 on average. Second, practical compressed caches introduce another internal fragmentation problem by compacting compressed blocks into one or more sub-blocks, rather than storing compressed data on arbitrary byte boundaries.2 Variable-size compression (VSC) techniques relax the mapping constraint between tags and data and compact compressed blocks into a variable number of contiguous sub-blocks.2 The column labeled VSC-Inf in Figure 1 illustrates that compacting compressed blocks into zero to four 16-byte sub-blocks (but with infinite tags per set) degrades normalized effective capacity from 3.1 to 2.6 on average. Third, practical compressed caches have a fixed number of tags per set. The remaining columns in Figure 1 illustrate that reducing the number of tags, from infinite to a more practical two times the baseline, degrades the average normalized effective capacity from 2.6 to 1.7. Furthermore, VSC is not energy efficient. It must repack the sub-blocks in a set whenever a block's size changes to make contiguous free space. This action can increase LLC dynamic energy by a factor of nearly three, on average.
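A small packing experiment makes these losses concrete. The sketch below is a simplification, with made-up compressed sizes, a greedy fill, and a 1-Kbyte set; it illustrates the fragmentation effects rather than any real cache's allocation policy.

# Greedy packing of compressed 64-byte blocks into one cache set.
# sub_block=1 approximates BytePack (arbitrary byte boundaries);
# sub_block=16 models compaction into 16-byte sub-blocks;
# max_tags models a finite tag budget per set.

import math

def blocks_that_fit(compressed_sizes, set_bytes=1024, sub_block=16, max_tags=None):
    used = fitted = 0
    for size in compressed_sizes:
        footprint = math.ceil(size / sub_block) * sub_block  # internal fragmentation
        if used + footprint > set_bytes or (max_tags and fitted >= max_tags):
            break
        used, fitted = used + footprint, fitted + 1
    return fitted

sizes = [20, 64, 9, 40, 17, 25, 12, 30] * 6  # hypothetical compressed sizes
print(blocks_that_fit(sizes, sub_block=1))                 # byte-boundary packing
print(blocks_that_fit(sizes, sub_block=16))                # 16-byte sub-blocks
print(blocks_that_fit(sizes, sub_block=16, max_tags=16))   # plus a tight tag budget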
Superblocks (also known as sectors) have long exploited coarse-grained spatial locality to reduce tag overhead. Superblocks associate one address tag with multiple cache blocks, replicating only the per-block metadata, such as the coherence state. Figure A1 shows one set of a four-way associative sectored cache (SC), with four-block superblocks. Using four-block superblocks reduces the tag area by 70 percent compared with a conventional cache. However, Figure A1 illustrates that singletons, pairs, and trios (such as superblocks D, C, and A, respectively) result in internal fragmentation, which can lead to significantly higher miss rates.

Seznec showed that decoupling superblock tags from data blocks helps reduce internal fragmentation.1 Decoupled sectored (or superblock) caches (DSC) increase the number of superblock tags per set and use per-block back pointers to identify the corresponding tag. Figure A2 illustrates how decoupling can reduce fragmentation by letting two singletons (that is, blocks F1 and G3) share the same superblock. DSC uses more tag space than SC but less than a conventional cache because back pointers are small.

Figure A. Sectored cache (1) and decoupled sectored cache (2). DSC reduces internal fragmentation and can fit more blocks in the cache (Block E to Block H).

Reference
1. A. Seznec, "Decoupled Sectored Caches: Conciliating Low Tag Implementation Cost and Low Miss Ratio," Proc. 21st Ann. Int'l Symp. Computer Architecture, 1994, pp. 384-393.
Figure 2. A decoupled compressed cache. DCC cache layout (a); one tag entry (b); one back
pointer entry (BPE) (c); address space (d); address (e); and DCC lookup process (f). DCC
exploits superblocks and manages the cache at multiple granularities: coarse-grained
superblocks, singular cache blocks, and fine-grained sub-blocks.
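The lookup flow in panel (f) can be sketched in a few lines. The structures below are illustrative stand-ins for the superblock tag array and sub-blocked data array; compression is faked, and victim selection and the miss fill are elided.

# Sketch of the DCC lookup flow (Figure 2f): match the superblock tag,
# check the block's validity, then decompress on reads or compress and
# write sub-blocks on writes.

class SuperblockEntry:
    def __init__(self, tag, blocks_per_sb=4):
        self.tag = tag
        self.valid = [False] * blocks_per_sb
        self.sub_blocks = [None] * blocks_per_sb  # compressed data per block

def compress(data):       # stand-in for C-PACK+Z
    return data

def decompress(data):
    return data

def dcc_lookup(tag_array, sb_tag, blk_id, write_data=None):
    entry = next((e for e in tag_array if e.tag == sb_tag), None)
    if entry is None:                         # superblock miss: replace a victim
        entry = SuperblockEntry(sb_tag)
        tag_array[0] = entry                  # victim choice is illustrative
    # (Updates of LRU state, tag, and back pointer entries elided.)
    if write_data is not None:                # compress and write sub-blocks
        entry.sub_blocks[blk_id] = compress(write_data)
        entry.valid[blk_id] = True
        return None
    if not entry.valid[blk_id]:               # block miss within a live superblock
        return None                           # fill from lower level (not shown)
    return decompress(entry.sub_blocks[blk_id])

tags = [SuperblockEntry(i) for i in range(4)]
dcc_lookup(tags, sb_tag=7, blk_id=2, write_data=b"payload")
print(dcc_lookup(tags, sb_tag=7, blk_id=2))   # b'payload'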
Table: area overhead of compressed cache components, as a percentage of LLC area.

Components            DCC (%)   Co-DCC (%)   FixedC/VSC-2X (%)
Tag array               2.1        11.3            6.3
Back pointer array      4.4         5.4            0
Compressors             0.6         0.6            0.6
Decompressors           1.2         1.2            1.2
Total area overhead     8.3        18.5            8.1
Figure 3. DCC data array organization. The data array is divided into four
sequential regions, each containing a sub-block of a cache block.
Evaluation
We evaluate DCC using a full-system simulator based on GEMS.7 We model a multicore system with eight out-of-order cores; per-core private 32-Kbyte, 8-way L1 instruction and data caches; per-core private 256-Kbyte, 8-way L2 caches; and one shared 8-Mbyte, 16-way L3 cache.8 We use CACTI 6.5 to model power at 32 nm. We also use a detailed DRAM power model based on Micron Technology's power model.10 In this section, we report total system energy, which includes the energy consumption of processors (cores and caches), the on-chip network, and off-chip memory. For DCC and Co-DCC, we use four-block superblocks, 64-byte blocks, and 16-byte sub-blocks. With these parameters, DCC has similar area overhead to FixedC, which doubles the number of tags and compresses a block to half size, if possible, and to VSC-2X, which doubles tags but compresses a block into zero to four 16-byte sub-blocks (see the Compressed Cache Overheads sidebar).
Our evaluations use representative multithreaded and multiprogrammed workloads from commercial workloads (apache, jbb, oltp, zeus),11 SPEC-OMP (ammp, applu, equake, mgrid, wupwise),12 Parsec (blackscholes, canneal, freqmine),13 and mixes of SPEC CPU2006 benchmarks denoted as m1 to m8 (bzip2, libquantum-bzip2, libquantum, gcc, astar-bwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip, omnetpp-lbm).
Figure 4. Timing of a conventional cache (a) and DCC (b). A 64-byte block is returned in a burst
of four cycles on the same data bus. With DCC, only the matching sub-blocks are read and
fed directly into the decompression logic.
Figure 5. Normalized LLC effective capacity (a); normalized runtime (b); normalized total system energy (c). DCC and Co-DCC improve LLC utilization, resulting in higher performance and greater energy improvements than previous work and a 2× baseline.
References
12. V. Aslot et al., "SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance," Proc. Int'l Workshop OpenMP Applications and Tools (WOMPAT 01), 2001, pp. 1-10.
Richard Sampson
University of Michigan
Ming Yang
Siyuan Wei
Chaitali Chakrabarti
Arizona State University
Thomas F. Wenisch
University of Michigan
Much as every medical professional listens beneath the skin with a stethoscope today, we foresee a time when handheld medical imaging will become just as ubiquitous, with clinicians peering under the skin using a handheld imaging device. Mobile medical imaging is advancing rapidly, shrinking the footprint of bulky, often room-sized machines to compact handheld devices. In the last five years, research has demonstrated that by combining the increasing capabilities of mobile processors with intelligent system design, portable and even handheld imaging devices are not only possible, but commercially viable. In particular, ultrasound imaging has proven to be an especially successful candidate for high portability due to its safety and low transmit power, with commercial handheld 2D ultrasound devices marketed and being used in hospitals today. Newly developed portable imaging devices have not only led to demonstrated improvements in patient health,1 they have also enabled new applications for handheld ultrasound, such as disaster relief care2 and battlefield triage.3 However, despite the increasing capabilities of handheld
[Figure 1. Ultrasound imaging and beamforming overview (a through d): each focal point P in the image space lies at distance RP along a scanline, with receiving transducers at lateral offsets xi.]
transducer to enhance signal quality. The signal is upsampled using an interpolation filter to generate additional data points between the received samples. This process enhances resolution without the power and storage overheads of increasing the data sampling rate of the analog front end. Then, so-called apodization scaling factors are applied to the interpolated data to place greater weight on receivers near the origin of the transmission, because these signals are more accurate owing to their lower angle of incidence.
Once the data has been preprocessed (transformed), the beamsum operation can begin. In essence, this entails calculating the round-trip delay between the emitting transducer and all receiving transducers through each focal point, converting these delays into indices into each transducer's received-signal array, retrieving the corresponding data, and summing these values. Figure 1c illustrates this process. These partial images are then accumulated across subframes to form the final image.
The round-trip delay through focal point P to the receiving transducer at lateral offset $x_i$ is

$t_i = \frac{1}{c}\left(R_P + \sqrt{R_P^2 + x_i^2 - 2 x_i R_P \sin\theta}\right)$  (1)

where c is the speed of sound.
Figure 2. Sonic Millip3De hardware overview. Layer 1 comprises 120 × 88 transducers grouped into banks with one transducer per bank in each subaperture. Analog transducer outputs from each bank are multiplexed and routed over through-silicon vias (TSVs) to Layer 2, comprising 1,024 analog-to-digital converter (ADC) units operating at 40 MHz and static RAM (SRAM) arrays to store incoming samples. The stored data is passed via face-to-face links to Layer 3 for processing in the three stages of the 1,024-unit beamsum accelerator. The transform stage upsamples the signal to 160 MHz. The 10 units in the select stage map signal data from the receive time domain to the image-space domain in parallel for 10 scanlines. The reduce stage combines previously stored data from memory with the incoming signal from all 1,024 beamsum nodes over a unidirectional pipelined interconnect, and the resulting updated image is written back to memory.
computed efficiently using only add operations. The algorithm's iterative nature lends itself to an efficient data-streaming model, allowing the proposed hardware to exploit locality and eliminate inefficient address calculation and memory-access operations that are a bottleneck in conventional implementations. Our early analysis shows that the delta function between adjacent focal-point delays on a scanline forms a smooth curve, and indices can be approximated accurately (with error similar to that introduced by interpolation) over short intervals with quadratic approximations. We replace these exact delta curves with a per-transducer precomputed piecewise quadratic approximation constrained to allow an index error of at most 3 (corresponding to at most a 30-μm error between the estimated and exact focal point), thus resulting in negligible blur.
Using offline image-quality analysis, we have determined that, for a target imaging depth of 8 cm, we can meet the error constraints with only three piecewise sections. Each section requires precalculating three coefficients and a section cut-off, achieving a 250-times storage reduction relative to an exhaustive lookup table. Through careful pipelining of the beamforming process, the constants can be efficiently streamed from off-chip memory, limiting storage requirements within the beamforming accelerator.
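To see why additions suffice, note that a quadratic can be advanced one step with two additions using second-order forward differences. The sketch below is illustrative: the per-section constants here are made up, whereas the real coefficients are precomputed offline for each transducer.

# Generating focal-point sample indices with only additions. Each section
# supplies (d0, d1, d2, count): the initial delta, the initial first
# difference, the constant second difference, and the section length.

def focal_point_indices(sections, start=0.0):
    idx = start
    for d0, d1, d2, count in sections:
        delta, diff = d0, d1
        for _ in range(count):
            idx += delta       # advance to the next focal point's index
            yield round(idx)   # the offline fit bounds the index error to 3
            delta += diff      # the first difference of a quadratic is linear,
            diff += d2         # and its second difference is constant

# Three hypothetical piecewise sections covering one scanline.
sections = [(2.0, 0.010, 5e-4, 100), (3.5, 0.020, 2e-4, 100), (6.0, 0.0, 0.0, 56)]
print(list(focal_point_indices(sections))[:8])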
Sonic Millip3De
The Sonic Millip3De system hardware (shown in Figure 2) is divided into three distinct silicon die layers, connected vertically using through-silicon vias (TSVs): transducers and analog components; analog-to-digital converters (ADCs) and storage; and beamforming computation. The 3D-stacked chip connects to separate LPDDR2 (low-power double data rate 2) memory. All
which each correspond to the 1,024 transducer banks of the analog transducer layer. The ADCs sample at 40 MHz, storing the digital output into corresponding 6-Kbyte SRAM arrays. The SRAMs are clocked at 1 GHz and connect vertically to a corresponding computational unit on the beamforming accelerator layer, requiring a total of 24,000 face-to-face TSVs for data and address signals.
The final layer is the most complex of the three, comprising the beamforming accelerator processing units, a unidirectional pipelined interconnect, and a control processor (an M-class ARM core) that interfaces to the LPDDR2 off-chip memory.
Transform
The transform unit operates on all of the receive data, performing a 4-times linear interpolation on the raw receive signals. After upsampling, a constant apodization is applied, providing a weight based on transducer position, as previously described.
Select
The select unit remaps data from the receive time domain to the image-space domain using the algorithm described previously. The select unit is split into 10 subunits that concurrently operate on neighboring scanlines. These subunits each iterate over the same incoming datastream from a corresponding second-layer SRAM array in a synchronized fashion, reducing the number of times data must be read from the SRAM by a factor of 10. Figure 3 shows a block diagram of a single subunit.
Data is streamed simultaneously into the input first-in, first-out (FIFO) buffer of each
Figure 3. Select unit microarchitecture. Select units map upsampled echo data from the receive time domain to image focal
points. Sample data arrives from the transform unit at the input buffer, and each sample is either discarded or copied to
the output buffer as determined by our piecewise quadratic approximation algorithm. The constant storage holds the
precomputed constants and boundary for each approximation section. The adder chain calculates the next delta index value
to determine how far ahead the hardware needs to iterate to find the next focal point, with the final adder accumulating
fractional bits from previous additions. The select decrementor is initialized with the integer component of the adder chain.
In each cycle, the head of the input buffer is copied to the output if the decrementor is zero, or discarded if it is nonzero. The
section decrementor tracks when to advance to the next piece-wise section.
Reduce
The final stage is the reduce unit, which ties the 1,024 channels together via a pipelined network. Each reduce unit corresponds to a single node on the network and adds the
Table 1. Relevant ultrasound system parameters.

Parameter                            Value
Total transducers                    10,560
Receive transducers per subframe     1,024
Samples per channel                  4,096 at 12 bits
Imaging depth                        10 cm
Scan angle                           π/6
Sampling frequency                   40 MHz
Interpolation factor                 4 (160 MHz after upsampling)
Speed of sound                       1,540 m/s
Frame rate                           1 frame/s
Table 2. CNR values for ideal system and Sonic Millip3De (SM3D). Values correspond to cysts shown in Figure 4.

Left column of cysts      Right column of cysts
Ideal      SM3D           Ideal      SM3D
3.59       3.58           1.93       1.85
3.18       2.68           3.21       2.67
1.51       1.94           1.41       1.85
1.61       1.62           2.10       2.01
1.10       0.33           1.18       0.39
2.39       2.43           2.30       2.34
[Figure 4. 2D slices (x versus z, in mm) of the simulated cyst images for the ideal system (a) and Sonic Millip3De (b).]
(2- to 10-cm depth). Table 1 shows the relevant ultrasound system parameters. We generate 3D images using both our system (iterative delay calculation and fixed-point adders) as well as an ideal system (full delay calculation and double-precision floating-point arithmetic). Figure 4 shows a 2D slice for both images. We quantitatively compare image quality using contrast-to-noise ratios (CNRs) for each cyst, shown in Table 2. Overall, Sonic Millip3De's image quality is nearly indistinguishable from the ideal case, providing high image quality and validating our algorithm design.
We analyze the full-system power of Sonic Millip3De using a register-transfer-level design targeting a 45-nm standard-cell library and Spice models of the global interconnect. Using results from synthesis (SRAM, beamformer, interconnect) and published values (transducers, analog-to-digital converters, memory interface, DRAM), we determine that the design requires 14.6 W in current 45-nm technology, falling a bit short of the ambitious 5-W power target. However, under current
Our approach
As the price of memory drops, modern databases aren't typically disk-I/O-bound,1,2 with many databases now either fitting into main memory or having a memory-resident working set. At Facebook, 800 servers supply over 28 Tbytes of in-memory data to users.3 Despite the relative scarcity of memory pins, there is ample evidence that these and other large data workloads don't saturate the available bandwidth and are largely
Lisa Wu
Raymond J. Barker
Martha A. Kim
Kenneth A. Ross
Columbia University
[Figure 1. Range partitioning example: an input table of sales transactions (date, quantity, SKU) is split by splitter values on the date key into contiguous, ordered partitions.]

Partitioning background
Each record in the input table is assigned to exactly one partition on the basis of the value of the key field. Figure 1 shows an example table of sales transactions partitioned using the transaction date as the key. This work focuses on a particular partitioning method called range partitioning, which splits the space of keys into contiguous ranges, as illustrated in Figure 1, where sales transactions are partitioned by quarter. The boundary values of these ranges are called splitters.
Partitioning a table allows fine-grained synchronization and data distribution. Moreover, when tables become so large that they or their associated processing metadata can't fit in the cache, partitioning improves the performance of many critical database operations, such as joins, aggregations, and sorts.9-11 Partitioning is also used in databases for index building, load balancing, and complex query processing.12 More generally, a partitioner can improve locality for any application that needs to process large datasets in a divide-and-conquer fashion, such as histogramming, image alignment and recognition, MapReduce-style computations, and cryptanalysis.
To demonstrate the benefits of partitioning, let's examine joins. A join takes a common key from two tables and creates a new table containing the combined information from both tables. For example, to analyze how weather affects sales, we would join the sales records in SALES with the weather records in WEATHER, where SALES.date = WEATHER.date. If the WEATHER table is too large to fit in the cache, this process will have poor cache locality, as the left side of Figure 2 depicts. On the other hand, if both tables are partitioned by date, each partition can be joined in a pairwise fashion, as the right side of Figure 2 illustrates. When each partition of the WEATHER table fits in the cache, the per-partition joins can proceed more rapidly. When the data is large, the time spent partitioning is more than offset by the time saved with the resulting cache-friendly, partition-wise joins.
Join performance is critical because most queries begin with one or more joins to cross-reference tables, and as the most data-intensive and costly operations, their
Figure 2. Joining two large tables exceeds cache capacity. Thus, join implementations partition tables first and then compute
partition-wise joins, each of which exhibits substantially improved cache locality.10,11 Joins are extremely expensive on large
datasets, and partitioning represents up to half of the observed join time.11
[Figure: percentage of query execution time spent in joins versus other operations for the 22 TPC-H queries and their average.]
[Figure: software range-partitioning throughput for 1 and 16 threads as the number of partitions grows from 100 to 500, compared with the potential system-memory throughput.]
Figure 5. Block diagram of a typical two-core system with Hardware-Accelerated Range Partitioner (HARP) integration. New components (HARP and stream buffers) are shaded.
We conservatively preallocate space for the output tables beforehand to streamline the inner loop. The partitioning inner loop runs over an input table reading one record at a time, computing its partition using a partition function, and then writing the record to the destination partition. We implement the partition function using an equality range-partitioning implementation,14 which performs a binary search of the splitters.
Overview
Figure 5 shows a block diagram of the major components in a system with range-partitioning acceleration. Two stream buffers, one running from memory to HARP (SBin) and the other from HARP to memory (SBout), decouple HARP from the rest of the system. The range-partitioning computation is accelerated in hardware (indicated by the double arrow in Figure 5), while inbound and outbound data-stream management is left to software (single arrows in Figure 5), maximizing flexibility and simplifying the interface to the accelerator. One set of instructions provides configuration and control for HARP, which freely pulls data from and pushes data to the stream buffers, while a second set of streaming instructions moves data between memory and the stream buffers. Because data moves in a pipeline (that is, it is streamed in from memory via the streaming framework, partitioned with HARP, and then streamed back out), the lowest-throughput component determines overall system throughput.
Figure 6. HARP draws records in bursts, serializing them into a single stream that is fed into a pipeline of comparators. At
each stage of the pipeline, the record key is compared with a splitter value, and the record is either filed in a partition buffer
(downward) or advanced (to the right) according to the comparison outcome. As records destined for the same partition
collect in the buffers, the merge stage identifies and drains the fullest buffer, emitting a burst of records all destined for the
same partition. (WE: write enable.)
HARP accelerator
The HARP acceleration is managed via three instructions. The instruction set_splitter is invoked once per splitter to delineate a boundary between partitions; partition_start signals HARP to start pulling data from SBin; and partition_stop signals HARP to stop pulling data from SBin and drain all in-flight data to SBout. To program a 15-way partitioner, for example, HARP uses seven set_splitter instructions to set each splitter value, followed by a partition_start to start partitioning. Because HARP's microarchitectural state is not visible to other parts of the machine, the splitter values are not lost upon interruption.
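As a hedged sketch, the programming sequence for the 15-way example might look as follows in C; harp_set_splitter, harp_partition_start, and harp_partition_stop are hypothetical stand-ins for the three instructions, stubbed here only so the example is self-contained:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the three HARP instructions; in a real
   system each would be a single instruction, not a function call. */
static void harp_set_splitter(int idx, uint32_t value) {
    printf("set_splitter %d = %u\n", idx, value);
}
static void harp_partition_start(void) { puts("partition_start"); }
static void harp_partition_stop(void)  { puts("partition_stop"); }

void run_15_way_partitioner(const uint32_t splitters[7]) {
    /* Seven splitters program a 15-way equality range partitioner:
       each splitter bounds a range and also gets its own partition. */
    for (int i = 0; i < 7; i++)
        harp_set_splitter(i, splitters[i]);

    harp_partition_start();   /* HARP begins pulling bursts from SBin */

    /* ... software streams the input table into SBin and drains
       SBout back to memory via the streaming instructions ... */

    harp_partition_stop();    /* stop and drain in-flight data to SBout */
}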
HARP pulls and pushes records in 64-byte bursts (tuned to match the system vector width and DRAM burst size). The HARP microarchitecture consists of three modules (the serializer, the conveyor, and the merge), as Figure 6 depicts, and is tailored to range partition data highly efficiently.
[Figure 7 highlights two additions to the cache's request/fill path: a dedicated data bus to and from the memory subsystem, and a multiplexer that steers the fill data.]
Figure 7. Implementation of streaming instructions into the existing datapath of a generic last-level cache request/fill microarchitecture. The required minimal modifications are shaded.
Evaluation
To evaluate the throughput, power, and area efficiency of our design, we implemented HARP in Bluespec System Verilog (www.bluespec.com). The partitioner evaluated here supports 16-byte records with 4-byte keys. Assuming 64-byte DRAM bursts, this works out to four records per burst. We evaluate the overhead of the streaming framework using CACTI.17 For further details about the methodology, including synthesis settings, please refer to the methodology section of our other work.18
We evaluate the proposed HARP system in the following categories: HARP throughput, streaming throughput, area and power, and energy efficiency.
HARP throughput
Figure 8 plots the throughput of three range partitioner implementations: single-threaded software, multithreaded software with 16 threads, and a single thread with HARP.
[Figure 8: Partitioning throughput versus number of partitions (100 to 500) for 1 thread, 16 threads, and 1 thread + HARP.]
[Figure 9: Measured streaming throughput for memcpy, vectorized assembly (ASM, SSE), and scalar assembly, from our experiments and from prior x86 results, with values between 4.6 and 6.8 GBps.]
Streaming throughput
Our results in Figure 9 show that C's standard library memcpy provides similar throughput to hand-optimized vector code, whereas scalar code's throughput is slightly lower. For comparison, we have also included the results of a similar experiment published by IBM Research.19 Based on these measurements, we conservatively estimate that the streaming framework can bring in data at 4.6 GBps and write results to memory at 4.6 GBps.
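Throughput of this kind can be approximated with a simple microbenchmark. The sketch below (the buffer size and timing method are arbitrary choices, not the methodology behind Figure 9) times one large memcpy:

/* Times a single large memcpy and reports GBps of data copied. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void) {
    const size_t n = (size_t)256 << 20;     /* 256 Mbytes */
    char *src = malloc(n), *dst = malloc(n);
    if (!src || !dst) return 1;
    memset(src, 1, n);                      /* touch pages up front */
    memset(dst, 0, n);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("memcpy: %.2f GBps\n", n / sec / 1e9);
    free(src);
    free(dst);
    return 0;
}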
Table. Area and power of HARP and the stream buffers, in absolute terms and relative to a Xeon core.

No. of partitions | HARP area, mm² (% of Xeon) | HARP power, W (% of Xeon) | Stream-buffer area, mm² (% of Xeon) | Stream-buffer power, W (% of Xeon)
15  | 0.16 (0.4)  | 0.01 (0.3) | 0.07 (0.2) | 0.063 (1.3)
31  | 0.31 (0.7)  | 0.02 (0.4) | 0.07 (0.2) | 0.079 (1.6)
63  | 0.63 (1.5)  | 0.04 (0.7) | 1.30 (0.2) | 0.078 (1.6)
127 | 1.34 (3.1)  | 0.06 (1.3) | 0.11 (0.3) | 0.085 (1.7)
255 | 2.83 (6.6)  | 0.11 (2.3) | 0.13 (0.3) | 0.100 (2.0)
511 | 5.82 (13.6) | 0.21 (4.2) | 0.18 (0.4) | 0.233 (4.7)
Energy efficiency
From an energy perspective, this slight increase in power is overwhelmed by the improvement in throughput. Figure 10 compares the partitioning energy per gigabyte of data of software (both serial and parallel) against HARP-based alternatives. The data show a 6.3 to 8.7 times improvement in single-threaded partitioning energy with HARP.
By design, HARP preserves the record order: all records in a partition appear in the same relative order in which they appeared in the input.
streaming framework decouples the microarchitecture of the accelerator from the specifics of data layout and management. This allows seamless integration of the accelerator into existing software, as well as a clean mechanism for handling context switches and interrupts by saving and restoring just the contents of the stream buffers.
The research demonstrates the potential of data-oriented specialization. Moving data through the memory subsystem and CPU cache hierarchy consumes more than double the energy of the computation itself.20 With an application-specific integrated circuit designed specifically to process tables in a streaming fashion, the HARP system delivers an order of magnitude improvement in energy efficiency. The overall system design also makes it easy to introduce other streaming accelerators, such as specialized aggregators, joiners, sorters, filters, or compressors, to expand both the use and benefits of this approach.
MICRO
References
16. N.P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," Proc. 17th Ann. Int'l Symp. Computer Architecture, 1990, pp. 364-373.
17. HP Labs, CACTI, 2008; http://www.hpl.hp.com/research/cacti.
Angshuman Parashar
Michael Pellauer
Michael Adler
Bushra Ahsan
Neal Crago
Intel
Daniel Lustig
Princeton University
Vladimir Pavlov
Intel
Antonia Zhai
University of Minnesota
Mohit Gambhir
Aamer Jaleel
Randy Allmon
Rachid Rayess
Stephen Maresh
Joel Emer
Intel
We classify prior work on architectures for programmable accelerators according to the taxonomy shown in Figure A (although some have been proposed as stand-alone processors instead of accelerators complementing a general-purpose CPU). Temporal architectures (class 0 in the taxonomy) are best suited for data-parallel workloads and are outside of this article's scope. Within the spatial domain (classes 1x), the trade-offs between logic-grained architectures (class 10), such as field-programmable gate arrays (FPGAs), and instruction-grained architectures (classes 11x) are well understood.1-3 In this sidebar, we focus on prior work on instruction-grained spatial architectures with centralized and distributed control paradigms.
[Figure A: Taxonomy of programmable accelerators. Temporally programmed designs form class 0 (SIMT, SIMD, MIMD); spatially programmed designs split into logic-grained and instruction-grained architectures, the latter with centralized or distributed control, including non-PC-controlled designs.]
need for an explicit control core. Other recent work such as Garp,2 Chimaera,8 and ADRES3 (Architecture for Dynamically Reconfigurable Embedded System) similarly integrates lookup-table-based or coarse-grained reconfigurable logic controlled by a host processor, either as a coprocessor or within the processor's datapath.
Matrix is an array of 8-bit function units with a configurable network.1 With different configurations, Matrix can support VLIW, SIMD, or Multiple-SIMD computations. The key feature of the Matrix architecture is its ability to deploy resources for control based on application regularity, throughput requirements, and space available.
PipeRench is a coarse-grained reconfigurable logic system designed for virtualization of hardware to support high-performance custom computations through self-managed dynamic reconfiguration.9 It is constructed from 8-bit PEs. The functional unit in each PE contains eight three-input lookup tables (LUTs) that are identically configured.
In the dataflow computing paradigm, instructions are dispatched for execution when tokens associated with input sources are ready. Each instruction's execution results in the broadcast of new tokens to dependent instructions. Classical dataflow architectures used this as a centralized control mechanism for spatial fabrics.10,11 However, other projects use token triggering to issue operations in the PEs,5,6 whereas the centralized control unit uses a more serialized approach.
In a dataflow-triggered PE, the microarchitecture manages the token-ready bits associated with input sources. The triggered-instruction approach, in contrast, replaces these bits with a vector of architecturally visible predicate registers. By specifying triggers that span multiple predicates, the programmer uses these bits to indicate data readiness or for other purposes, such as control-flow decisions. In a classic dataflow architecture, multiple pipeline stages are devoted to marshaling tokens.
References
1. E. Mirsky and A. DeHon, "MATRIX: A Reconfigurable Computing Architecture with Configurable Instruction Distribution and Deployable Resources," Proc. IEEE Symp. FPGAs for Custom Computing Machines, 1996, pp. 157-166.
2. J. Hauser and J. Wawrzynek, "Garp: A MIPS Processor with a Reconfigurable Coprocessor," Proc. IEEE Symp. FPGAs for Custom Computing Machines, 1997, pp. 12-21.
3. B. Mei et al., "ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix," Proc. 13th Int'l Conf. Field-Programmable Logic and Applications, 2003, pp. 61-70.
4. J. Hoogerbrugge and H. Corporaal, "Transport-Triggering vs. Operation-Triggering," Compiler Construction, LNCS 786, Springer-Verlag, 1994, pp. 435-449.
5. D. Burger et al., "Scaling to the End of Silicon with EDGE Architectures," Computer, vol. 37, no. 7, 2004, pp. 44-55.
6. S. Swanson et al., "The WaveScalar Architecture," ACM Trans. Computer Systems, vol. 25, no. 2, 2007, pp. 4:1-4:54.
7. V. Govindaraju, C.-H. Ho, and K. Sankaralingam, "Dynamically Specialized Datapaths for Energy Efficient Computing," Proc. 17th Int'l Conf. High Performance Computer Architecture (HPCA), 2011, pp. 503-514.
8. Z.-A. Ye et al., "CHIMAERA: A High-Performance Architecture with a Tightly-Coupled Reconfigurable Functional Unit," Proc. 27th Int'l Symp. Computer Architecture, 2000, pp. 225-235.
9. H. Schmit et al., "PipeRench: A Virtualized Programmable Datapath in 0.18 Micron Technology," Proc. IEEE Custom Integrated Circuits Conf., 2002, pp. 63-66.
10. J.B. Dennis and D.P. Misunas, "A Preliminary Architecture for a Basic Data-Flow Processor," Proc. 2nd Ann. Symp. Computer Architecture, 1975, pp. 126-132.
11. K. Arvind and R.S. Nikhil, "Executing a Program on the MIT Tagged-Token Dataflow Architecture," IEEE Trans. Computers, vol. 39, no. 3, 1990, pp. 300-318.
12. A. Smith et al., "Dataflow Predication," Proc. 39th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2006, pp. 89-102.
13. M. Taylor et al., "The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs," IEEE Micro, vol. 22, no. 2, 2002, pp. 25-35.
14. Z. Yu et al., "An Asynchronous Array of Simple Processors for DSP Applications," Proc. Solid-State Circuits Conf., 2006, pp. 1696-1705.
15. D. Truong et al., "A 167-Processor Computational Platform in 65 nm CMOS," IEEE J. Solid-State Circuits, vol. 44, no. 4, 2009, pp. 1130-1144.
16. G. Panesar et al., "Deterministic Parallel Processing," Int'l J. Parallel Programming, vol. 34, no. 4, 2006, pp. 323-341.
[Figure: A spatial-programming example mapping a sorting computation onto an array of PEs. Sorted sublists (for example, 5 32 83; 10 14 88; 6 11 24; 11 30 72) flow between PEs that track a current element (cur undef, cur = 17, cur = 27), under a control loop: for x = 1..NPASSES; for y = 1..k.]
[Figure: Merge sort worker inner loop for a PC-based spatial PE with register-mapped queues (PC+RegQueue), in pseudo-assembly. Labels check_a, check_b, check_o, send_a, send_b, a_done, and done test %in0.first and %in1.first against EOL with beqz/beq, compare the queue heads with cmp.lt %r0, %in0.first, %in1.first, branch with bnez %r0, send_a, enqueue the smaller head to %out0, dequeue the consumed input, and jump back to check_a.
Static instructions: 18. Average instructions per iteration: 10. Average branches per iteration: 7.]
Table 1. Adding features to a PC-based ISA to improve efficiency for spatial programming. Features: PC (baseline); RegQueue (expose register-mapped memory queues to the ISA and test via active polling); FusedDeq; RegQSelect; RegQStall; QMultiThread; Predication; Augmented. Notes on the features range from good improvement to minimal improvement, with bubbles and overserialization.
[Figure 4: Merge sort worker inner loop with the augmented PC-based ISA (PC+Augmented), in pseudo-assembly. Starting at start:, the loop tests %in1.first against EOL with beq and cmp.ne, compares the queue heads with cmp.ge into predicate p2, and uses predicated enqueues, send_b:(p2) enq and send_a:(!p2) enq, followed by a fused dequeue (deq %in0, deq %in1) before jumping back to start.
Static instructions: 9. Average instructions per iteration: 6 issued, 5 committed. Average branches per iteration: 3. Speedup versus PC+RegQueue (see Figure 3): 1.4 times.]
rule sendA
when listA.first() != EOL && listB.first() != EOL && listA.data < listB.data do
outList.send(listA.first()); listA.deq();
end rule
rule sendB
when listA.first() != EOL && listB.first() != EOL && listA.data >= listB.data do
outList.send(listB.first()); listB.deq();
end rule
rule drainA
when listA.first() != EOL && listB.first() == EOL do
outList.send(listA.first()); listA.deq();
end rule
rule drainB
when listA.first() == EOL && listB.first() != EOL do
outList.send(listB.first()); listB.deq();
end rule
rule bothDone
when listA.first() == EOL && listB.first() == EOL do
listA.deq(); listB.deq();
end rule
Figure 5. Traditional guarded-action merge sort worker algorithm. This paradigm naturally
separates the representation of data transformation (via actions) from the representation of
control flow (via guards). This results in a higher level of code readability, because the control
decisions related to each action are naturally grouped and isolated.
Triggered instructions
A large degree of the inefficiency we have discussed here stems from the issue of efficiently composing Boolean control-flow decisions. To overcome this, we draw inspiration from the historical computing paradigm of guarded actions, a field that has a rich technical heritage including Dijkstra's language of guarded commands,11 Chandy and Misra's Unity,12 and the Bluespec hardware description language.13
Computation in a traditional guarded-action system is described using rules composed of actions (state transitions) and guards (Boolean expressions that describe when a certain action is legal to apply). A scheduler is responsible for evaluating the guards of the actions in the system and posting ready actions for execution.
Triggered-instruction architecture
A triggered-instruction architecture (TIA) applies this concept directly to controlling the scheduling of operations on a PE's datapath at an instruction-level granularity. In the historical guarded-action programming paradigm, arbitrary Boolean expressions are allowed in the guard, and arbitrary data transformations can be described in the action. To adapt this concept into an implementable ISA, both must be bounded in complexity.
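As a rough illustration of how such bounded triggers can be evaluated, the following C sketch (a simplification with invented encodings, not the actual hardware) checks each instruction's required predicate values and input-channel status and, like a priority encoder, selects the first ready instruction:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint8_t pred_mask;    /* which predicate bits the trigger examines */
    uint8_t pred_value;   /* required values of those bits */
    uint8_t chan_mask;    /* input channels that must have data */
    const char *name;     /* label, for tracing */
} trigger_t;

/* Returns the index of the first instruction whose trigger is
   satisfied, or -1 if none is ready. */
int schedule(const trigger_t *t, int n, uint8_t preds, uint8_t chan_ready) {
    for (int i = 0; i < n; i++) {
        bool preds_ok = (preds & t[i].pred_mask) == t[i].pred_value;
        bool chans_ok = (chan_ready & t[i].chan_mask) == t[i].chan_mask;
        if (preds_ok && chans_ok)
            return i;
    }
    return -1;
}

int main(void) {
    /* Two toy instructions: doCheck fires when p0 == 0 and both input
       channels have data; sendA fires when p0 == 1 and p1 == 1. */
    trigger_t prog[] = {
        { 0x1, 0x0, 0x3, "doCheck" },
        { 0x3, 0x3, 0x1, "sendA"   },
    };
    int i = schedule(prog, 2, 0x0, 0x3);
    printf("fired: %s\n", i >= 0 ? prog[i].name : "(none)");
    return 0;
}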
doCheck:
when (!p0 && %in0.tag != EOL
&& %in1.tag != EOL) do
cmp.ge p1, %in0.data, %in1.data (p0 := 1)
sendA:
when (p0 && p1) do
enq %out0, %in0.data (deq %in0, p0 := 0)
sendB:
when (p0 && !p1) do
enq %out0, %in1.data (deq %in1, p0 := 0)
drainA:
when (%in0.tag != EOL && %in1.tag == EOL) do
enq %out0, %in0.data (deq %in0)
drainB:
when (%in0.tag == EOL && %in1.tag != EOL) do
enq %out0, %in1.data (deq %in1)
bothDone:
when (%in0.tag == EOL && %in1.tag == EOL) do
nop (deq %in0, deq %in1)
Static instructions: 6
Average instructions per iteration: 2
Speedup versus PC+RegQueue (see Figure 3): 5 times
Speedup versus PC+Augmented (see Figure 4): 3 times
Figure 6. The triggered-instruction merge sort worker retains the clean separation of control and data transformation of the generalized guarded-action version shown in Figure 5. The restriction is that the control decisions must be stored in single-bit predicate registers, and the action is limited to the granularity of one instruction. As a result, the sendA and sendB rules are refactored such that the comparison takes place in the earlier doCheck rule, which sets up predicate register p1 with the result of the comparison.
[Figure 7 shows input links feeding an input switch into tagged input channels (tag and data, with empty/full status), registers Reg 0 to Reg 3, predicate registers P0 to P3, a scheduler driven by the instruction triggers of a static set of instructions, operand select into an ALU, predicate and data updates, and tagged output channels through an output switch to output links.]
Figure 7. A PE based on our triggered-instruction architecture (TIA). The PE is preconfigured with a static set of instructions.
Evaluation of workloads
Approach
Our evaluation fabric is a scalable spatial architecture built from an array of TIA PEs organized into blocks, which form the granularity of replication of the fabric. Each block contains a grid of interconnected PEs, a set of scratchpad slices distributed across the block, a private L1 cache, and a slice of a shared L2 cache that scales with the number of blocks on the fabric. Figure 9 provides an illustration of a block and the parameters used in our evaluation. Each PE has the following architectural parameters:

Datapath: 32 bits
Sources per instruction: 2
Registers: 8
Predicates: 8
Maximum triggered instructions: 16
[Figure 8 shows an array of trigger-instruction pairs whose triggers are evaluated in a trigger resolution stage against the predicate registers, channel status, and tags; a priority encoder selects a ready triggered instruction for execution on the datapath, which feeds back predicate updates.]
Figure 8. Microarchitecture of a TIA scheduler. The Trigger Resolution stage is implemented as combinational logic. This is a
low-power approach because only local state updates and I/O channel activity consume dynamic power.
Figure 9. Data used in our evaluation: block illustration (a) and parameters (b). Each block contains a grid of interconnected PEs, a set of scratchpad slices distributed across the block, a private L1 cache, and a slice of a shared L2 cache that scales with the number of blocks on the fabric. The block parameters are as follows:

PEs: 32
Network: mesh (one-cycle link latency)
Scratchpad: 8 Kbytes (distributed)
L1 cache: 4 Kbytes (four banks, 1 Kbyte per bank)
L2 cache: 24-Kbyte shared slice
DRAM: 200-cycle latency
Estimated clock rate: 2 GHz
Performance results
Figure 10 demonstrates the magnitude of performance improvement that can be achieved using a spatially programmed accelerator. Across our workloads, we observe area-normalized speedup ratios ranging from 3 times (fast Fourier transform) to about 22 times (SHA-256) compared to the traditional core's performance, with a geometric mean of 8 times.
Now let's analyze how much of this benefit is attributable to the use of triggered instructions by comparing the rate-limiting inner loops of our workloads to implementations on spatial architectures using the PC+RegQueue and PC+Augmented control schemes.
Table 3 shows the average frequency of branches in the dynamic instruction stream for the PC-based spatial architectures. The branch frequency ranges from 8 to 70 percent, with an average of 50 percent. These inner loops are all branchy and dynamic, far more so than traditional sequential code.
This dynamism manifests itself as additional control cycles for both PC-based architectures. Figure 11 shows the dynamic execution cycles for all architectures broken down into cycles spent on operations in relevant categories. The cycle counts are all normalized to the number of data computation operations executed by PC+RegQueue.
[Table 2: The workloads used in our evaluation with their Berkeley Dwarf16 classification, domain, and the traditional-core implementations used for comparison (auto-vectorization or nonpublic optimized implementations). Surviving classifications include combinational logic (cryptography) for AES and SHA-256, Knuth-Morris-Pratt search, spectral methods (signal processing), graph traversal (supercomputing), k-means clustering (data mining), merge sort (map/reduce; databases), a flow classifier (networking), and scientific computing.]
[Figure 10: Area-normalized performance ratio over the traditional core for each workload (y-axis up to 25), with the geometric mean.]
Because these are critical rate-limiting loops in the spatial pipeline, there are fewer opportunities for multiplexing unrelated work onto shared PEs. Despite this, the workloads show benefits from avoiding overserialization.
Third, the workload that sees the largest benefit from triggered instructions is Merge Sort. Merge Sort has the highest dynamic
Table 3. Percentage of dynamic instructions that are branches in the rate-limiting step's inner loop.

Control scheme | AES | DMM | FFT | Flow classifier | Graph-500 | k-means | KMP search | Merge sort | SHA-256
PC+RegQ | 58 | 50 | 36 | 50 | 50 | 69 | 70 | 63 | 50
PC+Aug  | 33 | 11 | 50 | 40 | 29 | 14 | 50 | 22 | 28
[Figure 11 plots dynamic cycles (0.0 to 5.0, normalized) for PC+RegQ, PC+Aug, and TIA on each workload and the mean, split into queue operations (Q.ops), predicated-false operations (F.ops), control operations (C.ops), wait cycles, and data operations (D.ops).]
Figure 11. Breakdown of dynamic execution cycles in rate-limiting inner loops normalized to data computation operations (D.ops) executed by PC+RegQueue. This demonstrates the ability of triggered instructions to reduce queuing, control and predicated-false operations, and wait cycles arising from over-serialization.
[Figure 12 plots static instruction counts (up to 45) for PC+RegQ, PC+Augmented, and TIA on each workload and the mean.]
Figure 12. Static instruction counts for rate-limiting inner loops. See our
previous work14 for an analysis of why triggered instructions can never
result in an increase in instruction count compared to PC-based approaches.
[Figure 13 plots average dynamic instruction counts (up to 45), split into control and non-control instructions, for PC+RegQ, PC+Aug, and TIA on each workload and the mean.]
Figure 13. Average dynamic instruction counts for rate-limiting inner loops. In this context, removal of instructions can directly
translate into workload speedup.
Hyojin Sung
Rakesh Komuravelli
Sarita V. Adve
University of Illinois at
Urbana-Champaign
Background
DeNovo is based on the observation that although the global address space offered by shared memory is attractive to programmers, current wild shared-memory programming environments that allow data races, ubiquitous nondeterminism, unstructured parallelism, and complex consistency models make programming, testing, and maintaining software difficult.8 This has led to much recent software research on more disciplined shared-memory programming models.9 The data-race-free model adopted for C, C++, and Java is one successful initial example of discipline, but more work is required to address all of the concerns just mentioned.8
The DeNovo project asks this question: If software becomes more disciplined, can we rethink the multicore memory hierarchy to provide more complexity-, performance-, and energy-efficient hardware than the current state of the art? The prior DeNovo work7 addressed this question for deterministic programs that contain annotations motivated by the Deterministic Parallel Java (DPJ) language.10 DeNovo proposed a coherence protocol for such programs that has no transient states, no invalidation message traffic, no sharer lists in directories, and no false sharing. Overall, compared to state-of-the-art MESI (modified, exclusive, shared, invalid) cache coherence protocols, DeNovo is much simpler and easier to verify and extend, performs comparably or better, and is more energy efficient (since it reduces cache misses and network traffic) for a range of deterministic codes.
Figure 1 shows how DeNovo and DeNovoND can simplify coherence activities and reduce network traffic for deterministic and nondeterministic accesses, compared to the conventional hardware MESI coherence protocol, as further elaborated in the rest of this article.
Software assumptions
There is much recent research on disciplined shared-memory programming models
with explicit and structured parallelism, synchronization, and communication. Although
the details vary, they all share the goal of
making parallel programming easier and
safer, and they expose metadata about program structures and memory access patterns
(either automatically extracted or provided
by programmers) to prove desirable properties of programs.
Determinism. Many disciplined programming models (and DeNovo) target the safety
property of determinism. We assume the following properties about disciplined software
to guarantee determinism:
[Figure 1 contrasts coherence activity across cores under MESI with DeNovo/DeNovoND, marking global synchronization (barriers), writeable regions in the previous phase, atomic regions in critical sections, and lock synchronization (critical sections).]
Figure 1. Conceptual comparison of the complexity and key coherence activities of the
MESI, DeNovo, and DeNovoND protocols. The explosion symbols represent invalidation
of data in each cores private cache, and the thin black arrows between explosion
symbols represent network messages. MESI sends invalidation messages on every
write miss assuming other cores may have concurrently read the accessed data.
DeNovo and DeNovoND constrain invalidations to well-defined synchronization points
such as barriers and lock acquires/releases with the data-race freedom guarantee. At
such synchronization points, the reading core performs a local self-invalidation in its
cache for potentially stale data without incurring additional network traffic. The dot-filled
explosion symbols represent self-invalidations by DeNovo. The black-filled explosion
symbols represent additional self-invalidations by DeNovoND to deal with
nondeterministic critical sections. To identify which data is potentially stale, DeNovo
relies entirely on programmer-provided region information, whereas DeNovoND uses
a combination of software information and simple hardware support in the form of
access signatures transferred on a lock hand-off.
Data-race freedom.
Strong isolation of accesses in all atomic and deterministic parallel constructs; that is, these constructs appear to execute atomically.
Determinism by default; that is, any parallel construct that does not contain an explicit nondeterministic construct provides deterministic output for a given input.
Sequential composition for deterministic constructs; that is, tasks of a deterministic construct appear to occur in the sequential order implied by the program.
These guarantees not only ensure sequential consistency but also allow programmers to reason with very high-level, strongly isolated, and composable components such as complete foreach constructs and all atomic sections.
DeNovoND assumes that, similar to DPJ programs, atomic sections and accesses to atomic regions are identified. It also assumes that the atomic sections are converted to locks, where the same lock is used to protect a given atomic region in a given parallel phase.
Overall, DeNovo showed that our software-hardware codesign approach can lead to simpler and more efficient hardware than the state of the art, but only for deterministic programs.
Consistency model
Data coherence
The coherence mechanism must simply ensure that a read returns the value from the last write as defined by the consistency model just described. As with DeNovo, we divide the coherence mechanism into two components: no stale data, and locatable up-to-date data. DeNovoND implements hardware mechanisms to meet these requirements as DeNovo does, dealing with the two issues separately.
No stale data. For nonatomic accesses, we take the same approach as DeNovo. Thus, at the start of a parallel phase, the compiler inserts self-invalidations for data regions that could have been written by other cores in the previous phase.
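A conceptual sketch of this self-invalidation step follows; the per-word state, region tags, and touched bits are simplified stand-ins for the actual DeNovo structures:

#include <stdbool.h>
#include <stdint.h>

#define CACHE_WORDS 1024

typedef struct {
    bool     valid;    /* per-word coherence state */
    bool     touched;  /* accessed by this core in the current phase */
    uint16_t region;   /* region ID associated with the word */
} word_state_t;

/* At the start of a parallel phase: locally invalidate every cached
   word in a region that other cores may have written in the previous
   phase. Words this core itself accessed are known up to date under
   data-race freedom. No network messages are needed. */
void self_invalidate(word_state_t cache[CACHE_WORDS],
                     const bool *stale_region) {
    for (int w = 0; w < CACHE_WORDS; w++) {
        if (!cache[w].touched && stale_region[cache[w].region])
            cache[w].valid = false;
        cache[w].touched = false;   /* reset for the next phase */
    }
}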
Example
Figure 2 provides an example to illustrate how DeNovoND uses the Bloom filters. The code snippet on the left depicts three variables, a, b, and c, in atomic region xR. It then shows a critical section protected by lock x with atomic read and write effects on region xR. The right side of the figure shows an execution with two cores, C1 and C2. It also shows the signatures at each core, assuming a perfect hash function.
When a core performs an atomic write, it inserts the accessed address into its Bloom filter. Thus, at the end of a critical section, all addresses modified in the section are recorded in the core's filter; that is, their entries are nonzero. In Figure 2, on each store request to a, b, and c in the lightly shaded critical sections, the Bloom filters on C1 or C2 are updated. The second critical-section phase on C2 does not update C2's Bloom filter because it does not have atomic writes.
On an acquire, the access signature at the releaser is transferred to the acquirer together with the lock. As a result, all modifications preceding the release associated with the acquire are made visible to the acquirer. The acquirer, on receiving the Bloom filter, merges it into its own signature with a set union.
[Figure 2 trace: one core executes Acquire x; St a; St b; Release x, setting filter entries for a and b; the lock and signature then transfer to C2, which executes Acquire x; Ld a; Ld b; St b; St c; Release x, and a second critical-section pass repeats the loads without atomic writes. The legend marks set insertion, set union, no change, self-invalidation, and prefetch hit events.]
Figure 2. An example of propagating atomic writes using access signatures: a code snippet
and an execution with two cores, C1 and C2. Assume a and b are in the same cache line.
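The signature operations can be sketched in a few lines of C. The filter size below matches the 256-bit configuration evaluated later, but the two hash functions are placeholders rather than the H3 functions used in the evaluation:

#include <stdbool.h>
#include <stdint.h>

#define SIG_BITS 256
typedef struct { uint64_t bits[SIG_BITS / 64]; } signature_t;

/* Placeholder hashes mapping an address to a bit index in 0..255. */
static unsigned hash1(uint64_t a) { return (a * 0x9e3779b97f4a7c15ull) >> 56; }
static unsigned hash2(uint64_t a) { return (a * 0xc2b2ae3d27d4eb4full) >> 56; }

static void sig_set(signature_t *s, unsigned b) { s->bits[b / 64] |= 1ull << (b % 64); }
static bool sig_get(const signature_t *s, unsigned b) { return (s->bits[b / 64] >> (b % 64)) & 1; }

/* On an atomic write in a critical section: record the address. */
void sig_insert(signature_t *s, uint64_t addr) {
    sig_set(s, hash1(addr));
    sig_set(s, hash2(addr));
}

/* On a lock hand-off: the acquirer merges the releaser's signature. */
void sig_union(signature_t *dst, const signature_t *src) {
    for (int i = 0; i < SIG_BITS / 64; i++)
        dst->bits[i] |= src->bits[i];
}

/* On a later access: a hit means the data may be stale, so the core
   self-invalidates its cached copy (false positives are safe). */
bool sig_may_be_stale(const signature_t *s, uint64_t addr) {
    return sig_get(s, hash1(addr)) && sig_get(s, hash2(addr));
}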
Evaluation
For our evaluations, we use the Simics full-system functional simulator with the Wisconsin GEMS (General Execution-driven Multiprocessor Simulator) memory timing simulator and the Princeton Garnet network simulator. Table 1 shows the key parameters of our simulated systems.
Table 1. Key parameters of the simulated systems.

Core frequency: 2 GHz
Number of cores: 16
L1 data cache: 64 Kbytes, 64 bytes per line
L1 hit latency: 1 cycle
L2 hit latency: 29 to 61 cycles (bank-dependent)
Remote L1 hit latency: 35 to 83 cycles
Bloom filter: 256 bits, 4 H3 hash functions
Results
Figure 3 shows the execution time and network traffic of our applications for MESI, DeNovoND with the idealized infinite Bloom filter (DInf), and DeNovoND with a 256-bit Bloom filter (D256). For MESI, we use the POSIX Threads mutex library for locks, because implementing distributed queue-based locks on MESI involves significant complexity to deal with numerous transient states.
MAY/JUNE 2014
micro
IEEE
M
q
M
q
M
q
M
q
MQmags
q
145
Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page
M
q
M
q
M
q
M
q
MQmags
q
micro
IEEE
Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page
M
q
M
q
M
q
M
q
MQmags
q
..............................................................................................................................................................................................
TOP PICKS
[Figure 3 plots, for each application (Barnes, Ocean, Water, Fluidanimate, Streamcluster, TSP, K-means, SSCA2), bars for MESI, DInf, and D256: (a) execution time, split into compute time and stall components including barrier stalls, and (b) network traffic, split into load, store, queue lock/unlock, writeback, and invalidation messages.]
Barnes, Ocean, and Water are benchmarks in the SPLASH-2 suite. Fluidanimate and Streamcluster are PARSEC benchmarks.
TSP: Traveling-Salesman Problem. K-means and SSCA2 are from the STAMP benchmark suite.
Figure 3. Performance of DeNovoND versus MESI: execution time (a) and network traffic (b). All bars are normalized to MESI.
In (a), each bar is divided into compute time, stall time due to data memory accesses, barrier sync time, and lock acquire time.
The bars in (b) are divided by message type: load, store, queue lock/unlock, writeback, and invalidation.
Store traffic is reduced in some applications owing to DeNovo's write-validate policy (a store miss does not bring in the cache line).
The net reduction in load misses (memory time) due to the lack of false sharing directly leads to lower load traffic in several applications.
Load traffic is further reduced because a load response only contains valid or registered words of a cache line. Because the coherence state is preserved per word, some words may be invalid at the servicing cache.
References
12. B.N. Bershad, M.J. Zekauskas, and W.A. Sawdon, "The Midway Distributed Shared Memory System," Compcon Digest of Papers, 1993, pp. 528-537.
Awards
A meandering path
Coming of age at the beginning of the Vietnam War, I spent my twenties simultaneously protesting the war and supporting it by assisting the U.S. Navy in deploying computers and navigation systems. This deep internal conflict had many consequences and affected important decisions later in life. But at Texas Instruments I learned a lot about how real computers worked from Tom Stringfellow, Quitman Liner, and many others.
Working as an engineer while a graduate student in the 1970s, I was involved in the development of third-party memory systems attaching to IBM mainframes. I was fortunate that upstart Intel allowed me a 30-hour work week during more than five years' employment. I learned about memory systems by examining detailed designs of third-party memory systems for the IBM System/370 family, struggling particularly with the problems of how to hold off a
the same project, Wei-chung Hsu recognized that conflicts between register allocation and code scheduling could be handled best by doing both at once, resulting in work2 I'm very proud to claim despite my minimal contribution.
Working with Mary Vernon and me, Steve Scott did some excellent work evaluating the Scalable Coherent Interface (SCI) ring.3 Working with IEEE standards committees (Futurebus and SCI) over the next 10 years, I learned a lot about cache coherence protocols, and with Stefanos Kaxiras, developed a strong belief that such protocols could be extended without falling back to a scalable but slow directory-based scheme. Gradually it emerged that caches were highly effective for the sharing of data, particularly if things could get out of order, but that locks and critical sections could exhibit disastrous memory behavior for cache-based memory systems.
Steve Scott also came up with the brilliant notion of pruning caches, exploring the novel concept of a distributed directory (or cache) that remembered regions of the network where a line was not cached. The concept has since come up repeatedly in my work toward scalable, non-directory-based cache consistency, but this work is rarely referenced, perhaps because it ended up in an IEEE journal4 after multiple conference rejections.
I delved into locks and memory ordering, working with Phil Woest and Mary Vernon to propose the concept of building hardware queues to avoid many of the problems associated with spinlocks. We initially called this Queue-on-Sync-Bit (QOSB, pronounced Cosby), but soon renamed it Queue-on-Lock-Bit (QOLB, pronounced Colby, after the Wisconsin town responsible for a common cheese). This work5 inspired Michael Scott and John Mellor-Crummey to propose the popular MCS lock, a software-built queue that captured much of the benefit of QOLB.6 Meanwhile, Alain Kagi and Doug Burger analyzed the potential for QOLB,7 concluding that it could be effective, but required sophisticated and disciplined programming, as if programming SMPs wasn't hard enough.
150
micro
IEEE
IEEE MICRO
Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page
M
q
M
q
M
q
M
q
MQmags
q
micro
IEEE
Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page
M
q
M
q
M
q
M
q
MQmags
q
References
2. J.R. Goodman and W.-C. Hsu, "Code Scheduling and Register Allocation in Large Basic Blocks," Proc. 2nd Int'l Conf. Supercomputing, 1988, pp. 442-452.
3. S.L. Scott, J.R. Goodman, and M.K. Vernon, "Performance of the SCI Ring," Proc. 19th Ann. Int'l Symp. Computer Architecture (ISCA 92), 1992, pp. 403-414.
5. J.R. Goodman, M.K. Vernon, and P.J. Woest, "Efficient Synchronization Primitives for Large-Scale Cache-Coherent Multiprocessors," Proc. 3rd Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 1989.
10. R. Rajwar and J.R. Goodman, "Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution," Proc. 34th Ann. Int'l Symp. Microarchitecture, 2001, pp. 294-305.
11. J.R. Goodman and H.H.J. Hum, "MESIF: A Two-Hop Cache Coherency Protocol for Point-to-Point Interconnects (2004)," tech report, Univ. of Auckland, https://researchspace.auckland.ac.nz/bitstream/handle/2292/11593/MESIF-2004.pdf?sequence=7.
12. J.R. Goodman and H.H.J. Hum, "MESIF: A Two-Hop Cache Coherency Protocol for Point-to-Point Interconnects (2009)," tech report, Univ. of Auckland, https://researchspace.auckland.ac.nz/bitstream/handle/2292/11594/MESIF-2009.pdf?sequence=6.
13. F. Tabba et al., "NZTM: Non-Blocking Zero-Indirection Transactional Memory," Proc. 21st Ann. Symp. Parallelism in Algorithms and Architectures, 2009.
Micro Economics
Two-way relationship
A person's relationship with his or her university does not end with graduation. Some alumni retain their ties, spend time with other alumni, and define their social existence around their experience.
Universities know about this behavior in other ways. Some alumni with disposable income and wealth make major donations to universities. Those funds contribute to buildings (to which the donors attach their names), as well as research institutes inside buildings (which also display eponymous labels). Those funds can make an enormous difference to researchers, freeing their time for experiments.
Private universities have much longer track records of success tapping into their alumni base for donations than public schools, but savvy public schools have adopted similar practices more recently. Think of the public institutions with large, loyal, and financially successful alumni, such as the great state institutions in Berkeley, Los Angeles, Ann Arbor, Champaign, or Austin.
Not that this is easy to actually pull off. Business can have a hard time buying friends in academics, and both sides can easily mess up the relationship. The two partners seem not to have the same perception of the universe.
Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page
M
q
M
q
M
q
M
q
MQmags
q
micro
IEEE
Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page
M
q
M
q
M
q
M
q
MQmags
q
________________________________
micro
IEEE
Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page
M
q
M
q
M
q
M
q
MQmags
q