IEEE Micro
The magazine for chip and silicon systems designers
http://www.computer.org/micro

May/June 2014, Volume 34, Number 3

The Academic and Business Marriage, p. 152

Features

4    Guest Editors' Introduction: Top Picks from the 2013 Computer Architecture Conferences
     Mithuna S. Thottethodi and Shubu Mukherjee

     Designing and Managing Datacenters Powered by Renewable Energy
     Íñigo Goiri, William Katsak, Kien Le, Thu D. Nguyen, and Ricardo Bianchini

17   Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters with Paragon
     Christina Delimitrou and Christos Kozyrakis

31   A Case for Specialized Processors for Scale-Out Workloads
     Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi

43   Smart: Single-Cycle Multihop Traversals over a Shared Network on Chip
     Tushar Krishna, Chia-Hsin Owen Chen, Woo-Cheol Kwon, and Li-Shiuan Peh

57   Networks on Chip with Provable Security Properties
     Hassan M.G. Wassel, Ying Gao, Jason K. Oberg, Ted Huffmire, Ryan Kastner, Frederic T. Chong, and Timothy Sherwood

69   Cache Coherence for GPU Architectures
     Inderpreet Singh, Arrvindh Shriraman, Wilson W.L. Fung, Mike O'Connor, and Tor M. Aamodt

80   A Configurable and Strong RAS Solution for Die-Stacked DRAM Caches
     Jaewoong Sim, Gabriel H. Loh, Vilas Sridharan, and Mike O'Connor

91   Decoupled Compressed Cache: Exploiting Spatial Locality for Energy Optimization
     Somayeh Sardashti and David A. Wood

100  Sonic Millip3De: An Architecture for Handheld 3D Ultrasound
     Richard Sampson, Ming Yang, Siyuan Wei, Chaitali Chakrabarti, and Thomas F. Wenisch

109  Hardware Partitioning for Big Data Analytics
     Lisa Wu, Raymond J. Barker, Martha A. Kim, and Kenneth A. Ross

120  Efficient Spatial Processing Element Control via Triggered Instructions
     Angshuman Parashar, Michael Pellauer, Michael Adler, Bushra Ahsan, Neal Crago, Daniel Lustig, Vladimir Pavlov, Antonia Zhai, Mohit Gambhir, Aamer Jaleel, Randy Allmon, Rachid Rayess, Stephen Maresh, and Joel Emer

138  DeNovoND: Efficient Hardware for Disciplined Nondeterminism
     Hyojin Sung, Rakesh Komuravelli, and Sarita V. Adve

Departments

2    From the Editor in Chief
     Top Picks from 2013

149  Awards
     Reflections from the 2013 Eckert-Mauchly Award Recipient

152  Micro Economics
     The Academic and Business Marriage

Cover artwork by Giacomo Marchesi, www.GiacomoMarchesi.com


IEEE Micro (ISSN 0272-1732) is published bimonthly by the IEEE Computer Society.
IEEE Headquarters, Three Park Ave., 17th Floor, New York, NY 10016-5997; IEEE
Computer Society Headquarters, 2001 L St., Ste. 700, Washington, DC 20036; IEEE
Computer Society Publications Office, 10662 Los Vaqueros Circle, PO Box 3014,
Los Alamitos, CA 90720. Annual subscription rates: IEEE Computer Society members
get the lowest rates, US$45 (print and electronic). Go to http://www.computer.org/subscribe to order and for more information on other subscription prices. Back issues:
members, $20; nonmembers, $148. This magazine is also available on the Web.
Postmaster: Send address changes and undelivered copies to IEEE, Membership
Processing Dept., 445 Hoes Ln., Piscataway, NJ 08855. Periodicals postage is paid
at New York, NY, and at additional mailing offices. Canadian GST #125634188.
Canada Post Corp. (Canadian distribution) Publications Mail Agreement #40013885.
Return undeliverable Canadian addresses to 4960-2 Walker Road; Windsor, ON N9A
6J3. Printed in USA.
Reuse rights and reprint permissions: Educational or personal use of this material is
permitted without fee, provided such use: 1) is not made for profit; 2) includes this notice
and a full citation to the original work on the first page of the copy; and 3) does not imply
IEEE endorsement of any third-party products or services. Authors and their companies
are permitted to post the accepted version of IEEE-copyrighted material on their own
web servers without permission, provided that the IEEE copyright notice and a full
citation to the original work appear on the first screen of the posted copy. An accepted
manuscript is a version which has been revised by the author to incorporate review
suggestions, but not the published version with copy-editing, proofreading, and formatting added by IEEE. For more information, please go to
http://www.ieee.org/publications_standards/publications/rights/paperversionpolicy.html.
Permission to reprint/republish this material for commercial, advertising, or promotional purposes or for creating new collective works for resale or redistribution must be
obtained from IEEE by writing to the IEEE Intellectual Property Rights Office,
445 Hoes Lane, Piscataway, NJ 08854-4141 or pubs-permissions@ieee.org.
Copyright © 2014 IEEE. All rights reserved.
Abstracting and library use: Abstracting is permitted with credit to the source.
Libraries are permitted to photocopy for private use of patrons, provided the
per-copy fee indicated in the code at the bottom of the first page is paid through
the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
Editorial: Unless otherwise stated, bylined articles, as well as product and service descriptions, reflect the author's or firm's opinion. Inclusion in IEEE Micro does not necessarily
constitute an endorsement by IEEE or the Computer Society. All submissions are subject to
editing for style, clarity, and space. IEEE prohibits discrimination, harassment, and bullying.
For more information, visit http://www.ieee.org/web/aboutus/whatis/policies/p9-26.html.

EDITOR IN CHIEF
Erik R. Altman
Thomas J. Watson Research Center
ealtman@us.ibm.com
ASSOCIATE EDITOR IN CHIEF
Lieven Eeckhout
Ghent University
lieven.eeckhout@ugent.be
ADVISORY BOARD
David H. Albonesi, Pradip Bose, Kemal Ebcioglu,
Michael Flynn, Ruby B. Lee, Yale Patt, James E.
Smith, and Marc Tremblay
EDITORIAL BOARD
Alper Buyuktosunoglu
IBM
Pradeep Dubey
Intel Corp.
Sandhya Dwarkadas
University of Rochester
Babak Falsafi
Ecole Polytechnique Federale de Lausanne
Krisztian Flautner
ARM
R. Govindarajan
Indian Institute of Science
Shane Greenstein
Northwestern University
Lizy Kurian John
University of Texas at Austin
Stephen W. Keckler
University of Texas at Austin
Margaret Martonosi
Princeton University
Richard Mateosian
Shubu Mukherjee
Cavium Networks
Toshio Nakatani
IBM
Vojin G. Oklobdzija
New Mexico State University
Ronny Ronen
Intel Corp.
Kevin W. Rudd
US Naval Academy
Andre Seznec
INRIA Rennes
Richard H. Stern
Olivier Temam
INRIA
Mateo Valero
Technical University of Catalonia
Tilman Wolf
University of Massachusetts, Amherst
Xiaodong Zhang
Ohio State University

EDITORIAL STAFF
Editorial Management
Molly Gamborg
Contributing Editors
Amber Ankerholz, Thomas Centrella,
Kristine Kelly, Keri Schreiner,
Dale Strok, and Joan Taylor


Director, Products & Services


Evan Butterfield
Senior Manager, Editorial Services
Robin Baldwin
Associate Manager, Peer Review & Periodical
Administration
Hilda Carman
Senior Business Development Manager
Sandra Brown
Senior Advertising Coordinator
Marian Anderson
EDITORIAL OFFICE
PO Box 3014, Los Alamitos, CA 90720;
(714) 821-8380; r.baldwin@computer.org

Submissions:
https://mc.manuscriptcentral.com/micro-cs
Author guidelines:
http://www.computer.org/micro
IEEE COMPUTER SOCIETY
PUBLICATIONS BOARD
Vice President
Jean-Luc Gaudiot
Magazine Operations Chair
Paolo Montuschi
Transactions Operations Committee
Laxmi N. Bhuyan
Digital Library Operations Committee
Frank Ferrante
Plagiarism Chair
David S. Ebert
Executive Director
Angela R. Burgess
Members-at-Large
Alain April, Greg Byrd, Robert Dupuis,
Linda I. Shafer, H.J. Siegel, and Per Stenstrom


COMPUTER SOCIETY MAGAZINE


OPERATIONS COMMITTEE
Paolo Montuschi (Chair)
Erik R. Altman, Maria Ebling, Miguel Encarnacao,
Lars Heide, Cecilia Metra, San Murugesan, Shari
Lawrence Pfleeger, Michael Rabinovich, Yong Rui,
Forrest Shull, George K. Thiruvathukal, Ron Vetter,
and Daniel Zeng


From the Editor in Chief

Top Picks from 2013

ERIK R. ALTMAN
Thomas J. Watson Research Center

This double issue features our annual Top Picks from the microarchitecture conferences held in 2013. I thank Guest Editors Mithuna S. Thottethodi and Shubu Mukherjee for their outstanding job in all aspects of running the Program Committee and arriving at these selections. I am also happy to report that we received a record 101 submissions, from which 12 were selected for publication here.
Like last year, it seemed an interesting exercise to compare the topics of the 2013 Top Picks articles with the topics covered in the inaugural 2003 Top Picks issue. In 2003, Guest Editors Charles Moore, Kevin W. Rudd, Ruby B. Lee, and Pradip Bose divided articles into six categories. I have assigned this year's articles to those same six categories, as shown in Table 1. In doing so, only one article did not seem a good fit to any of the 2003 categories. That article focuses on datacenters, and in 2003 there was no datacenter or cloud computing category. (Other articles this year also touch on datacenters, but have aspects that fit within 2003 categories.) The inability to continue Dennard scaling has yielded a major increase in articles in the "Unconventional architectures" category, whereas "Building on conventional microarchitectures" dropped to zero articles, as did "Performance analysis," with the other categories staying roughly similar.
There is sometimes confusion about how the Top Picks articles published here differ from the original conference publications. Like all IEEE publications, IEEE Micro requires at least 30 percent new content over any previous publication. Top Picks articles generally meet this requirement via a three-page summary (in the initial submission) that summarizes the paper and argues for the potential of the work to have long-term impact. (Indeed, for the upcoming Top Picks to be published in 2015, Program Committee Chairs and Guest Editors Luis Ceze and Karin Strauss ask what the citation of your paper would be if it won the "test of time" award in 10 years.) In addition, IEEE Micro has a 5,000-word limit, so authors often have to condense their original paper. As a result, the IEEE Micro version of Top Picks papers generally provides more context and a slightly higher-level overview of the work, with the original conference paper serving as a deeper reference for readers interested in more detail. This approach inverts the historical practice of journals providing a more detailed record of conference papers, but we think that this Top Picks approach has served IEEE Micro well.

This Top Picks issue is also unique among IEEE Micro editions (and possibly among all IEEE Computer Society publications) in that the Manuscript Central/ScholarOne reviewing system is not used for initial submissions. Instead, the Program Chairs deploy their preferred reviewing system. Papers recommended by the Program Committee for acceptance are then entered into Manuscript Central for the final stages of processing. This separate reviewing system makes it easier to manage the large volume of submissions.
Why go into this detail about reviewing software? The IEEE Computer Society constantly works with Thomson Reuters, the owner of ScholarOne, to improve its capabilities. As part of that effort, ScholarOne maintains two websites, one to suggest ideas for its reviewing system and one to vote on the suggestions of others:

- Offer suggestions: http://scholaroneideas.force.com/ideaListCustom
- Rate ideas of others: http://mchelp.manuscriptcentral.com/ScholarOneIdeas/howto.html

I encourage any of you who author articles for IEEE Micro, or who serve as reviewers, to visit these sites and help improve ScholarOne.


Table 1. Mapping 2013 Top Picks articles to 2003 Top Picks categories.

Unconventional architectures:
- A Case for Specialized Processors for Scale-Out Workloads
- Smart: Single-Cycle Multihop Traversals over a Shared Network on Chip
- Efficient Spatial Processing Element Control via Triggered Instructions
- DeNovoND: Efficient Hardware for Disciplined Nondeterminism
- Networks on Chip with Provable Security Properties
- Sonic Millip3De: An Architecture for Handheld 3D Ultrasound

Power- and temperature-aware design:
- Hardware Partitioning for Big Data Analytics
- Designing and Managing Datacenters Powered by Renewable Energy
- Decoupled Compressed Cache: Exploiting Spatial Locality for Energy Optimization*

Reliability:
- A Configurable and Strong RAS Solution for Die-Stacked DRAM Caches*

Cache, memory, and multiprocessor optimizations:
- Cache Coherence for GPU Architectures
- Decoupled Compressed Cache: Exploiting Spatial Locality for Energy Optimization*
- A Configurable and Strong RAS Solution for Die-Stacked DRAM Caches*

Building on conventional microarchitectures: N/A

Performance analysis: N/A

None of the above:
- Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters with Paragon

* These articles fit in two categories from 2003.

Finally, this issue continues our recent practice, led by Associate Editor in Chief Lieven Eeckhout, of noting major awards. More specifically, this issue includes a column by James Goodman about the work that led to his Eckert-Mauchly Award. Jim has many interesting and broad-ranging observations about his life and career, and I hope you will enjoy it as much as I did.

With that, as with the Top Picks articles, happy reading!

Erik R. Altman
Editor in Chief
IEEE Micro

Erik R. Altman is the manager of the Dynamic Optimization Group at the Thomas J. Watson Research Center. Contact him at ealtman@us.ibm.com.


Guest Editors' Introduction

TOP PICKS FROM THE 2013 COMPUTER ARCHITECTURE CONFERENCES

Mithuna S. Thottethodi
Purdue University

Shubu Mukherjee
Cavium

It gives us great pleasure to introduce the special issue of the top picks from the computer architecture conferences of 2013. The special issue presents a selection of 12 papers that describe novel, exciting research directions in areas as diverse as the design of datacenters, processors and accelerators, networks on chip, programmability-enhancing frameworks, and emerging large caches.

The review process

We received a total of 101 submissions. The full program committee of 30 members (see the sidebar "The Selection Committee") reviewed all submissions. Each paper received at least four reviews (with many receiving five) from the program committee. In cases where one Selection Committee chair had a conflict of interest with a submission, the other chair handled the review assignments; there were no papers on which both Selection Committee chairs had conflicts. In addition to the Selection Committee reviews, four external reviews were also sought for unique cases where we felt specific outside expertise was needed. Papers with high variance in scores were also targeted for additional online discussion and, in some cases, additional reviews. We thank the committee and the external reviewers for their time and effort toward this valuable service to the computer architecture community.

Note that, in addition to papers published in 2013, selected papers published in 2012 were also eligible for inclusion in this year's issue of Top Picks because of the conflict-handling rules of Top Picks. Under these rules, the Selection Committee chairs may not submit their own papers in the year they serve as chair; however, their papers are eligible for full consideration in the following year.

We selected 41 top-ranked papers (based on the average overall merit score for each paper) for discussion at the PC meeting. Furthermore, to minimize the impact of variations in reviewer generosity, we verified that the 41 papers included the top-ranked papers of most individual committee members. We encouraged the committee to champion other papers for discussion that may have been among the top papers in their assigned reviews if such papers had not automatically qualified for discussion based on the overall score. Consequently, one additional paper was added to the discussion list, taking the total to 42.

The Selection Committee discussed all 42 papers at a meeting in Boston on 10 January (with 28 members attending physically and two participating via teleconference). Committee members with conflicts left the room before those papers were discussed. The meeting was conducted in two phases. In the first phase, the committee voted to accept or reject papers without regard to the total number of papers, with the explicit understanding that we might overshoot the target. In the second phase, the committee revisited the shortlisted papers to arrive at the final list of 12 papers (see the "Top Picks of 2013" sidebar). We congratulate the authors on this well-deserved accolade.


The Selection Committee

- Tor Aamodt, University of British Columbia
- David Albonesi, Cornell University
- David August, Princeton University
- Rajeev Balasubramonian, University of Utah
- Pradip Bose, IBM
- Doug Burger, Microsoft
- John Carter, IBM
- Joel Emer, Intel and Massachusetts Institute of Technology
- Babak Falsafi, Ecole Polytechnique Federale de Lausanne
- Antonio Gonzalez, Intel
- Sudhanva Gurumurthi, University of Virginia and Advanced Micro Devices
- Dan Jimenez, Texas A&M University
- David Kaeli, Northeastern University
- Alvin Lebeck, Duke University
- Hsien-Hsin Lee, Georgia Institute of Technology
- Gabriel Loh, Advanced Micro Devices
- Margaret Martonosi, Princeton University
- Kathryn McKinley, Microsoft and University of Texas at Austin
- Milo Martin, University of Pennsylvania
- Trevor Mudge, University of Michigan
- Satish Narayanaswamy, University of Michigan
- Eric Rotenberg, North Carolina State University
- Karu Sankaralingam, University of Wisconsin-Madison
- Yanos Sazeides, University of Cyprus
- Simha Sethumadhavan, Columbia University
- Andre Seznec, INRIA
- Dan Sorin, Duke University
- Dean Tullsen, University of California, San Diego
- T.N. Vijaykumar, Purdue University
- Sudhakar Yalamanchili, Georgia Institute of Technology

The selected papers

The selected papers are responsive to many of the pressing problems that we face today. The emergence of cloud computing, fueled by social media networks, is leading to innovations in datacenters. The continuous need to improve the energy efficiency of these clouds of processors, memory, and disks has led to high performance-per-watt mechanisms, such as accelerator engines, better scheduling of datacenter resources, and new styles of processor, cache, memory, and network design more suited to datacenters and future workloads. Security continues to be an overriding concern in this world of public clouds and mobile computing, which has led to innovation in the security architecture of today's processors. As co-chairs of this IEEE Micro Top Picks issue, we are excited to present to our audience a glimpse of how architects envision solving today's challenging computing problems.

Maximizing the use of renewable energy to power these large datacenters is important from a sustainability perspective. "Designing and Managing Datacenters Powered by Renewable Energy" by Íñigo Goiri et al. responds to this challenge by developing strategies to optimally use renewable energy from sources that fall under the commonly used colocation/self-generation model.
In addition to energy efficiency, it is important to efficiently schedule available hardware resources to maximize performance in datacenters, especially in challenging environments where hardware is typically heterogeneous (due to rolling upgrades) and application performance is interference prone. In "Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters with Paragon," Christina Delimitrou and Christos Kozyrakis develop a novel, scalable scheduling technique that is heterogeneity- and interference-aware and significantly boosts performance compared to an oblivious scheduling approach.

Although the computing landscape has changed dramatically from a desktop-and-local-software regime to cloud-based computing, processor designs have more or less remained the same. "A Case for Specialized Processors for Scale-Out Workloads" by Michael Ferdman et al. argues that there is a mismatch between modern processor hardware and the requirements of emerging cloud workloads. This work suggests directions in processor design for emerging cloud workloads. (The conference version of this paper was published in 2012, but it was eligible for Top Picks this year, per the conflict-handling rules we described earlier.)


Top Picks of 2013

- "Designing and Managing Datacenters Powered by Renewable Energy" by Íñigo Goiri, William Katsak, Kien Le, Thu D. Nguyen, and Ricardo Bianchini
- "Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters with Paragon" by Christina Delimitrou and Christos Kozyrakis
- "A Case for Specialized Processors for Scale-Out Workloads" by Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi
- "Smart: Single-Cycle Multihop Traversals over a Shared Network on Chip" by Tushar Krishna, Chia-Hsin Owen Chen, Woo-Cheol Kwon, and Li-Shiuan Peh
- "Networks on Chip with Provable Security Properties" by Hassan M.G. Wassel, Ying Gao, Jason K. Oberg, Ted Huffmire, Ryan Kastner, Frederic T. Chong, and Timothy Sherwood
- "Cache Coherence for GPU Architectures" by Inderpreet Singh, Arrvindh Shriraman, Wilson W.L. Fung, Mike O'Connor, and Tor M. Aamodt
- "A Configurable and Strong RAS Solution for Die-Stacked DRAM Caches" by Jaewoong Sim, Gabriel H. Loh, Vilas Sridharan, and Mike O'Connor
- "Decoupled Compressed Cache: Exploiting Spatial Locality for Energy Optimization" by Somayeh Sardashti and David A. Wood
- "Sonic Millip3De: An Architecture for Handheld 3D Ultrasound" by Richard Sampson, Ming Yang, Siyuan Wei, Chaitali Chakrabarti, and Thomas F. Wenisch
- "Hardware Partitioning for Big Data Analytics" by Lisa Wu, Raymond J. Barker, Martha A. Kim, and Kenneth A. Ross
- "Efficient Spatial Processing Element Control via Triggered Instructions" by Angshuman Parashar, Michael Pellauer, Michael Adler, Bushra Ahsan, Neal Crago, Daniel Lustig, Vladimir Pavlov, Antonia Zhai, Mohit Gambhir, Aamer Jaleel, Randy Allmon, Rachid Rayess, Stephen Maresh, and Joel Emer
- "DeNovoND: Efficient Hardware for Disciplined Nondeterminism" by Hyojin Sung, Rakesh Komuravelli, and Sarita V. Adve

Given that most cloud servers are multicore servers, and given the increasing importance of the network-on-chip (NoC) fabric in such servers (the NoC latency is on every L1 cache miss path), the performance of the NoC becomes critical. In "Smart: Single-Cycle Multihop Traversals over a Shared Network on Chip," Tushar Krishna et al. design an NoC that opportunistically bypasses multiple routers in a single cycle in the absence of contention. Under ideal conditions, the router effectively mimics the latency of a fully connected network even though packets traverse several hops.

To ensure privacy and to prevent information leakage through timing channels, it is important to provably ensure complete timing isolation. "Networks on Chip with Provable Security Properties" by Hassan M.G. Wassel et al. solves this problem for NoCs. Unlike prior QoS approaches (where a guaranteed minimum performance is adequate), the provable timing isolation shown in this article achieves stronger isolation to ensure that there are no timing interactions among different domains.

As GPUs move toward providing more sophisticated memory models, the lack of viable coherence implementations remains a stumbling block. "Cache Coherence for GPU Architectures" by Inderpreet Singh et al. argues that revisiting the idea of temporal coherence might hold the key to efficient cache coherence implementations for GPU architectures.

Die-stacked DRAM, which is on the cusp of widespread adoption, has received significant attention regarding its role in the memory hierarchy. However, little attention has been paid to its RAS characteristics. Jaewoong Sim et al., in their article "A Configurable and Strong RAS Solution for Die-Stacked DRAM Caches," show that rather than carrying over RAS solutions from traditional DRAM, novel RAS solutions customized for die-stacked DRAM are preferable.

Last-level caches are a precious resource and, as such, there is strong motivation to use compression to squeeze out more effective capacity. The article "Decoupled Compressed Cache: Exploiting Spatial Locality for Energy Optimization" by Somayeh Sardashti and David A. Wood overcomes key limitations of prior compression techniques in terms of fragmentation and tag limits by leveraging a decoupled organization.

In the context of domain-specific computing, Richard Sampson et al. develop a low-power, high-performance solution for 3D ultrasound in their article, "Sonic Millip3De: An Architecture for Handheld 3D Ultrasound." Beyond the immediate application of 3D ultrasound imaging, the article is a case study for accelerator design. The solution, which relies on hardware-algorithm codesign, develops a new accelerator architecture to bring the 3D beamforming problem within the desired performance/power envelope.

Continuing with the same theme of novel accelerators, "Hardware Partitioning for Big Data Analytics" by Lisa Wu et al. describes a low-area-overhead hardware accelerator that significantly improves data-partitioning performance for the important class of database workloads.

In "Efficient Spatial Processing Element Control via Triggered Instructions," Angshuman Parashar et al. target spatial accelerators and develop a novel approach to control flow that eliminates the performance problems associated with the program-counter-based control flow used in prior spatial accelerators and architectures.

The article "DeNovoND: Efficient Hardware for Disciplined Nondeterminism" by Hyojin Sung et al. proposes a design that simplifies coherence implementation via disciplined coding while still allowing key nondeterminism features (which is critical for lock-based codes).

We hope that you enjoy reading these articles, as well as their original conference versions, and we welcome your feedback on this issue.

Acknowledgments
We thank Erik Altman for his support. We thank the web chairs Ahmed Abdel-Gawad, Timothy Pritchett, and Eric Villasenor, who helped ensure a stable and glitch-free experience with the conference software.

Mithuna S. Thottethodi is an associate professor in the School of Electrical and Computer Engineering at Purdue University. His research interests include parallel programming, parallel architecture, interconnection networks, storage, and multicore memory hierarchies. Thottethodi has a PhD in computer science from Duke University. He is a member of IEEE and the ACM.

Shubu Mukherjee is a distinguished engineer and the lead architect for the ARMv8 processor core at Cavium. His research interests include innovation confluencing and computer architecture. Mukherjee has a PhD in computer science from the University of Wisconsin-Madison. He is a Fellow of IEEE and the ACM.

Direct questions and comments about this issue to Mithuna S. Thottethodi at mithuna@purdue.edu or to Shubu Mukherjee at shubumukherjee001@gmail.com.


DESIGNING AND MANAGING DATACENTERS POWERED BY RENEWABLE ENERGY

ON-SITE RENEWABLE ENERGY HAS THE POTENTIAL TO REDUCE DATACENTERS' CARBON FOOTPRINT AND POWER AND ENERGY COSTS. THE AUTHORS BUILT PARASOL, A SOLAR-POWERED DATACENTER, AND GREENSWITCH, A SYSTEM FOR SCHEDULING WORKLOADS, TO EXPLORE THIS POTENTIAL IN A CONTROLLED RESEARCH SETTING.

Íñigo Goiri
William Katsak
Kien Le
Thu D. Nguyen
Ricardo Bianchini
Rutgers University


Datacenters range from a few servers in a machine room to thousands of servers housed in warehouse-size installations.1 Estimates for 2010 indicate that, collectively, datacenters consume around 1.5 percent of the total electricity used worldwide.1 This translates into high carbon emissions, as most of this electricity comes from fossil fuels. A 2008 study estimated that datacenters emit 116 million metric tons of carbon, slightly more than the entire country of Nigeria.2

With increasing societal demand for cleaner products and services, several companies have announced plans to build green datacenters, that is, datacenters partially or completely powered by renewables such as solar or wind energy. These datacenters will either generate their own renewable energy (self-generation) or draw it directly from an existing nearby plant (colocation). For example, Apple and McGraw-Hill have built large solar arrays for their datacenters, whereas Green House Data is a small cloud provider that operates entirely on renewables. Although there are other approaches, these examples suggest that many datacenters that seek to lower emissions will prefer colocation or self-generation. In our paper for the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2013),3 we discuss the current and expected future cost and space needs of on-site solar and wind generation.

Colocation and self-generation pose an interesting research challenge: solar and wind energy are intermittent, which requires approaches for tackling the energy supply variability. One approach is to use batteries and/or the electrical grid as a backup for the renewable energy. It might also be possible to adapt the workload (the energy demand) to match the renewable energy supply.4-8 For the highest benefits, green datacenter operators must intelligently manage their workloads and the energy sources at their disposal. For example, when the workload is deferrable (that is, it can be delayed within a time bound), it might be appropriate to delay some of the load and store the freed-up renewable energy in the batteries for later use (for example, to shave an expected load peak when the renewable energy is not available).


Figure 1. Parasol: outside view showing the solar panels, container, and air conditioning unit (a); power distribution and monitoring infrastructure (b). The cooling system can be powered solely by the grid, or by the main electrical panel that receives power from all sources. Meters (M) are available for measuring the power flowing into and out of every component.

As far as we know, green datacenter operators do not currently manage their energy sources and workloads in this manner. We set out to build software and hardware to explore these issues. This article overviews two of our main efforts: Parasol and GreenSwitch.

Parasol

Figure 1a shows Parasol, a solar-powered datacenter that we built as a research platform to study colocation and self-generation. Parasol comprises a steel structure, a small custom container housing two racks of servers and networking equipment, an air-side economizer free-cooling unit and a direct-expansion air conditioner, 16 solar panels (producing up to 3.2 kW AC), two DC/AC inverters, 16 lead-acid batteries (storing up to 32 kWh), two charge controllers, and an electricity grid tie. Parasol currently houses 64 Atom-based servers (consuming at most 30 W each), but it is large enough to house 150 of them. It uses free cooling whenever outside temperatures and humidity are low enough, and air conditioning otherwise. Parasol can use solar energy directly, store it in its batteries, or feed it to the grid for credit (net metering). We thought about adding a wind turbine to Parasol, but historical weather data shows that our location (Piscataway, N.J.) is not windy enough.
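For a rough sense of scale, the capacities quoted above already bound how long Parasol could ride through a grid outage on batteries alone. The back-of-envelope sketch below is ours, not the authors'; it uses only the numbers above and deliberately ignores conversion losses, battery depth-of-discharge limits, and the cooling load, so the figures are optimistic upper bounds.

```python
# Back-of-envelope bounds using only the capacities quoted for Parasol above.
# Assumptions (ours): no conversion losses, full battery depth of discharge,
# and no cooling load.

SOLAR_PEAK_KW = 3.2      # 16 solar panels, up to 3.2 kW AC
BATTERY_KWH = 32.0       # 16 lead-acid batteries, up to 32 kWh
SERVERS = 64             # Atom-based servers currently installed
WATTS_PER_SERVER = 30    # at most 30 W each

it_load_kw = SERVERS * WATTS_PER_SERVER / 1000.0      # ~1.9 kW peak IT load
hours_full_load = BATTERY_KWH / it_load_kw            # ~16.7 h on batteries alone
hours_half_asleep = BATTERY_KWH / (it_load_kw / 2)    # ~33 h with half the servers asleep

print(f"Peak IT load: {it_load_kw:.2f} kW")
print(f"Battery-only runtime at full IT load: {hours_full_load:.1f} h")
print(f"Battery-only runtime with half the servers asleep: {hours_half_asleep:.1f} h")
```

These bounds are at least consistent with the Hurricane Sandy episode described later, where Parasol rode out a grid outage of more than 20 hours with half of its machines asleep and solar generation helping during daylight.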
Figure 1b shows Parasol's power distribution and monitoring infrastructure. Because Parasol was built as a research instrument for studying power management in green datacenters, it is critical that we understand the power usage of each component, as well as power losses. Thus, we have power meters (labeled M in the figure), either internal to components (for example, the DC/AC inverters) or added on externally (for example, the cooling-system meter), for measuring the power flowing into and out of every component. Parasol also includes a switch that allows for powering the cooling system from the main electrical panel or only from the grid. This enables experimentation with or without the cooling system loading the solar system and batteries.


Figure 2. Energy consumption, net metering, and temperatures from April 2012 to January 2014. The figure shows the seasonal patterns for both renewable energy generation and temperature.

We describe our rationale for the Parasol design and the mistakes we made while building it over 16 months (at a total cost of $300,000) in our ASPLOS paper.3 In this article, we report on data gathered from operating Parasol over 22 months. Specifically, solar generation and the IT equipment became operational in April 2012, and Parasol became fully operational in June 2012.

Energy production and usage


Figure 2 shows energy usage, net-metered energy, and the average inside and outside temperatures from April 2012 to January 2014. We computed a power usage effectiveness (PUE) of 1.06 to 1.08, depending on the computing load, owing to losses from various conversions. April through June 2012 show little or no grid energy consumption, because the external meters did not become operational until the end of June 2012. Note that total solar energy production is the sum of the solar energy consumed and the solar energy net metered. This data shows that during the summer months Parasol produces more than 500 kWh every month, whereas during the winter this production is reduced to less than half. For the year spanning July 2012 through June 2013, we computed an average solar capacity factor of 16 percent. During this time, Parasol supported workloads used for studying GreenSwitch and six other research projects.

Interestingly, grid energy consumption in July 2012 was significantly lower than in other months because we were experimenting with GreenSwitch, transitioning machines to sleep, and using batteries (charged with solar energy) to reduce brown energy consumption. Starting in November 2012, we raised the internal setpoint temperature from 27°C to 30°C.
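For readers unfamiliar with the two metrics quoted in this section, the short sketch below spells out their standard definitions (PUE as total facility energy over IT energy; capacity factor as energy actually produced over what the array would produce at rated power around the clock). The numbers plugged in are the ones reported above, except for the sample facility/IT energies, which are illustrative; the helper functions are ours.

```python
# Standard definitions of the two metrics used in this section.
# The helpers are ours; sample facility/IT energies are illustrative.

def pue(total_facility_kwh: float, it_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT energy."""
    return total_facility_kwh / it_kwh

def capacity_factor(energy_kwh: float, rated_kw: float, hours: float) -> float:
    """Actual energy produced / energy at rated output over the same period."""
    return energy_kwh / (rated_kw * hours)

# A PUE of ~1.06 means conversion losses add about 6 percent on top of IT energy,
# e.g., a facility that drew 477 kWh while its IT gear consumed 450 kWh.
print(f"PUE: {pue(total_facility_kwh=477.0, it_kwh=450.0):.2f}")

# Parasol's 3.2 kW array at a 16 percent capacity factor averages roughly
# 0.16 * 3.2 kW * 730 h ~= 374 kWh per month, consistent with >500 kWh in
# summer and less than half of that in winter.
rated_kw, hours_per_month = 3.2, 730.0
avg_monthly_kwh = 0.16 * rated_kw * hours_per_month
print(f"Average monthly production implied by a 16% capacity factor: {avg_monthly_kwh:.0f} kWh")
```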

Cooling

Figure 3 shows the operation of the cooling system in Parasol during the second half of August 2012. In this time period, the setpoint for the internal temperature was 30°C; the dashed line shows the actual internal temperature, whereas the solid line shows the outside temperature. The light gray area shows the operation of the free-cooling unit, whereas the dark gray area shows the operation of the air conditioner. Note that even though this time period is in the summer, the air conditioner only ran during two days, when the outside temperatures exceeded 30°C. Much of the time, the free-cooling unit ran below 25 percent fan speed.

The average PUE when including both conversion losses and cooling overheads for Parasol has been lower than 1.13, showing that free cooling is very effective at keeping cooling overheads low. The air conditioner has run for less than 20 days in a year, and less than 1 percent of the total time. Most of the time, our setpoint has been 30°C, and the typical temperatures inside Parasol (more than 95 percent of the time) have ranged between 22°C and 30°C. We have also been experimenting with novel cooling policies and pushing the limits of Parasol. During these experiments, the internal temperature at the control sensor has ranged between 15°C and 36°C.

Thus far, we have replaced five hard disk drives, two solid-state drives, and one motherboard. Although this data is not statistically significant, it is possible that our experiments have decreased the reliability of the IT equipment.

Figure 3. Cooling system operation from 15 August 2013 through 30 August 2013. The setpoint for internal temperature was 30°C; the air conditioner only ran during two days, when the outside temperature exceeded 30°C.
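The article does not spell out Parasol's exact cooling control policy; the sketch below is only our reading of the behavior described above (free cooling whenever outside air is cool and dry enough, direct-expansion air conditioning otherwise). The humidity bound is an illustrative assumption, not one of Parasol's actual control parameters.

```python
# Illustrative mode selection consistent with the behavior described above.
# The humidity bound is our assumption, not Parasol's actual control parameter.

SETPOINT_C = 30.0          # internal setpoint used for most of the deployment
MAX_HUMIDITY_PCT = 80.0    # assumed humidity limit for free cooling (hypothetical)

def cooling_mode(outside_c: float, outside_humidity_pct: float) -> str:
    """Free cooling whenever outside air is cool and dry enough; AC otherwise."""
    if outside_c < SETPOINT_C and outside_humidity_pct <= MAX_HUMIDITY_PCT:
        return "free-cooling"
    return "air-conditioner"

# Example: a 32 C afternoon forces the air conditioner on, matching the two
# hot days visible in Figure 3; a mild day stays on free cooling.
print(cooling_mode(outside_c=32.0, outside_humidity_pct=50.0))   # air-conditioner
print(cooling_mode(outside_c=24.0, outside_humidity_pct=60.0))   # free-cooling
```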



Off-grid operation: Hurricane Sandy

In late October 2012, the US East Coast was hit by Hurricane Sandy. The storm reached Rutgers University on 29 October, and the grid power and network suffered outages for more than 20 hours. Figure 4 shows the behavior of Parasol and the wind speed at our location from 28 October to 1 November. Rutgers lost power on a Monday afternoon, at the height of the measured wind speed (more than 70 km/h), and it did not come back until the afternoon of the next day. During this time, Parasol used its batteries and solar energy to operate normally (although we did transition half of the machines to sleep because they were not being used). This experience demonstrates the potential for green datacenters to operate through power outages (or in remote locations without a reliable grid power source).

Figure 4. Parasol's operation during Hurricane Sandy. Parasol used its batteries and solar energy to operate normally during a power outage of more than 20 hours.

GreenSwitch

We now discuss our research on managing Parasol. Specifically, we describe GreenSwitch, a system for scheduling workloads, selecting which source of energy to use (renewable, battery, and/or grid), and choosing the renewable energy storage medium (battery or grid) at each point in time. GreenSwitch seeks to minimize the overall cost of grid electricity (including both grid energy and peak grid power), while respecting the characteristics of the workload and battery lifetime constraints. It can also manage workloads and energy sources during grid outages.



Architecture


Figure 5 illustrates the GreenSwitch architecture. The predictor forecasts the workload and the renewable energy production one day into the future at the granularity of one hour. The solver takes these predictions and the current battery charge level as input, and outputs a workload schedule and an energy source and storage schedule. To compute these schedules, the solver uses analytical models of workload behavior, battery use, and grid electricity cost. The configurer effects the changes prescribed by the solver. The changes may involve transitioning some servers between power states and/or changing the configuration of the energy sources. (We have identified configuration parameters for the inverters and charge controllers that give us nearly full dynamic control of every source of energy available to Parasol.)

Figure 5. GreenSwitch architecture. Rectangles with round edges are data structures; rectangles with square borders are processes.

A full iteration of GreenSwitch occurs every 15 minutes, which enables it to properly control peak grid power use. (Utilities typically compute peak grid power use in windows of 15 minutes.) However, GreenSwitch checks the production of solar energy every 3 minutes. During each of these checks, GreenSwitch runs a full iteration if there has been an unexpected change in production.
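The predict/solve/configure cycle just described can be summarized structurally as in the sketch below. The class and method names are ours, the predictor, solver, configurer, and parasol objects are placeholder interfaces, the deviation threshold is an illustrative choice, and the solver's analytical models are hidden behind solve(); this is not GreenSwitch's actual implementation.

```python
import time

# Structural sketch of the cycle described above; names and the deviation
# threshold are ours, and solve() hides the real analytical models.

FULL_ITERATION_S = 15 * 60   # utilities bill peak power over 15-minute windows
SOLAR_CHECK_S = 3 * 60       # solar production is re-checked every 3 minutes
DEVIATION_KW = 0.1           # "unexpected change" threshold (illustrative)

class GreenSwitchLoop:
    def __init__(self, predictor, solver, configurer, parasol):
        # predictor/solver/configurer/parasol are placeholder interfaces here.
        self.predictor, self.solver = predictor, solver
        self.configurer, self.parasol = configurer, parasol

    def full_iteration(self):
        workload = self.predictor.predict_workload(hours=24)   # hourly, one day ahead
        solar = self.predictor.predict_solar(hours=24)
        plan = self.solver.solve(workload, solar, self.parasol.battery_charge_level())
        self.configurer.apply(plan)    # server power states, inverter/charger settings
        return plan

    def run(self):
        plan, last_full = self.full_iteration(), time.time()
        while True:
            time.sleep(SOLAR_CHECK_S)
            expected = plan.expected_solar_kw(time.time())      # placeholder accessor
            deviated = abs(self.parasol.solar_production_kw() - expected) > DEVIATION_KW
            if deviated or time.time() - last_full >= FULL_ITERATION_S:
                plan, last_full = self.full_iteration(), time.time()
```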

GreenSwitch evaluation on Parasol

We perform day-long experiments with Parasol and an implementation of GreenSwitch for the Hadoop MapReduce framework. We study two widely different Hadoop traces, called Facebook and Nutch. The former derives from a larger batch-job trace from Facebook,9 whereas the latter is the indexing part of a Web search system.10 We instantiate our models with the on-peak/off-peak grid energy prices and the peak grid power charges at our location. We assume the utility pays the wholesale price of electricity for net metering.
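The cost GreenSwitch minimizes, as described earlier, has three parts: an energy charge at on-peak or off-peak rates, a charge on the billing period's peak grid power draw, and a net-metering credit at the wholesale price. A minimal version of that accounting, with made-up tariff numbers standing in for the real prices at the authors' location, looks like this:

```python
# Minimal grid-electricity cost accounting of the kind GreenSwitch minimizes.
# All tariff numbers below are made up for illustration.

ON_PEAK_PER_KWH = 0.15      # $/kWh, hypothetical on-peak rate
OFF_PEAK_PER_KWH = 0.08     # $/kWh, hypothetical off-peak rate
PEAK_POWER_PER_KW = 13.0    # $/kW of monthly peak grid draw, hypothetical
WHOLESALE_PER_KWH = 0.03    # $/kWh credited for net-metered energy, hypothetical

def monthly_grid_cost(on_peak_kwh, off_peak_kwh, peak_grid_kw, net_metered_kwh):
    """Energy charges + peak-power charge - net-metering credit."""
    energy = on_peak_kwh * ON_PEAK_PER_KWH + off_peak_kwh * OFF_PEAK_PER_KWH
    peak = peak_grid_kw * PEAK_POWER_PER_KW
    credit = net_metered_kwh * WHOLESALE_PER_KWH
    return energy + peak - credit

# Deferring load and discharging batteries lowers both terms GreenSwitch can
# influence: grid energy drawn and the 15-minute peak grid power.
print(monthly_grid_cost(on_peak_kwh=120, off_peak_kwh=200,
                        peak_grid_kw=1.5, net_metered_kwh=150))   # 49.0
```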


Figure 6. GreenSwitch on deferrable Facebook workload. Most of the load during the night was delayed until renewable energy became available. Batteries were used when no renewable energy was available.

In the Facebook trace, jobs arrive throughout the day.9 Figure 6 shows the GreenSwitch behavior when the jobs in the trace are deferrable (each job can be delayed by up to one day), on 1 July 2012. The fill colors represent the use of the different energy sources, whereas the lines show the solar energy production (solid line), the IT load (dots), the grid energy price (dashes; y-axis on the right), and the current peak grid power draw (dashes and dots). The white fill represents solar energy that was produced but lost because of inefficiency. The figure shows that GreenSwitch transitioned many servers to sleep in the early hours of the day and deferred some of the load until solar energy was available. When there was no solar energy, GreenSwitch drew energy from the batteries, since they stored enough capacity for the load that was not deferred. We also see that the solar energy was enough to power the workload, charge the batteries, and feed energy to the grid. Compared to a grid-only datacenter, GreenSwitch produced a profit of 9 percent in grid electricity cost. Given this profit, GreenSwitch would amortize the cost of the solar setup and batteries in only 7.6 years.

Despite seeking primarily to minimize grid electricity cost, GreenSwitch is also successful at reducing carbon footprints. It achieves reductions in grid energy use of between 36 and 100 percent in our experiments with Facebook and Nutch, compared to a grid-only datacenter.

Main lessons learned

We have learned many important lessons in building Parasol and GreenSwitch. First, we learned that engineering contractors are unfamiliar with the state of the art in datacenter design or with research prototypes. Our inability to bridge this knowledge gap quickly (or at all) caused delays. This is a challenge for organizations that want to build datacenters but lack the expertise.

Because Parasol was a major undertaking, its design needed to enable research on many topics (such as solar energy, free cooling, and wimpy servers). However, because we had not yet started to research every topic, we ended up designing more features and flexibility into Parasol than we might eventually need. This increased costs.

We also found that the need to collect fine-grained power measurements and accurately estimate energy losses led to extra design complexity. In addition, placing Parasol on the roof of a building (instead of on the ground) prevented shading from other buildings. Moreover, the cost of the roof placement was roughly the same as that of extending networking and power to ground locations far enough away from buildings. We also learned that the wimpy fans in wimpy servers can generate nontrivial temperature differences across a free-cooled datacenter.

Finally, and most importantly, we learned that building a real prototype is critical for completely understanding green datacenters.

For example, in designing GreenSwitch, we detected instability in our charge controllers when switching power sources. As a result, GreenSwitch performs these switches in steps, with some idle time in between. Such effects would have been overlooked in simulation.

Potential long-term impact

We expect Parasol and GreenSwitch to have a lasting impact on both academia and industry for several reasons.

Renewable energy

As we mentioned earlier, several companies are starting to invest in datacenter colocation and self-generation. Regardless of whether they're making these investments for market positioning, public relations, cost, or environmental reasons, the fact is that they are expecting bottom-line benefits from them. Moreover, despite their decreasing but still-high capital costs, exploiting renewables in datacenters could reduce overall energy costs, peak grid power costs, or both, as our ASPLOS paper explains. We expect that an increasing number of companies will see benefits in exploiting renewables.

Some research groups have also started studying colocated and self-generating datacenters.4,5,7,11,12 These studies have been attracting the attention of a growing community, with publications in venues such as the International Symposium on Computer Architecture (ISCA) and the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). We expect that our design and experience with Parasol will accelerate this growth, as researchers realize that they can build nontrivial prototypes at relatively low cost. Moreover, our analysis of solar and wind energy cost and space requirements suggests that green datacenters will become increasingly attractive.3

More broadly than datacenters, our experience will likely encourage more researchers to consider the implications of external signals (such as variable electricity pricing and availability) on computing and communication in general.

Green datacenter prototype


There has been a dearth of real platforms for the study of colocated and self-generating green datacenters. Parasol addresses this need and is the first platform of its kind. Prior studies have had to resort to simulations or small implementations. In our ASPLOS paper,3 we list instances in which such alternatives would have hidden important effects. We mentioned the instability issues earlier. Another example is that energy losses (for example, in power conversion) are highly dependent on load, rather than a fixed percentage, as often assumed in simulation. These instances will encourage researchers to build prototypes for their studies. We expect the Parasol design to serve as a model for these future research prototypes. Moreover, Parasol enables research on various important topics, including solar energy and its impact on computing, energy storage and its ability to lower costs, free cooling and its impact on reliability, wimpy servers and their performance/energy trade-offs, and the development of distributed storage systems using solid-state drives. These topics are of interest to both industry and academia.

In its current form, Parasol is a blueprint for industry to build small-scale, low-density green datacenters for enterprises and educational institutions. Self-generating containers are cheaper and more practical to operate than machine rooms inside existing buildings, and they can be placed in less-valuable locations. Parasol is also suitable for remote deployments with poor or no access to electricity (networking might need to take place over satellite in this case).

Energy source and storage manager for green datacenters

GreenSwitch simultaneously manages workload demand, multiple energy sources (renewable, battery, and grid), and multiple energy stores (battery and grid). Our results show that it is consistently effective at reducing grid electricity costs and carbon footprints.

Although often overlooked in academia, simplicity and adaptability are key requirements for practical adoption by industry. We designed GreenSwitch to have both properties. Specifically, it uses simple models of solar energy availability, energy demand, and battery behavior.


In addition, although our current implementation targets Hadoop, GreenSwitch is modular in that only one component (the configurer) is specific to the underlying computing framework.
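To make the modularity claim concrete, a framework-specific configurer needs to implement only a narrow interface like the one sketched below, while the predictor and solver stay framework-agnostic. The interface and names are ours, not GreenSwitch's actual API, and the Hadoop actions shown are placeholders.

```python
from abc import ABC, abstractmethod

# Hypothetical illustration of the modularity described above: only the
# configurer knows about the computing framework. Names are ours, not
# GreenSwitch's actual API.

class Configurer(ABC):
    @abstractmethod
    def apply_workload_schedule(self, schedule: dict) -> None:
        """Defer/release jobs and transition servers between power states."""

    @abstractmethod
    def apply_energy_schedule(self, schedule: dict) -> None:
        """Reconfigure inverters and charge controllers for the chosen sources."""

class HadoopConfigurer(Configurer):
    """Framework-specific piece: maps the schedule onto Hadoop job queues."""

    def apply_workload_schedule(self, schedule: dict) -> None:
        for job, start_hour in schedule.get("deferred_jobs", {}).items():
            print(f"holding {job} until hour {start_hour}")   # placeholder action

    def apply_energy_schedule(self, schedule: dict) -> None:
        print(f"switching energy sources: {schedule}")         # placeholder action
```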

Research avenues
Parasol and GreenSwitch create many new research avenues. For example, Parasol enables the study of the interplay between solar energy and free cooling; interestingly, solar energy is most abundant when the outside temperature is hottest (that is, when standard chiller-based cooling might be necessary in warm climates). As another example, GreenSwitch demonstrates the benefits of aggressive and coordinated management of energy sources, energy stores, and workload execution, as well as the interplay between using batteries for powering the workload and for storing renewable energy. Prior work on aggressive use of batteries did not consider renewables.13

Datacenters that are partially powered by renewable energy represent an increasingly interesting research topic from many perspectives. In this article, we have described Parasol, a solar-powered datacenter that we have built as a research platform, and our experience in constructing and operating Parasol. We have also described GreenSwitch, a workload and power source management system. As we mentioned earlier, Parasol and GreenSwitch enable the exploration of many research avenues. We are currently studying the behavior and management of free-cooled datacenters, as well as the interaction between solar energy and free cooling. We are also studying the design of green energy-aware latency-sensitive applications, such as cloud-based distributed storage systems. Specifically, we are exploring how to design systems that can maintain service-level objectives (for example, a desired 99th percentile response time), while maximizing usage of renewable energy and minimizing usage of brown energy. In conclusion, we hope that our experience with Parasol and GreenSwitch will entice other researchers and practitioners to consider these datacenters. MICRO

Acknowledgments
We thank Abhishek Bhattacharjee, David Meisner, Santosh Nagarakatte, Anand Sivasubramaniam, and Thomas F. Wenisch for comments that helped us improve this article. We are also grateful to our sponsors, NSF grant CSR-1117368, and the Rutgers Green Computing Initiative. Finally, we are indebted to Joan Stanton, Heidi Szymanski, Jon Tenenbaum, Chuck Depasquale, SMA America, and Michael J. Pazzani for their extensive help in building and funding Parasol.

References
1. J. Koomey, "Growth in Data Center Electricity Use 2005 to 2010," Analytic Press, 2011.
2. J. Mankoff, R. Kravets, and E. Blevis, "Some Computer Science Issues in Creating a Sustainable World," Computer, vol. 41, no. 8, 2008, pp. 102-105.
3. I. Goiri et al., "Parasol and GreenSwitch: Managing Datacenters Powered by Renewable Energy," Proc. 18th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS '13), 2013, pp. 51-64.
4. B. Aksanli et al., "Utilizing Green Energy Prediction to Schedule Mixed Batch and Service Jobs in Data Centers," Proc. 4th Workshop Power-Aware Computing and Systems (HotPower '11), 2011, article no. 5.
5. I. Goiri et al., "GreenSlot: Scheduling Energy Consumption in Green Datacenters," Proc. Int'l Conf. High Performance Computing, Networking, Storage and Analysis (SC '11), 2011, article no. 20.
6. I. Goiri et al., "GreenHadoop: Leveraging Green Energy in Data-Processing Frameworks," Proc. 7th ACM European Conf. Computer Systems (EuroSys '12), 2012, pp. 57-70.
7. A. Krioukov et al., "Integrating Renewable Energy Using Data Analytics Systems: Challenges and Opportunities," Data Eng. Bulletin, vol. 34, no. 1, 2011, pp. 3-11.
8. Z. Liu et al., "Renewable and Cooling Aware Workload Management for Sustainable Data Centers," Proc. 12th ACM SIGMETRICS/PERFORMANCE Joint Int'l Conf. Measurement and Modeling of Computer Systems, 2012, pp. 175-186.
9. Y. Chen et al., "The Case for Evaluating MapReduce Performance Using Workload Suites," Proc. Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2011, pp. 390-399.
10. EPFL, CloudSuite, 2012; http://parsa.epfl.ch/cloudsuite/cloudsuite.html.
11. C. Li, A. Qouneh, and T. Li, "iSwitch: Coordinating and Optimizing Renewable Energy Powered Server Clusters," Proc. 39th Ann. Int'l Symp. Computer Architecture (ISCA '12), 2012, pp. 512-523.
12. N. Sharma et al., "Blink: Managing Server Clusters on Intermittent Power," Proc. 16th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS '11), 2011, pp. 185-198.
13. S. Govindan et al., "Leveraging Stored Energy for Handling Power Emergencies in Aggressively Provisioned Datacenters," Proc. 17th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS '12), 2012, pp. 75-86.

Iñigo Goiri is a research associate in the Department of Computer Science at Rutgers University. His research interests include energy-efficient datacenter design and virtualization. Goiri has a PhD in computer science from the Universitat Politecnica de Catalunya.

William Katsak is a PhD student in the Department of Computer Science at Rutgers University. His research focuses on power management of datacenters. Katsak has an MS in computer science from Rutgers University. He is a student member of IEEE and the ACM.

Kien Le is a software engineer at A10 Networks. His research focuses on building a cost-aware load distribution framework to reduce energy consumption and promote renewable energy. Le has a PhD in computer science from Rutgers University, where he completed the work for this article.

Thu D. Nguyen is an associate professor in the Department of Computer Science at Rutgers University. His research interests include green computing, distributed and parallel systems, operating systems, and information retrieval. Nguyen has a PhD in computer science and engineering from the University of Washington. He is a member of IEEE and the ACM.

Ricardo Bianchini is a professor in the Department of Computer Science at Rutgers University. He is currently on leave from Rutgers and working as the chief efficiency strategist at Microsoft. His research interests include the power, energy, and thermal management of servers and datacenters. Bianchini has a PhD in computer science from the University of Rochester. He is an ACM distinguished scientist and a senior member of IEEE.

Direct questions and comments about this article to Iñigo Goiri, Department of Computer Science, Rutgers University, 110 Frelinghuysen Road, Piscataway, NJ 08854-8019; goiri@cs.rutgers.edu.


..................................................................................................................................................................................................................

QUALITY-OF-SERVICE-AWARE SCHEDULING IN HETEROGENEOUS DATACENTERS WITH PARAGON

Christina Delimitrou
Christos Kozyrakis
Stanford University

PARAGON, AN ONLINE, SCALABLE DATACENTER SCHEDULER, ENABLES BETTER CLUSTER UTILIZATION AND PER-APPLICATION QUALITY-OF-SERVICE GUARANTEES BY LEVERAGING DATA MINING TECHNIQUES THAT FIND SIMILARITIES BETWEEN KNOWN AND NEW APPLICATIONS. FOR A 2,500-WORKLOAD SCENARIO, PARAGON PRESERVES PERFORMANCE CONSTRAINTS FOR 91 PERCENT OF APPLICATIONS, WHILE SIGNIFICANTLY IMPROVING UTILIZATION. IN COMPARISON, A BASELINE LEAST-LOADED SCHEDULER ONLY PROVIDES SIMILAR GUARANTEES FOR 3 PERCENT OF WORKLOADS.

Efficiency is a first-class requirement and the main source of scalability concerns both for small and large systems.1,2 Achieving high efficiency is not only a matter of sensible design, but also a function of how the system is managed, which becomes essential as the hardware grows progressively heterogeneous and parallel and applications become more dynamic and diverse. Architecture has traditionally been about efficient system design. As efficiency increases in importance, architecture should be about both design and management for systems of any scale.
In this article, we focus on improving efficiency while guaranteeing high performance in large-scale systems. Although an increasing amount of computing now happens in public and private clouds, such as Amazon Elastic
Compute Cloud (EC2; see http://aws.amazon.com/ec2) or vSphere (www.vmware.com/products/vsphere), datacenters continue to operate at utilizations in the single digits.1,3 This lessens the two main advantages of cloud computing, flexibility and cost efficiency, for both cloud operators and end users, because not only are the machines underutilized, they are also operating in a non-energy-proportional region.1,4
There can be several reasons why machines are underutilized. Two of the most prominent obstacles are interference between coscheduled applications and heterogeneity in server platforms. For more information, see the "Interference and Heterogeneity" sidebar.
In our paper presented at the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2013),5 we introduced Paragon, an online and scalable datacenter scheduler that accounts for heterogeneity and interference.

Interference and Heterogeneity
Interference occurs as coscheduled applications contend in shared resources. Coscheduled applications may interfere negatively even if they run on different processor cores because they share caches, memory channels, storage, and networking devices.1,2 If unmanaged, interference can result in performance degradations of integer factors,2 especially when the application must meet tail latency guarantees apart from average performance.3 Figure A shows that an interference-oblivious scheduler will slow workloads down by 34 percent on average, with some running more than two times slower. This is undesirable for both users and operators.
Heterogeneity is the natural result of the infrastructure's evolution, as servers are gradually provisioned and replaced over the typical 15-year lifetime of a datacenter.4-7 At any point in time, a datacenter may host three to five server generations with a few hardware configurations per generation, in terms of the processor speed, memory, storage, and networking subsystems. Managing the different hardware incorrectly not only causes significant performance degradations to applications sensitive to server configuration, but also wastes resources as workloads occupy servers for significantly longer, and gives a low-quality signal to hardware vendors for the design of future platforms. Figure A shows that a heterogeneity-oblivious scheduler will slow applications down by 22 percent on average, with some running nearly 2 times slower (see the Methodology section in the main article).
Finally, a baseline scheduler that is oblivious to both interference and heterogeneity and schedules applications to least-loaded servers is even worse (48 percent average slowdown), causing some workloads to crash due to resource exhaustion on the server. Unless interference and heterogeneity are managed in a coordinated fashion, the system loses both its efficiency and predictability guarantees. Previous research has identified the issues of heterogeneity6 and interference,2 but while most cloud management systems, such as Mesos8 or vSphere (www.vmware.com/products/vsphere), have some notion of contention or interference awareness, they either use empirical rules for interference management or assume long-running workloads (for example, online services), whose repeated behavior can be progressively modeled. In this article, we target both heterogeneity and interference and assume no a priori analysis of the application. Instead, we leverage information the system already has about the large number of applications it has previously seen.

Figure A. Performance degradation for 5,000 applications on 1,000 Amazon Elastic Compute Cloud (EC2) servers with heterogeneity-oblivious, interference-oblivious, and baseline least-loaded schedulers compared to ideal scheduling (application runs alone on best platform). Results are ordered from worst- to best-performing workload.
References
1. S. Govindan et al., "Cuanta: Quantifying Effects of Shared On-Chip Resource Interference for Consolidated Virtual Machines," Proc. 2nd ACM Symp. Cloud Computing, 2011, article no. 22.
2. J. Mars et al., "Bubble-Up: Increasing Utilization in Modern Warehouse Scale Computers via Sensible Co-locations," Proc. 44th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2011, pp. 248-259.
3. D. Meisner et al., "Power Management of Online Data-Intensive Services," Proc. 38th Ann. Int'l Symp. Computer Architecture (ISCA '11), 2011, pp. 319-330.
4. L.A. Barroso and U. Holzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan and Claypool Publishers, 2009.
5. C. Kozyrakis et al., "Server Engineering Insights for Large-Scale Online Services," IEEE Micro, vol. 30, no. 4, 2010, pp. 8-19.
6. J. Mars, L. Tang, and R. Hundt, "Heterogeneity in Homogeneous Warehouse-Scale Computers: A Performance Opportunity," IEEE Computer Architecture Letters, vol. 10, no. 2, 2011, pp. 29-32.
7. R. Nathuji, C. Isci, and E. Gorbatov, "Exploiting Platform Heterogeneity for Power Efficient Data Centers," Proc. 4th Int'l Conf. Autonomic Computing (ICAC '07), 2007, doi:10.1109/ICAC.2007.16.
8. B. Hindman et al., "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center," Proc. 8th USENIX Conf. Networked Systems Design and Implementation, 2011, article no. 22.


The key feature of Paragon is its ability to quickly and accurately classify an unknown application with respect to heterogeneity (which server configurations it will perform best on) and interference (how much interference it will cause to coscheduled applications and how much interference it can tolerate itself in multiple shared resources). Unlike previous techniques that require detailed profiling of each incoming application, Paragon's classification engine exploits existing data from previously scheduled workloads and requires only a minimal signal about a new workload. Specifically, it is organized as a low-overhead recommendation system similar to the one deployed for the Netflix Challenge,6 but instead of discovering similarities in users' movie preferences, it finds similarities in applications' preferences with respect to heterogeneity and interference. It uses singular value decomposition (SVD) to perform collaborative filtering and identify similarities between incoming and previously scheduled workloads.
Once an incoming application is classified, a greedy scheduler assigns it to the server that is the best possible match in terms of platform and minimum negative interference between all coscheduled workloads. Even though the final step is greedy, the high accuracy of classification leads to schedules that achieve both fast execution time and efficient resource usage. Paragon scales to systems with tens of thousands of servers and tens of configurations, running large numbers of previously unknown workloads. We implemented Paragon and showed that it significantly improves cluster utilization, while preserving per-application quality-of-service (QoS) guarantees both for small- and large-scale systems. For more information on related work, see the "Research Related to Paragon" sidebar.

Fast and accurate classification
The key requirement for heterogeneity- and interference-aware scheduling is to quickly and accurately classify incoming applications. First, we need to know how fast an application will run on each of the tens of server configurations (SCs) available. Second, we need to know how much interference it can tolerate from other workloads in each of several shared resources without significant performance loss and how much interference it will generate itself. Our goal is to perform online scheduling for large-scale systems without any a priori knowledge about incoming applications. Most previous schemes address this issue with detailed but offline application characterization or long-term monitoring and modeling.7-9 Paragon takes a different approach. Its core idea is that, instead of learning each new workload in detail, the system leverages information it already has about applications it has seen to express the new workload as a combination of known applications. For this purpose, we use collaborative filtering techniques that combine a minimal profiling signal about the new application with the large amount of data available from previously scheduled workloads. The result is fast and accurate classification of incoming applications with respect to heterogeneity and interference. Within a minute of its arrival, an incoming workload is scheduled on a large-scale cluster.

Background on collaborative filtering
Collaborative filtering techniques are frequently used in recommendation systems. We use one of their most publicized applications, the Netflix Challenge,6 to provide a quick overview of the two analytical methods we rely on, SVD and PQ reconstruction.10 In this case, the goal is to provide valid movie recommendations for Netflix users given the ratings they have provided for various other movies.
The input to the analytical framework is a sparse matrix A, the utility matrix, with one row per user and one column per movie. The elements of A are the ratings that users have assigned to movies. Each user has rated only a small subset of movies; this is especially true for new users, who might only have a handful of ratings, or even none. Although techniques exist that address the cold-start problem (that is, providing recommendations to a completely fresh user with no ratings), we focus here on users for whom the system has some minimal input. If we can estimate the values of the missing ratings in the sparse matrix A, we can make movie recommendations; that is, we can suggest that users watch the movies for which the recommendation system estimates they will give high ratings with high confidence.

Research Related to Paragon

We discuss work relevant to Paragon in the areas of datacenter


scheduling, virtual machine (VM) management, workload rightsizing,
and scheduling for heterogeneous multicore chips.

Datacenter scheduling
Recent work on datacenter scheduling has highlighted the importance of platform heterogeneity and workload interference. Mars
et al. showed that the performance of Google workloads can vary by
up to 40 percent because of heterogeneity, even when considering
only two server configurations, and by up to 2 times because of interference, even when considering only two colocated applications.1,2
Govindan et al. also present a scheme to quantify the effects of cache
interference between consolidated workloads.3 In Paragon, we extend
the concepts of heterogeneity- and interference-aware scheduling by
providing an online, scalable, and low-overhead methodology that
accurately classifies applications for both heterogeneity and interference across multiple resources.

Resource management and rightsizing


There has been significant work on resource allocation in virtualized and nonvirtualized large-scale datacenters. Mesos performs
resource allocation between distributed computing frameworks such
as Hadoop or Spark.4 Rightscale (http://www.rightscale.com) automatically scales out three-tier applications to react to changes in the
load in Amazon's cloud service. DejaVu serves a similar goal by identifying a few workload classes and, based on them, reusing previous
resource allocations to minimize reallocation overheads.5 In general,
Paragon is complementary to rightsizing systems. Once such a system
determines the amount of resources needed by an application, Paragon can classify and schedule it on the proper hardware platform in a
way that minimizes interference.

VM management
Systems such as vSphere (http://www.vmware.com/products/vsphere) or the VM platforms on public cloud providers can schedule diverse workloads submitted by users on the available servers. In general, these platforms account for application resource requirements that they expect the user to express or they learn over time by monitoring workload execution. Paragon can complement such systems by making scheduling decisions on the basis of heterogeneity and interference and detecting when an application should be considered for rescheduling.

Scheduling for heterogeneous multicore chips
Scheduling in heterogeneous CMPs shares some concepts and challenges with scheduling in heterogeneous datacenters; thus, some of the ideas in Paragon can be applied in heterogeneous CMP scheduling as well. Shelepov et al. present a scheduler for heterogeneous CMPs that is simple and scalable,6 whereas Craeynest et al. use performance statistics to estimate which workload-to-core mapping is likely to provide the best performance.7 Given the increasing number of cores per chip and coscheduled tasks, techniques similar to the ones used in Paragon can be applicable when deciding how to schedule applications in heterogeneous CMPs as well.

References
1. J. Mars, L. Tang, and R. Hundt, "Heterogeneity in Homogeneous Warehouse-Scale Computers: A Performance Opportunity," IEEE Computer Architecture Letters, vol. 10, no. 2, 2011, pp. 29-32.
2. J. Mars et al., "Bubble-Up: Increasing Utilization in Modern Warehouse Scale Computers via Sensible Co-locations," Proc. 44th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2011, pp. 248-259.
3. S. Govindan et al., "Cuanta: Quantifying Effects of Shared On-Chip Resource Interference for Consolidated Virtual Machines," Proc. 2nd ACM Symp. Cloud Computing, 2011, article no. 22.
4. B. Hindman et al., "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center," Proc. 8th USENIX Conf. Networked Systems Design and Implementation, 2011, article no. 22.
5. N. Vasic et al., "DejaVu: Accelerating Resource Allocation in Virtualized Environments," Proc. 17th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2012, pp. 423-436.
6. D. Shelepov et al., "HASS: A Scheduler for Heterogeneous Multicore Systems," ACM SIGOPS Operating Systems Rev., vol. 43, no. 2, 2009, pp. 66-75.
7. K. Craeynest et al., "Scheduling Heterogeneous Multi-Cores through Performance Impact Estimation (PIE)," Proc. 39th Ann. Int'l Symp. Computer Architecture (ISCA '12), 2012, pp. 213-224.

The first step is to apply SVD, a matrix factorization method used for dimensionality reduction and similarity identification. Factoring A produces the decomposition into the following matrices of left (U) and right (V) singular vectors and the diagonal matrix of singular values (Σ):

A_{m \times n} =
\begin{pmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\
\vdots  & \vdots  & \ddots & \vdots  \\
a_{m,1} & a_{m,2} & \cdots & a_{m,n}
\end{pmatrix}
= U \cdot \Sigma \cdot V^{T}, \quad \text{where}

U_{m \times r} =
\begin{pmatrix}
u_{1,1} & \cdots & u_{1,r} \\
\vdots  & \ddots & \vdots  \\
u_{m,1} & \cdots & u_{m,r}
\end{pmatrix}, \quad
V_{n \times r} =
\begin{pmatrix}
v_{1,1} & \cdots & v_{1,r} \\
\vdots  & \ddots & \vdots  \\
v_{n,1} & \cdots & v_{n,r}
\end{pmatrix}, \quad
\Sigma_{r \times r} =
\begin{pmatrix}
\sigma_{1} & \cdots & 0 \\
\vdots     & \ddots & \vdots \\
0          & \cdots & \sigma_{r}
\end{pmatrix}

Dimension r is the rank of matrix A, and it represents the number of similarity concepts identified by SVD. For instance, one similarity concept might be that certain movies belong to the drama category, while another might be that most users who liked the movie The Lord of the Rings: The Fellowship of the Ring also liked The Lord of the Rings: The Two Towers. Similarity concepts are represented by singular values σ_i in matrix Σ, and the confidence in a similarity concept by the magnitude of the corresponding singular value. Singular values in Σ are ordered by decreasing magnitude. Matrix U captures the strength of the correlation between a row of A and a similarity concept. In other words, it expresses how users relate to similarity concepts such as the one about liking drama movies. Matrix V captures the strength of the correlation of a column of A to a similarity concept. In other words, to what extent does a movie fall in the drama category? The complexity of performing SVD on an m \times n matrix is \min(n^2 m, m^2 n). SVD is robust to missing entries and imposes relaxed sparsity constraints to provide accuracy guarantees.
Before we can make accurate score estimations using SVD, we need the full utility matrix A. To recover the missing entries in A,

we use PQ reconstruction. Building from the decomposition of the initial sparse matrix A, we have Q_{m \times r} = U and P^{T}_{r \times n} = \Sigma \cdot V^{T}. The product of Q and P^{T} gives matrix R, which is an approximation of A with the missing entries. To improve R, we use stochastic gradient descent (SGD), a scalable and lightweight latent-factor model that iteratively recreates A. For all r_{ui}, where r_{ui} is an element of the reconstructed matrix R:

\epsilon_{ui} = r_{ui} - q_{i} \cdot p_{u}^{T}
q_{i} \leftarrow q_{i} + \eta (\epsilon_{ui} p_{u} - \lambda q_{i})
p_{u} \leftarrow p_{u} + \eta (\epsilon_{ui} q_{i} - \lambda p_{u})

until \|\epsilon\|_{L2} = \sqrt{\sum_{u,i} \epsilon_{ui}^{2}} becomes marginal.

In this process, \eta is the learning rate and \lambda is the regularization factor. The complexity of PQ reconstruction is linear in the number of r_{ui}, and in practice it takes up to a few milliseconds for matrices whose m and n equal about 1,000. Once the dense utility matrix R is recovered, we can make movie recommendations. This involves applying SVD to R to identify which of the reconstructed entries reflect strong similarities that enable making accurate recommendations with high confidence.
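As a concrete illustration of the SGD-based PQ reconstruction described above, the following sketch fills in the missing entries of a small utility matrix; the matrix values, rank, learning rate, regularization factor, and stopping threshold are illustrative choices of ours rather than the settings used by Paragon.

```python
import numpy as np

def pq_reconstruct(A, known, rank=2, eta=0.01, lam=0.05, max_iters=2000, seed=0):
    """Reconstruct a dense matrix R ~ A from the entries marked True in `known`."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Q = rng.normal(scale=0.1, size=(m, rank))   # per-row factors (users/applications)
    P = rng.normal(scale=0.1, size=(rank, n))   # per-column factors (movies/SCs)
    rows, cols = np.nonzero(known)
    for _ in range(max_iters):
        for u, i in zip(rows, cols):            # one SGD pass over the known ratings
            qu = Q[u, :].copy()
            err = A[u, i] - qu @ P[:, i]        # epsilon_ui
            Q[u, :] += eta * (err * P[:, i] - lam * qu)
            P[:, i] += eta * (err * qu - lam * P[:, i])
        residual = A[known] - (Q @ P)[known]
        if np.sqrt(np.sum(residual ** 2)) < 1e-3:   # the L2 error is marginal
            break
    return Q @ P

# Toy utility matrix: rows are users (or applications), columns are movies (or SCs);
# zeros stand in for missing ratings.
A = np.array([[5.0, 4.0, 0.0, 1.0],
              [4.0, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 5.0]])
R = pq_reconstruct(A, known=(A > 0))
print(np.round(R, 1))   # dense estimate, including the previously missing entries
```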

Classification for heterogeneity
We use collaborative filtering to identify how well a previously unknown workload will run on different hardware platforms. The rows in matrix A represent applications, the columns represent server configurations (SCs), and the ratings represent normalized application performance on each SC. As part of an offline step, we select a small number of applications and profile them on all the different SCs. This provides some initial information to the classification engine to address the cold-start problem that would otherwise occur. It only needs to happen once in the system.
During regular operation, when an application arrives, we profile it for 1 minute on any two SCs, insert it as a new row in matrix A, and use the process described previously to derive the missing ratings for the other server configurations. In this case, Σ represents similarity concepts such as the fact that applications that benefit from SC1 will also benefit from SC3.


U captures how an application correlates to the different similarity concepts, and V shows how an SC correlates to them. Collaborative filtering identifies similarities between new and known applications. Two applications can be similar in one characteristic (for instance, they both benefit from high clock frequency) but different in others (for example, only one benefits from a large L3 cache). This is especially common when scaling to large application spaces and hardware configurations. SVD addresses this issue by uncovering hidden similarities and filtering out the ones less likely to have an impact on the application's behavior.
As incoming applications are added in A, the density of the matrix increases and the recommendation accuracy improves. Note that online training is performed only on two SCs. This reduces the training overhead and the number of servers needed for it compared to exhaustive search. In contrast, if we attempted exhaustive application profiling, the number of profiling runs would equal the number of SCs. For a cloud service with high workload arrival rates, this would be infeasible to support. On a production-class Xeon server, classification takes 10 to 30 milliseconds for thousands of applications and tens of SCs. We can perform classification for one application at a time or for small groups of incoming applications (batching) if the arrival rate is high, without impacting accuracy or speed.
Performance scores. We use the following performance metrics according to the application type:

Single-threaded workloads: We use instructions committed per second (IPS) as the initial performance metric. Using execution time would require running applications to completion during profiling, increasing overheads. We have verified that IPS leads to similar classification accuracy as using time to completion. For multiprogrammed workloads, we use aggregate IPS.

Multithreaded workloads: In the presence of spinlocks or other synchronization schemes, IPS can be deceptive. We address this by detecting active waiting and weighting such execution segments out of the IPS computation. We verified that using this useful IPS leads to similar classification accuracy as using the full execution time.

The choice of IPS is influenced by our current evaluation, which focuses on single-node CPU-, memory-, and I/O-intensive programs. The same methodology can be extended to higher-level metrics, such as queries per second (QPS), which cover complex multitier workloads as well.
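To make the notion of "useful IPS" concrete, here is a minimal sketch that aggregates IPS while weighting out profiled segments flagged as active waiting; the segment representation and the flagging of spinning are illustrative assumptions of ours, not Paragon's actual instrumentation.

```python
from typing import List, Tuple

# Each profiled segment: (instructions_committed, seconds, is_active_waiting).
Segment = Tuple[float, float, bool]

def useful_ips(segments: List[Segment]) -> float:
    """Instructions per second over the segments that are not active waiting."""
    instructions = sum(i for i, s, spinning in segments if not spinning)
    seconds = sum(s for i, s, spinning in segments if not spinning)
    return instructions / seconds if seconds > 0 else 0.0

# Example: the middle segment is a spinlock wait and is weighted out of the metric.
profile = [(2.0e9, 1.0, False), (0.4e9, 1.0, True), (1.8e9, 1.0, False)]
print(f"useful IPS = {useful_ips(profile):.2e}")   # 1.90e+09
```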
Validation. We evaluate the accuracy of heterogeneity classification on a 40-server cluster with 10 SCs with a large set of diverse applications. The offline training set includes 20 randomly selected applications. Using the classification output for scheduling improves performance by 24 percent for single-threaded workloads, 20 percent for multithreaded workloads, 38 percent for multiprogrammed workloads, and 40 percent for I/O workloads, on average, while some applications have a 2x performance difference. Table 1 summarizes key statistics on the validation study. It is important to note that the accuracy does not depend on the SCs selected for training, which matched the top-performing configuration only for 20 percent of workloads. We also compare performance predicted by the recommendation system to performance obtained through experimentation. The deviation is 3.8 percent on average.

Classification for interference
We are interested in two types of interference: that which an application can tolerate from preexisting load on a server, and that which the application will cause on that load. We detect interference due to contention and assign a score to the sensitivity of an application to a type of interference. To derive sensitivity scores, we develop several microbenchmarks (sources of interference, or SoIs), each stressing a specific shared resource with tunable intensity.11 SoIs span the core, the memory and cache hierarchy, and network and storage bandwidth. We run an application concurrently with a microbenchmark


Table 1. Validation of heterogeneity classification.

Metric                                      Single threaded (%)   Multithreaded (%)   Multiprogrammed (%)   I/O bound (%)
Selected best platform                               86                   86                  83                  89
Selected platform within 5% of best                  91                   90                  89                  92
Correct platform ranking (best to worst)             67                   62                  59                  43
≥90% correct platform ranking                        78                   71                  63                  58
Training and best selected platform match            28                   24                  18                  22

Table 2. Validation of interference classification.

Metric                                                                    Percentage (%)
Average estimation error of sensitivity across all examined resources          5.3
Average estimation error for sensitivities > 60%                               3.4
Applications with < 5% estimation error                                       59.0
Resource with highest estimation error: L1 instruction cache                  15.8
Frequency L1 instruction cache used for training                              14.6
Resource with lowest estimation error: storage bandwidth                       0.9

and progressively tune up its intensity until the application violates its QoS. Applications with high tolerance to interference (for example, a sensitivity score over 60 percent) are easier to coschedule than applications with low tolerance. Similarly, we detect the sensitivity of a microbenchmark to the interference the application causes by tuning up its intensity and recording when the microbenchmark's performance degrades by 5 percent compared to its performance in isolation. In this case, high-sensitivity scores correspond to applications that cause a lot of interference in the specific shared resource.
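The sketch below illustrates how a tolerated-interference sensitivity score could be derived by ramping up a microbenchmark's intensity until the application's QoS is violated, as described above; the measurement callable, the 5 percent intensity step, the QoS threshold, and the use of the last tolerated intensity as the score are illustrative assumptions of ours.

```python
def tolerated_sensitivity(perf_with_soi, baseline_perf, qos_fraction=0.95, step=5):
    """Return the highest microbenchmark intensity (percent) the application
    tolerates before its performance drops below the QoS threshold.

    perf_with_soi(intensity) -> measured application performance when colocated
    with the source-of-interference microbenchmark at that intensity."""
    tolerated = 0
    for intensity in range(step, 101, step):
        if perf_with_soi(intensity) < qos_fraction * baseline_perf:
            break                      # QoS violated at this intensity
        tolerated = intensity
    return tolerated

# Toy stand-in for, say, a memory-bandwidth SoI: colocated performance degrades
# roughly linearly as the microbenchmark intensity grows.
colocated_perf = lambda intensity: 100.0 * (1.0 - 0.004 * intensity)
print(tolerated_sensitivity(colocated_perf, baseline_perf=100.0))   # prints 10
```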
Collaborative filtering for interference. We classify applications for interference tolerated and caused, using twice the process described earlier. The two utility matrices have applications as rows and SoIs as columns. The elements of the matrices are the sensitivity scores of an application to the corresponding microbenchmark. Similarly to classification for heterogeneity, we profile a few applications offline against all SoIs and insert them as dense rows in the utility matrices. In the online mode, each new application is profiled against two randomly chosen microbenchmarks for one minute, and its sensitivity scores are added in a new row in each of the matrices. Then, we use SVD and PQ reconstruction to derive the missing entries and the confidence in each similarity concept.
Validation. We evaluated the accuracy of interference classification using the same workloads and systems as before. Table 2 summarizes key statistics on the classification quality. The average error in estimating both tolerated and caused interference across SoIs is 5.3 percent. For high values of sensitivity (that is, applications that tolerate and cause a lot of interference), the error is even lower (3.4 percent).

Putting it all together
Overall, Paragon requires two short runs (approximately 1 minute) on two SCs to classify incoming applications for heterogeneity. Another two short runs against two microbenchmarks on a high-end SC are needed for interference classification. Running for 1 minute provides some signal on the new workload without introducing significant profiling overheads.


Figure 1. The components of Paragon and the state maintained by each component: Step 1, application classification (classification for heterogeneity and for interference, both using SVD and PQ reconstruction), followed by Step 2, server selection (selection of colocation candidates among the datacenter servers). Overall, the state requirements are marginal and scale linearly or logarithmically with the number of applications (N), servers (M), and configurations. (PQ: PQ reconstruction; SVD: singular value decomposition; DC: datacenter.)

In our full paper,5 we discuss the issue of workload phases (that is, transient effects that do not appear in the 1-minute profiling period). Next, we use collaborative filtering to classify the application in terms of heterogeneity and interference. This requires a few milliseconds even when considering thousands of applications and several tens of SCs or SoIs. Classification for heterogeneity and interference is performed in parallel. For the applications we considered, the overall profiling and classification overheads are 1.2 and 0.09 percent on average.
Using analytical methods for classification has two benefits. First, we have strong analytical guarantees on the quality of the information used for scheduling, instead of relying mainly on empirical observation. The analytical framework provides low and tight error bounds on the accuracy of classification, statistical guarantees on the quality of colocation candidates, and detailed characterization of system behavior. Moreover, the scheduler design is workload independent, which means that the properties the scheme provides hold for any workload. Second, these methods are computationally efficient, scale well with the number of applications and SCs, and do not introduce significant scheduling overheads.

Paragon
Once an incoming application is classified with respect to heterogeneity and interference, Paragon schedules it on one of the available servers. The scheduler attempts to assign each workload to the server of the best SC and colocate it with applications so that interference is minimized for workloads running on the same server.

Scheduler design
Figure 1 presents an overview of Paragon's components and operation. The scheduler maintains per-application and per-server state. The per-application state includes the classification information; for a datacenter with 10 SCs and 10 SoIs, it is 64 bytes per application. The per-server state records the IDs of applications running on a server and the cumulative sensitivity to interference (roughly 64 bytes per server). The per-server state is updated as applications are scheduled and, later on, completed. Overall, state overheads are marginal and scale logarithmically or linearly with the number of applications (N) and servers (M). In our experiments with thousands of applications and servers, a single server could handle all processing and storage requirements of scheduling, although additional servers can be used for fault tolerance.
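A minimal sketch of the per-server bookkeeping described above is shown below, assuming per-SoI scores on a 0-100 scale; the class layout and field names are our own illustration, while the aggregation rules (sum of caused scores, minimum of tolerated scores) follow the definitions given in the "Greedy server selection" section that follows.

```python
class ServerState:
    """Per-server record: resident application IDs plus cumulative
    interference sensitivities per source of interference (SoI)."""

    def __init__(self, sois):
        self.sois = list(sois)
        self.apps = {}   # app_id -> {'tolerated': {soi: score}, 'caused': {soi: score}}

    def schedule(self, app_id, tolerated, caused):
        self.apps[app_id] = {'tolerated': dict(tolerated), 'caused': dict(caused)}

    def complete(self, app_id):
        self.apps.pop(app_id, None)

    def caused_cumulative(self, soi):
        # Sum of the caused-interference scores of the resident applications.
        return sum(a['caused'][soi] for a in self.apps.values())

    def tolerated_cumulative(self, soi):
        # Minimum tolerated-interference score among the resident applications.
        scores = [a['tolerated'][soi] for a in self.apps.values()]
        return min(scores) if scores else 100   # an empty server tolerates anything

server = ServerState(['cache', 'net'])
server.schedule('app1', {'cache': 70, 'net': 60}, {'cache': 30, 'net': 10})
server.schedule('app2', {'cache': 50, 'net': 90}, {'cache': 20, 'net': 40})
print(server.tolerated_cumulative('cache'), server.caused_cumulative('net'))   # 50 50
```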

Greedy server selection
In examining candidates, the scheduler considers two factors: first, which assignments minimize negative interference between the new application and existing load, and second, which servers have the best SC for this workload.


The scheduler evaluates two metrics, D1 = t_server - c_newapp and D2 = t_newapp - c_server, where t is the sensitivity score for tolerated and c for caused interference for a specific SoI. The cumulative sensitivity of a server to caused interference is the sum of the sensitivities of the individual applications running on it, whereas the sensitivity to tolerated interference is the minimum of these values. The optimal candidate is a server for which D1 and D2 are exactly zero for all SoIs, which implies no negative impact from interference and perfect resource usage. In practice, a good selection is one where D1 and D2 are positive and small for all SoIs. Large, positive values for D1 and D2 indicate suboptimal resource utilization. Negative values for D1 or D2 imply violation of QoS.
We examine candidate servers for an application in the following way. The process is explained for interference tolerated by the server and caused by the new workload (D1) and is exactly the same for D2. We start from the resource the new application is most sensitive to. We select the server set for which D1 is non-negative for this SoI. Next, we examine the second SoI in order of decreasing sensitivity scores, filtering out any servers for which D1 is negative, until all SoIs have been examined. Then, we take the intersection of the server sets for D1 and D2 and select the machine with the best SC and with min ||D1 + D2||_L1.
As we filter out servers, at some point the set of candidate servers might become empty. This implies that there is no single server for which D1 and D2 are non-negative for some SoI. Although unlikely, we support this event with backtracking and QoS relaxation. Given M servers, the worst-case complexity is O(M · SoI^2), because, theoretically, backtracking might extend all the way to the first SoI. In practice, however, we observe that for a 1,000-server system, 89 percent of applications were scheduled without any backtracking. For 8 percent of the remaining applications, backtracking led to negative D1 or D2 for a single SoI (and for 3 percent, for multiple SoIs). Additionally, we bound the runtime of the greedy search using a timeout mechanism, after which the best server from the ones already examined is selected.
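The following sketch illustrates the greedy filtering for the D1 direction (interference tolerated by each server versus caused by the new application); the data layout is our own, and backtracking, QoS relaxation, the D2 direction, and the timeout are omitted for brevity.

```python
def pick_server(new_app, servers):
    """Greedy candidate filtering for D1 = t_server - c_newapp.

    new_app: {'caused': {soi: score}, 'best_scs': [preferred server configs]}
    servers: list of {'id', 'sc', 'tolerated': {soi: score}}, where each server's
    tolerated score is the minimum over its resident applications."""
    # Examine SoIs in order of decreasing sensitivity of the new application.
    sois = sorted(new_app['caused'], key=new_app['caused'].get, reverse=True)
    candidates = list(servers)
    for soi in sois:
        kept = [s for s in candidates
                if s['tolerated'][soi] - new_app['caused'][soi] >= 0]
        if not kept:
            break        # the full scheme would backtrack or relax QoS here
        candidates = kept
    # Prefer servers with the best SC, then the smallest L1 norm of D1.
    def d1_l1(s):
        return sum(s['tolerated'][soi] - new_app['caused'][soi] for soi in sois)
    return min(candidates, key=lambda s: (s['sc'] not in new_app['best_scs'], d1_l1(s)))

# Toy example with two SoIs and two servers.
app = {'caused': {'cache': 40, 'net': 10}, 'best_scs': ['SC1']}
servers = [{'id': 'A', 'sc': 'SC1', 'tolerated': {'cache': 55, 'net': 80}},
           {'id': 'B', 'sc': 'SC2', 'tolerated': {'cache': 90, 'net': 20}}]
print(pick_server(app, servers)['id'])   # 'A': both pass the filters; A has the best SC
```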

Our full paper includes a discussion on


workload phases and applicability to multitier latency-critical applications.5

Evaluation methodology
In the following paragraphs, we describe the server systems, alternative schedulers, applications, and workload scenarios used in our evaluation.
We evaluated Paragon on a 1,000-server cluster on Amazon EC2 with 14 instance types from small to extra large.12 All instances were exclusive (reserved); that is, no other users had access to the servers. There were no external scheduling decisions or actions such as auto-scaling or workload migration during the course of the experiments.
We compared Paragon to three schedulers. The first is a baseline scheduler that assigns applications to least-loaded (LL) machines, accounting for their core and memory requirements but ignoring their heterogeneity and interference profiles. The second is a heterogeneity-oblivious (NH) scheme that uses the interference classification in Paragon to assign applications to servers without visibility into their SCs. The third is an interference-oblivious (NI) scheme that uses the heterogeneity classification but has no insight on workload interference.
We used 400 single-threaded (ST), multithreaded (MT), and multiprogrammed (MP) applications from SPEC CPU2006, several multithreaded benchmark suites,5 and SPECjbb. For multiprogrammed workloads, we created 350 mixes of four SPEC applications. We also used 26 I/O-bound workloads in Hadoop and Matlab running on a single node. Workload durations range from minutes to hours. For workload scenarios with more than 426 applications, we replicated these workloads with equal likelihoods (1/4 ST, 1/4 MT, 1/4 MP, and 1/4 I/O) and randomized their interleaving.
We used the applications listed in this section to examine the following scenarios: a low-load scenario with 2,500 randomly chosen applications submitted at 1-second intervals, a high-load scenario with 5,000 applications submitted at 1-second intervals, and an oversubscribed scenario where 7,500 workloads are submitted at 1-second intervals and an additional 1,000 applications arrive in a burst (less than 0.1-second intervals) after the first 3,750 workloads.

Figure 2. Performance comparison between the four schedulers (least loaded, no heterogeneity, no interference, and Paragon) for three workload scenarios (low load, high load, and oversubscribed) on 1,000 Amazon Elastic Compute Cloud (EC2) servers. Performance is normalized to optimal performance in isolation (alone on the best platform), and applications are ordered from worst to best performing.

Evaluation
We evaluated the Paragon scheduler against the LL, NH, and NI schedulers, with respect to performance, decision quality, resource allocation, and cluster utilization.

Performance impact
Figure 2 shows the performance for the three workload scenarios on the 1,000-server EC2 cluster. The low-load scenario, in general, does not create significant performance challenges. Nevertheless, Paragon outperforms the other three schemes; it preserves QoS for 91 percent of workloads and achieves on average 96 percent of the performance of a workload running in isolation in the best SC. When moving to the high-load scenario, the difference between schedulers becomes more obvious. Although the heterogeneity- and interference-oblivious schemes degrade performance by an average of 22 and 34 percent and violate QoS for 96 and 97 percent of workloads, respectively, Paragon degrades performance by only 4 percent and guarantees QoS for 61 percent of workloads.


Figure 3. Breakdown of decision quality for the four schedulers (LL, NH, NI, and Paragon) across the three EC2 scenarios, in terms of heterogeneity (left) and interference (right). The categories are no degradation, less than 10 percent, less than 20 percent, and more than 20 percent performance degradation.

The least-loaded scheduler degrades
performance by 48 percent on average, with
some applications not terminating successfully. The differences in performance are
larger for workloads submitted when the system is heavily loaded.
Finally, for the oversubscribed case, NH,
NI, and LL dramatically degrade performance for most workloads, while the number
of applications that do not terminate successfully increases to 10.4 percent for LL. Paragon, on the other hand, preserves QoS
guarantees for 52 percent of workloads, while
the other schedulers provide similar guarantees only for 5, 1, and 0.09 percent of workloads, respectively. Additionally, it limits
degradation to less than 10 percent for an
additional 33 percent of applications and
maintains moderate performance degradation (with no performance cliffs like the one NH exhibits for applications 1 through 1,000).

Decision quality
Figure 3 shows a breakdown of the decision quality of the different schedulers for
heterogeneity (left) and interference (right)
across the three scenarios. LL induces more

than 20 percent performance degradation to


most applications, both due to heterogeneity
and interference. NH has low decision quality in terms of platform selection, whereas NI
causes performance degradation by colocating
unsuitable applications. The errors increase as
we move to scenarios of higher load. Paragon
decides optimally for 65 percent of applications for heterogeneity and 75 percent for
interference, on average, significantly higher
than the other schedulers. It also constrains
decisions that lead to larger than 20 percent
degradation to less than 8 percent of
workloads.

Resource allocation
Figure 4 shows why this deviation exists.
The solid black line in each graph represents
the required core count based on the applications running at a snapshot of the system,
while the other lines show the allocated cores
by each of the schedulers. Because Paragon
optimizes for increased utilization within QoS
constraints, it follows the application requirements closely. It only deviates when the
required core count exceeds the resources
available in the system (oversubscribed case).
NH has mediocre accuracy, whereas NI and
LL either significantly overprovision the number of allocated cores or oversubscribe certain servers. There are two important points in these graphs.


Figure 4. Resource allocation (allocated core count over time) for the three workload scenarios: (a) low load, (b) high load, and (c) oversubscribed load. Each line corresponds to the number of allocated computing cores at each point during the execution of the scenario. Although the heterogeneity-oblivious (NH), interference-oblivious (NI), and least-loaded (LL) schedulers under- or overestimate the required resources, Paragon closely follows the application resource requirements.

Figure 5. CPU utilization heat maps for the high-load scenario for (a) the least-loaded scheduler and (b) Paragon. Utilization is averaged across the cores of a server and is sampled every 5 seconds. Darker colors correspond to higher CPU utilization in the heat maps.


First, as the load increases, the deviation of execution time from optimal increases for NH, NI, and LL, whereas Paragon approximates it closely. Second, for high loads, the errors in core allocation increase dramatically for the other three schedulers, whereas for Paragon the average deviation remains approximately constant, excluding the part where the system is oversubscribed.

Cluster utilization
Figure 5 shows the cluster utilization in
the high-load scenario for LL and Paragon in
the form of heat maps. Utilization is shown


for each individual server throughout the duration of the experiment and is averaged across the server's cores every 5 seconds. Whereas with LL utilization does not exceed 20 percent for the majority of time, Paragon achieves an average utilization of 52 percent. Additionally, as workloads run closer to their QoS requirements, the scenario completes in 19 percent less time.

The Paragon scheduler moves away from the traditional empirical design approach in computer architecture and systems and adopts a more data-driven approach. In the past few years, we have entered an era where data has become so vast and rich that it can provide much better (and faster) insight on design decisions than the traditional trial-and-error approach can. Applying such techniques in datacenter scheduling with significant gains is proof of the value of using data to drive system design and management decisions. There are other highly dimensional problems where similar techniques can be proven effective, such as the large design-space explorations for either processors13 or memory systems, or the more general cluster management problem in cloud providers. The latter becomes increasingly challenging because many cloud applications are multitier workloads with complex dependencies and they must satisfy strict tail latency guarantees. Additionally, issues like heterogeneity and interference are not relevant only to datacenters. Systems of all scales, from low-power mobile to traditional CMPs and large-scale cloud computing facilities, face similar challenges, which makes employing techniques that work online, are fast, and can handle huge spaces a pressing need.
Determining which data can offer valuable insights in system decisions and designing efficient techniques to collect and mine it in a way that leverages its nature and characteristics is a significant challenge moving forward. MICRO

Acknowledgments
We sincerely thank John Ousterhout, Mendel Rosenblum, Byung-Gon Chun, Daniel Sanchez, Jacob Leverich, David Lo, and the anonymous reviewers for their feedback on earlier versions of this manuscript. This work was partially supported by a Google-directed research grant on energy-proportional computing. Christina Delimitrou was supported by a Stanford Graduate Fellowship.

....................................................................
References

1. L.A. Barroso and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan and Claypool, 2009.
2. J. Rabaey et al., "Beyond the Horizon: The Next 10x Reduction in Power - Challenges and Solutions," Proc. IEEE Int'l Solid-State Circuits Conf., 2011, doi:10.1109/ISSCC.2011.5746206.
3. L.A. Barroso, "Warehouse-Scale Computing: Entering the Teenage Decade," Proc. 38th Ann. Int'l Symp. Computer Architecture (ISCA '11), 2011.
4. D. Meisner et al., "Power Management of Online Data-Intensive Services," Proc. 38th Ann. Int'l Symp. Computer Architecture (ISCA '11), 2011, pp. 319-330.
5. C. Delimitrou and C. Kozyrakis, "Paragon: QoS-Aware Scheduling in Heterogeneous Datacenters," Proc. 18th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS '13), 2013, pp. 77-88.
6. R.M. Bell, Y. Koren, and C. Volinsky, "The BellKor 2008 Solution to the Netflix Prize," tech. report, AT&T Labs, Oct. 2007.
7. J. Mars et al., "Bubble-Up: Increasing Utilization in Modern Warehouse Scale Computers via Sensible Co-locations," Proc. 44th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2011, pp. 248-259.
8. R. Nathuji, C. Isci, and E. Gorbatov, "Exploiting Platform Heterogeneity for Power Efficient Data Centers," Proc. 4th Int'l Conf. Autonomic Computing (ICAC '07), 2007, doi:10.1109/ICAC.2007.16.

9. N. Vasic et al., "DejaVu: Accelerating Resource Allocation in Virtualized Environments," Proc. 17th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2012, pp. 423-436.

10. A. Rajaraman and J.D. Ullman, Mining of Massive Datasets, Cambridge Univ. Press, 2011.
11. C. Delimitrou and C. Kozyrakis, "iBench: Quantifying Interference for Datacenter Workloads," Proc. IEEE Int'l Symp. Workload Characterization, 2013, pp. 23-33.
12. C. Delimitrou and C. Kozyrakis, "QoS-Aware Scheduling in Heterogeneous Datacenters with Paragon," ACM Trans. Computer Systems, vol. 31, no. 4, 2013, article no. 12.
13. O. Azizi et al., "Energy Performance Tradeoffs in Processor Architecture and Circuit Design: A Marginal Cost Analysis," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA '10), 2010, pp. 26-36.

Christina Delimitrou is a PhD student in


the Department of Electrical Engineering at
Stanford University. Her research focuses on
large-scale datacenters, specifically on scheduling and resource allocation techniques
with quality-of-service guarantees, practical
cluster management systems that improve

resource efficiency, and datacenter application analysis and modeling. Delimitrou has
an MS in electrical engineering from
Stanford University. She is a student member of IEEE and the ACM.
Christos Kozyrakis is an associate professor
in the Departments of Electrical Engineering and Computer Science at Stanford University, where he investigates hardware
architectures, system software, and programming models for systems ranging from
cell phones to warehouse-scale datacenters.
His research focuses on resource-efficient
cloud computing, energy-efficient multicore
systems, and architectural support for security. Kozyrakis has a PhD in computer
science from the University of California,
Berkeley. He is a senior member of IEEE
and the ACM.
Direct questions and comments about this article to Christina Delimitrou, Gates Hall, 353 Serra Mall, Room 316, Stanford, CA 94305; cdel@stanford.edu.


..................................................................................................................................................................................................................

A CASE FOR SPECIALIZED PROCESSORS


FOR SCALE-OUT WORKLOADS

..................................................................................................................................................................................................................

EMERGING SCALE-OUT WORKLOADS NEED EXTENSIVE COMPUTATIONAL RESOURCES, BUT


DATACENTERS USING MODERN SERVER HARDWARE FACE PHYSICAL CONSTRAINTS. IN THIS
ARTICLE, THE AUTHORS SHOW THAT MODERN SERVER PROCESSORS ARE HIGHLY
INEFFICIENT FOR RUNNING CLOUD WORKLOADS. THEY INVESTIGATE THE
MICROARCHITECTURAL BEHAVIOR OF SCALE-OUT WORKLOADS AND PRESENT
OPPORTUNITIES TO ENABLE SPECIALIZED PROCESSOR DESIGNS THAT CLOSELY MATCH THE
NEEDS OF THE CLOUD.

......

Cloud computing is emerging as


a dominant computing platform for delivering scalable online services to a global client
base. Today's popular online services, such as
web search, social networks, and video sharing, are all hosted in large scale-out datacenters. With the industry rapidly expanding,
service providers are building new datacenters, augmenting the existing infrastructure
to meet the increasing demand. However,
while demand for cloud infrastructure continues to grow, the semiconductor manufacturing industry has reached the physical
limits of voltage scaling,1,2 no longer able to
reduce power consumption or increase power
density in new chips. Physical constraints
have therefore become the dominant limiting
factor, because the size and power demands
of larger datacenters cannot be met.
Although major design changes are being
introduced at the board and chassis levels of
new cloud servers, the processors used in
modern servers were originally created for
desktops and are not designed to efficiently
run scale-out workloads. Processor vendors
use the same underlying microarchitecture


for servers and for the general-purpose market, leading to extreme inefficiency in today's datacenters. Moreover, both general-purpose and traditional server processor designs follow a trajectory that benefits scale-up workloads, a trend that was established for desktop processors long before the emergence of scale-out workloads.
In this article, based on our paper for the
17th International Conference on Architectural Support for Programming Languages
and Operating Systems,3 we observe that
scale-out workloads share many inherent
characteristics that place them into a workload class distinct from desktop, parallel, and
traditional server workloads. We perform a
detailed microarchitectural study of a range
of scale-out workloads, nding a large mismatch between the demands of the scale-out
workloads and todays predominant processor microarchitecture. We observe signicant overprovisioning of the memory
hierarchy and core microarchitectural resources for the scale-out workloads.
We use performance counters to study the
behavior of scale-out workloads running on

Michael Ferdman
Stony Brook University
Almutaz Adileh
Ghent University
Onur Kocberber
Stavros Volos
Mohammad Alisafaee
Djordje Jevdjic
Cansu Kaynak
Adrian Daniel Popescu
Anastasia Ailamaki
Babak Falsafi
École Polytechnique Fédérale de Lausanne


modern server processors. On the basis of


our analysis, we demonstrate the following:


• Scale-out workloads suffer from high instruction-cache miss rates. Instruction caches and associated next-line prefetchers found in modern processors are inadequate for scale-out workloads.
• Instruction-level parallelism (ILP) and memory-level parallelism (MLP) in scale-out workloads are low. Modern aggressive out-of-order cores are excessively complex, consuming power and on-chip area without providing performance benefits to scale-out workloads.
• Data working sets of scale-out workloads considerably exceed the capacity of on-chip caches. Processor real estate and power are misspent on large last-level caches that do not contribute to improved scale-out workload performance.
• On- and off-chip bandwidth requirements of scale-out workloads are low. Scale-out workloads see no benefit from fine-grained coherence and excessive memory and core-to-core communication bandwidth.

Continuing the current processor trends will further widen the mismatch between scale-out workloads and server processors. Conversely, the characteristics of scale-out workloads can be effectively leveraged to specialize processors for these workloads in order to gain area and energy efficiency in future servers. An example of such a specialized processor design that matches the needs of scale-out workloads is Scale-Out Processor,4 which has been shown to improve the system throughput and the overall datacenter cost efficiency by almost an order of magnitude.5

Modern cores and scale-out workloads


Today's datacenters are built around conventional desktop processors whose architecture was designed for a broad market. The
dominant processor architecture has closely
followed the technology trends, improving
single-thread performance with each processor generation by using the increased clock

speeds and free (in area and power) transistors provided by progress in semiconductor
manufacturing. Although Dennard scaling
has stopped,1,2,6,7 with both clock frequency
and transistor counts becoming limited by
power, processor architects have continued to
spend resources on improving single-thread
performance for a broad range of applications
at the expense of area and power efficiency.
In this article, we study a set of applications that dominate today's cloud infrastructure. We examined a selection of Internet
services on the basis of their popularity. For
each popular service, we analyzed the class
of application software used by major providers to offer these services, either on their
own cloud infrastructure or on a cloud
infrastructure leased from a third party.
Overall, we found that scale-out workloads
have similar characteristics. All applications
we examined


• operate on large data sets that are distributed across a large number of machines, typically into memory-resident shards;
• serve large numbers of completely independent requests that do not share any state;
• have application software designed specifically for the cloud infrastructure, where unreliable machines may come and go; and
• use connectivity only for high-level task management and coordination.
Specifically, we identified and studied the
following workloads: an in-memory object
cache (Data Caching); a NoSQL persistent
data store (Data Serving); data filtering,
transformation, and analysis (MapReduce); a
video-streaming service (Media Streaming); a
large-scale irregular engineering computation
(SAT Solver); a dynamic Web 2.0 service
(Web Frontend); and an online search engine
node (Web Search). To highlight the differences between scale-out workloads and
traditional workloads, we evaluated cloud
workloads alongside the following traditional
benchmark suites: Parsec 2.1 Parallel workloads, SPEC CPU2006 desktop and engineering workloads, SPECweb09 traditional web
services, TPC-C traditional transaction processing workload, TPC-E modern transaction


Table 1. Architectural parameters.

Processor: 32-nm Intel Xeon X5670, operating at 2.93 GHz
Chip multiprocessor width: Six out-of-order cores
Core width: Four-wide issue and retire
Reorder buffer: 128 entries
Load-store queue: 48/32 entries
Reservation stations: 36 entries
Level-1 caches: 32 Kbytes instruction and 32 Kbytes data, four-cycle access latency
Level-2 cache: 256 Kbytes per core, six-cycle access latency
Last-level cache (Level-3 cache): 12 Mbytes, 29-cycle access latency
Memory: 24 Gbytes, three double-data-rate three (DDR3) channels, delivering up to 32 Gbytes/second

processing workload, and MySQL Web 2.0


back-end database.

Methodology
We conducted our study on a PowerEdge
M1000e enclosure with two Intel X5670 processors and 24 Gbytes of RAM in each blade,
using Intel VTune to analyze the system's microarchitectural behavior. Each Intel X5670 processor includes six aggressive out-of-order processor cores with a three-level cache hierarchy: the L1 and L2 caches are private to each core; the last-level cache (LLC), the L3 cache, is shared among all cores. Each core includes several simple stride and stream prefetchers, labeled as "adjacent-line," "HW prefetcher," and "DCU streamer" in the processor documentation and system BIOS
settings. The blades use high-performance
Broadcom server network interface controllers
(NICs) with drivers that support multiple
transmit queues and receive-side scaling. The
NICs are connected by a built-in M6220
switch. For bandwidth-intensive benchmarks,
2-Gbit NICs are used in each blade.
Table 1 summarizes the blades' key architectural parameters. We limited all workload configurations to four cores, tuning the workloads to achieve high utilization of the cores (or hardware threads, in the case of the SMT experiments), while maintaining the workload quality-of-service requirements. To ensure that all application and operating system software runs on the cores under test, we disabled all unused cores using the available operating system mechanisms.
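The article doesn't name the exact mechanism; as an illustrative sketch, on a Linux host the CPU-hotplug sysfs interface can take cores offline (the core numbering below is hypothetical):

    /* Illustrative sketch (assumes a Linux host; the article only says
     * "available operating system mechanisms"): writing "0" to
     * /sys/devices/system/cpu/cpuN/online removes core N from the
     * scheduler, leaving the workload on the cores under test. */
    #include <stdio.h>

    static int set_cpu_online(int cpu, int online)
    {
        char path[64];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/online", cpu);
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;     /* needs root; cpu0 often cannot be offlined */
        fprintf(f, "%d\n", online);
        fclose(f);
        return 0;
    }

    int main(void)
    {
        /* Hypothetical layout: keep cores 0-3 for the workload and
         * disable the rest of a 12-core (two-socket) machine. */
        for (int cpu = 4; cpu < 12; cpu++)
            if (set_cpu_online(cpu, 0) != 0)
                fprintf(stderr, "could not offline cpu%d\n", cpu);
        return 0;
    }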

Results
We explore the microarchitectural behavior of scale-out workloads by examining the
commit-time execution breakdown in Figure 1. We classify each cycle of execution as
Committing if at least one instruction was
committed during that cycle, or as Stalled
otherwise. We note that computing a breakdown of the execution-time stall components
of superscalar out-of-order processors cannot
be performed precisely because of overlapped
work in the pipeline. We therefore present
execution-time breakdown results based on
the performance counters that have no overlap. Alongside the breakdown, we show the
Memory cycles, which approximate time
spent on long-latency memory accesses, but
potentially partially overlap with instruction
commits.
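As a rough sketch of that bookkeeping (the counter names here are illustrative placeholders, not the exact events used in the study), the breakdown reduces to a few ratios over the sampled counts:

    /* Illustrative sketch of the commit-time breakdown; the three inputs are
     * hypothetical counter totals, not the exact performance events used in
     * the study. "Memory" may partially overlap with committing cycles, so
     * it is reported alongside the breakdown rather than added to it. */
    struct breakdown {
        double committing;   /* fraction of cycles retiring >= 1 instruction */
        double stalled;      /* fraction of cycles retiring nothing          */
        double memory;       /* approximate fraction with a long-latency     */
                             /* memory access outstanding                    */
    };

    static struct breakdown classify(unsigned long long total_cycles,
                                     unsigned long long cycles_with_commit,
                                     unsigned long long mem_stall_cycles)
    {
        struct breakdown b;
        b.committing = (double)cycles_with_commit / (double)total_cycles;
        b.stalled    = 1.0 - b.committing;
        b.memory     = (double)mem_stall_cycles / (double)total_cycles;
        return b;
    }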
The execution-time breakdown of scale-out workloads is dominated by stalls in both
the application code and operating system.
Notably, most of the stalls in scale-out workloads arise because of long-latency memory
accesses. This behavior is in contrast to the
CPU-intensive desktop (SPEC2006) and
parallel (Parsec) benchmarks, which stall execution significantly less than 50 percent
of the cycles and experience only a fraction


[Figure 1 plot: stacked bars per workload; y-axis, fraction of total execution cycles; segments: Committing (Application), Committing (OS), Stalled (Application), Stalled (OS); overlaid: Memory cycles.]

Figure 1. Execution-time breakdown and memory cycles of scale-out workloads (left) and traditional benchmarks (right). Execution time is further broken down into its application and operating system components.


of the stalls due to memory accesses. Furthermore, although the execution-time breakdown of some scale-out workloads (such
as MapReduce and SAT Solver) appears
similar to the memory-intensive Parsec and
SPEC2006 benchmarks, the nature of these
workloads' stalls is different. Unlike the scale-out workloads, many Parsec and SPEC2006 applications frequently stall because of pipeline flushes after wrong-path instructions,
with much of the memory access time not on
the critical path of execution.
Scale-out workloads show memory system
behavior that more closely matches traditional online transaction processing workloads (TPC-C, TPC-E, and Web Backend).
However, we observe that scale-out workloads differ considerably from traditional
online transaction processing (TPC-C), which
spends more than 80 percent of the time
stalled, owing to dependent memory accesses.
We find that scale-out workloads are most
similar to the more recent transaction processing benchmarks (TPC-E) that use more complex data schemas or perform more complex
queries than traditional transaction processing. We also observe that a traditional enterprise web workload (SPECweb09) behaves
differently than the Web Frontend workload,
representative of modern scale-out configurations. Although the traditional web workload is dominated by serving static files and a
few dynamic scripts, modern scalable web

workloads like Web Frontend handle a


much higher fraction of dynamic requests,
leading to higher core utilization and less OS
involvement.
Although the behavior across scale-out
workloads is similar, the class of scale-out
workloads as a whole differs significantly
from other workloads. Processor architectures optimized for desktop and parallel
applications are not optimized for scale-out
workloads that spend most of their time waiting for cache misses, resulting in a clear
microarchitectural mismatch. At the same
time, architectures designed for workloads
that perform only trivial computation and
spend all of their time waiting on memory
(such as SPECweb09 and TPC-C) also cannot cater to scale-out workloads.

Front-end inefficiencies
There are three major front-end inefficiencies:

• Cores are idle because of high instruction-cache miss rates.
• L2 caches increase average instruction-fetch latency.
• Excessive LLC capacity leads to long instruction-fetch latency.
Instruction-fetch stalls play a critical role
in processor performance by preventing the
core from making forward progress because
of a lack of instructions to execute. Front-end


[Figure 2 plot: bars per workload; y-axis, instruction misses per k-instruction; segments: L1-I (Application), L1-I (OS), L2 (Application), L2 (OS).]

Figure 2. L1 and L2 instruction miss rates for scale-out workloads (left) and traditional benchmarks (right). The miss rate is broken down into its application and operating system components.

stalls serve as a fundamental source of inefficiency for both area and power, because the
core real estate and power consumption are
entirely wasted for the cycles that the front
end spends fetching instructions.
Figure 2 presents the instruction miss
rates of the L1 instruction cache and the L2
cache. In contrast to desktop and parallel
benchmarks, the instruction working sets of
many scale-out workloads considerably
exceed the capacity of the L1 instruction
cache, resembling the instruction-cache
behavior of traditional server workloads.
Moreover, the instruction working sets of
most scale-out workloads also exceed the L2
cache capacity, where even relatively infrequent instruction misses incur considerable
performance penalties. We find that modern processor architectures can't tolerate the latency of the L1 instruction cache's misses, avoiding front-end stalls only for applications whose entire instruction working set fits into the L1 cache. Furthermore, the high L2 instruction miss rates indicate that the L1 instruction cache's capacity experiences a significant shortfall and can't be mitigated by the addition of a modestly sized L2 cache.
The disparity between the needs of the scale-out workloads and the processor architecture is apparent in the instruction-fetch path. Although exposed instruction-

fetch stalls serve as a key source of inefficiency


under any circumstances, the instruction-fetch path of modern processors actually
exacerbates the problem. The L2 cache experiences high instruction miss rates, increasing
the average fetch latency of the missing fetch
requests by placing an additional intermediate lookup structure on the path to retrieve
instruction blocks from the LLC. Moreover,
the entire instruction working set of any
scale-out workload is considerably smaller
than the LLC capacity. However, because the
LLC is a large cache with a high uniform
access latency, it contributes an unnecessarily
large instruction-fetch penalty (29 cycles to
access the 12-Mbyte cache).
To improve efficiency and reduce front-end stalls, processors built for scale-out workloads must bring instructions closer to the
cores. Rather than relying on a deep hierarchy of caches, a partitioned organization
that replicates instructions and makes them
available close to the requesting cores8 is
likely to considerably reduce front-end stalls.
To effectively use the on-chip real estate, the
system would need to share the partitioned
instruction caches among multiple cores,
striking a balance between the die area dedicated to replicating instruction blocks and
the latency of accessing those blocks from the
closest cores.


[Figure 3 plot: two panels; (a) application IPC and (b) application MLP per workload, each with Baseline and SMT bars.]

Figure 3. The instructions per cycle (IPC) and memory-level parallelism (MLP) of a simultaneous multithreading (SMT) enabled core. Application IPC for systems with and without SMT, out of a maximum IPC of 4 (a). MLP for systems with and without SMT (b). Range bars indicate the minimum and maximum of the corresponding group.

Furthermore, although modern processors include next-line prefetchers, high


instruction-cache miss rates and significant front-end stalls indicate that the prefetchers are ineffective for scale-out workloads. Scale-out workloads are written in high-level languages, use third-party libraries, and execute
operating system code, exhibiting complex
nonsequential access patterns that are not
captured by simple next-line prefetchers.
Including instruction prefetchers that predict
these complex patterns is likely to improve
overall processor efficiency by eliminating
wasted cycles due to front-end stalls.

Core inefficiencies
There are two major core inefficiencies:

• Low ILP precludes effectively using the full core width.
• The reorder buffer (ROB) and the load-store queue (LSQ) are underutilized because of low MLP.

Modern processors execute instructions


out of order to enable simultaneous execution of multiple independent instructions per
cycle (IPC). Additionally, out-of-order execution elides stalls due to memory accesses by
executing independent instructions that follow a memory reference while the long-latency cache access is in progress. Modern
processors support up to 128-instruction
windows, with the width of the processor
dictating the number of instructions that

can simultaneously execute in one cycle. In


addition to exploiting ILP, large instruction
windows can exploit MLP by finding independent memory accesses within the instruction window and performing the memory
accesses in parallel. The latency of LLC hits
and off-chip memory accesses cannot be hidden by out-of-order execution; achieving
high MLP is therefore key to achieving high
core utilization by reducing the data access
latency.
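As a contrived illustration of why address dependencies limit MLP (our example, not code from the study): in the first loop below each load's address depends on the previous load, so misses serialize; in the second, the addresses are independent and an out-of-order window can overlap several outstanding misses.

    /* Contrived illustration (not from the study). */
    struct node { struct node *next; long pad[7]; };   /* one 64-byte line */

    long chase(struct node *n, long steps)    /* low MLP: dependent loads */
    {
        long sum = 0;
        for (long i = 0; i < steps; i++) {
            sum += n->pad[0];
            n = n->next;     /* next address is known only after the load */
        }
        return sum;
    }

    long scan(const long *a, long len, long stride)  /* high MLP: independent loads */
    {
        long sum = 0;
        for (long i = 0; i < len; i += stride)
            sum += a[i];     /* addresses computable up front; misses overlap */
        return sum;
    }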
The processors we study use four-wide
cores that can decode, issue, execute, and
commit up to four instructions on each
cycle. However, in practice, ILP is limited
by dependencies. The Baseline bars in
Figure 3a show the average number of
instructions committed per cycle when
running on an aggressive four-wide out-of-order core. Despite the abundant availability of core resources and functional units,
scale-out workloads achieve a modest
application IPC, typically in the range of
0.6 (Data Caching and Media Streaming)
to 1.1 (Web Frontend). Although there
exist workloads that can benefit from wide
cores, with some CPU-intensive Parsec and
SPEC2006 applications reaching an IPC of
2.0 (indicated by the range bars in the figure), using wide processors for scale-out applications does not yield a significant benefit.
Modern processors have 32-entry or larger
load-store queues, enabling many memory-


reference instructions in the 128-instruction


window. However, just as instruction dependencies limit ILP, address dependencies limit
MLP. The Baseline bars in Figure 3b present
the MLP, ranging from 1.4 (Web Frontend)
to 2.3 (SAT Solver) for the scale-out workloads. These results indicate that the memory
accesses in scale-out workloads are replete
with complex dependencies, limiting the
MLP that can be found by modern aggressive
processors. We again note that while desktop
and parallel applications can use high-MLP
support, with some Parsec and SPEC2006
applications having an MLP up to 5.0, support for high MLP is not useful for scale-out
applications. However, we nd that scale-out
workloads generally exhibit higher MLP than
traditional server workloads. Noting that
such characteristics lend themselves well to
multithreaded cores, we examine the IPC and
MLP of an SMT-enabled core in Figure 3. As
expected, the MLP found and exploited by
the cores when two independent application
threads run on each core concurrently nearly
doubles compared to the system without
SMT. Unlike traditional database server
workloads that contain many inter-thread
dependencies and locks, the independent
nature of threads in scale-out workloads enables them to observe considerable performance benets from SMT, with 39 to 69
percent improvements in IPC.
Support for four-wide out-of-order execution with a 128-instruction window and up
to 48 outstanding memory requests requires
multiple-branch prediction, numerous arithmetic logic units (ALUs), forwarding paths,
many-ported register banks, large instruction
scheduler, highly associative ROB and LSQ,
and many other complex on-chip structures.
The complexity of the cores limits core
count, leading to chip designs with several
cores that consume half the available on-chip
real estate and dissipate the vast majority of
the chip's dynamic power budget. However, our results indicate that scale-out workloads exhibit low ILP and MLP, deriving benefit only from a small degree of out-of-order execution. As a result, the nature of scale-out workloads cannot effectively utilize the available core resources. Both the die area and the energy are wasted, leading to datacenter inefficiency.

The nature of scale-out workloads makes


them ideal candidates to exploit multithreaded multicore architectures. Modern
mainstream processors offer excessively complex cores, resulting in inefficiency through
resource waste. At the same time, our results
indicate that niche processors offer excessively
simple (for example, in-order) cores that cannot leverage the available ILP and MLP in
scale-out workloads. We find that scale-out
workloads match well with architectures
offering multiple independent threads per
core with a modest degree of superscalar
out-of-order execution and support for several simultaneously outstanding memory
accesses. For example, rather than implementing SMT on a four-way core, we could
use two independent two-way cores, which
would consume fewer resources while achieving higher aggregate performance. Furthermore, each narrower core would not require
a large instruction window, reducing the per-core area and power consumption compared
to modern processors and enabling higher
computational density by integrating more
cores per chip.

Data-access inefficiencies
There are two major data-access inefficiencies:

• Large LLC consumes area, but does not improve performance.
• Simple data prefetchers are ineffective.

More than half of commodity processor


die area is dedicated to the memory system.
Modern processors feature a three-level cache
hierarchy, where the LLC is a large-capacity
cache shared among all cores. To enable
high-bandwidth data fetch, each core can
have up to 16 L2 cache misses in flight. The
high-bandwidth on-chip interconnect enables cache-coherent communication between
the cores. To mitigate the capacity and
latency gap between the L2 caches and the
LLC, each L2 cache is equipped with prefetchers that can issue prefetch requests into
the LLC and off-chip memory. Multiple
DDR3 memory channels provide high-bandwidth access to off-chip memory.
The LLC is the largest on-chip structure;
its cache capacity has been increasing
with each processor generation, thanks to


[Figure 4 plot: x-axis, cache size (Mbytes); y-axis, user IPC normalized to baseline; lines: Scale-out, Server, SPEC2006 (mcf).]

Figure 4. Performance sensitivity to the last-level cache (LLC) capacity. Relatively small average performance degradation due to reduced cache capacity is shown for the scale-out and server workloads, in contrast to some traditional applications (such as mcf).

[Figure 5 plot: bars per workload; y-axis, L2 hit ratio; series: Baseline (all enabled), Adjacent-line disabled, HW prefetcher disabled.]

Figure 5. L2 hit ratios of a system with enabled and disabled adjacent-line and HW prefetchers. Unlike for Parsec and SPEC2006 applications, minimal performance difference is observed for the scale-out and server workloads.


semiconductor manufacturing improvements. We investigate the utility of growing


the LLC capacity for scale-out workloads in
Figure 4 through a cache sensitivity analysis
by dedicating two cores to cache-polluting
threads. The polluter threads traverse arrays
of predetermined size in a pseudorandom
sequence, ensuring that all accesses miss in
the upper-level caches and reach the LLC.
We use performance counters to confirm that

the polluter threads achieve nearly a 100 percent hit ratio in the LLC, effectively reducing
the cache capacity available for the workload
running on the remaining cores of the same
processor.
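A minimal sketch of such a polluter thread (our illustration, not the authors' code), assuming 64-byte cache lines: it links the lines of a buffer into a random cycle and chases the resulting pointer chain, so nearly every access misses the private L1/L2 caches and occupies the targeted share of the LLC.

    /* Minimal polluter sketch (illustrative, not the authors' code).
     * Assumes 64-byte cache lines and a buffer size that exceeds the
     * private L1/L2 caches but fits the LLC share to be occupied. */
    #include <stdlib.h>

    #define LINE 64

    static void *make_chain(size_t bytes)      /* random cyclic pointer chain */
    {
        size_t n = bytes / LINE;
        char *buf = malloc(bytes);
        if (n < 2) return buf;
        size_t *perm = malloc(n * sizeof *perm);
        for (size_t i = 0; i < n; i++) perm[i] = i;
        for (size_t i = n - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (size_t i = 0; i < n; i++)         /* link lines into one cycle */
            *(void **)(buf + perm[i] * LINE) = buf + perm[(i + 1) % n] * LINE;
        void *start = buf + perm[0] * LINE;
        free(perm);
        return start;
    }

    static void *pollute(size_t bytes, long iters)
    {
        void *p = make_chain(bytes);
        for (long i = 0; i < iters; i++)
            p = *(void **)p;                   /* each access misses L1/L2 */
        return p;                              /* keep the chase observable */
    }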
We plot the average system performance
of scale-out workloads as a function of the
LLC capacity, normalized to a baseline system with a 12-Mbyte LLC. Unlike in the
memory-intensive desktop applications (such
as SPEC2006 mcf), we find minimal performance sensitivity to LLC size above 4 to 6
Mbytes in scale-out and traditional server
workloads. The LLC captures the instruction
working sets of scale-out workloads, which
are less than 2 Mbytes. Beyond this point,
small shared supporting structures may consume another 1 to 2 Mbytes. Because scale-out workloads operate on massive datasets
and service a large number of concurrent
requests, both the dataset and the per-client
data are orders of magnitude larger than the
available on-chip cache capacity. As a result,
an LLC that captures the instruction working
set and minor supporting data structures
achieves nearly the same performance as an
LLC with double or triple the capacity.
In addition to leveraging MLP to overlap
demand requests from the processor core,
modern processors use prefetching to speculatively increase MLP. Prefetching has been
shown effective at reducing cache miss rates
by predicting block addresses that will be referenced in the future and bringing these
blocks into the cache prior to the processor's
demand, thereby hiding the access latency. In
Figure 5, we present the hit ratios of the L2
cache when all available prefetchers are
enabled (Baseline), as well as the hit ratios
after disabling the prefetchers. We observe a
noticeable degradation of the L2 hit ratios of
many desktop and parallel applications when
the adjacent-line prefetcher and L2 hardware
prefetcher are disabled. In contrast, only one
of the scale-out workloads (MapReduce) significantly benefits from these prefetchers,
with the majority of the workloads experiencing negligible changes in the cache hit rate.
Moreover, similar to traditional server workloads (TPC-C), disabling the prefetchers
results in an increase in the hit ratio for some
scale-out workloads (Data Caching, Media
Streaming, and SAT Solver). Finally, we note



that the DCU streamer (not shown) provides


no benefit to scale-out workloads, and in
some cases marginally increases the L2 miss
rate because it pollutes the cache with
unnecessary blocks.
Our results show that the on-chip resources devoted to the LLC are one of the key
limiters of scale-out application computational density in modern processors. For traditional workloads, increasing the LLC
capacity captures the working set of a broader
range of applications, contributing to
improved performance, owing to a reduction
in average memory latency for those applications. However, because the LLC capacity
already exceeds the scale-out application
requirements by 2 to 3 times, whereas the
next working set exceeds any possible SRAM
cache capacity, the majority of the die area
and power currently dedicated to the LLC is
wasted. Moreover, prior research has shown
that increases in the LLC capacity that do not
capture a working set lead to an overall performance degradation; LLC access latency is
high due to its large capacity, not only wasting on-chip resources, but also penalizing all
L2 cache misses by slowing down LLC hits
and delaying off-chip accesses.
Although modern processors grossly overprovision the memory system, we can
improve datacenter efficiency by matching the processor design to the needs of the scale-out workloads. Whereas modern processors dedicate approximately half of the die area to the LLC, scale-out workloads would likely benefit from a different balance. A two-level cache hierarchy with a modestly sized LLC that makes a special provision for caching instruction blocks would benefit performance. The reduced LLC capacity along with the removal of the ineffective L2 cache would offer access-latency benefits while also freeing up die area and power. The die area and power can be applied toward improving computational density and efficiency by adding more hardware contexts and more
advanced prefetchers. Additional hardware
contexts (more threads per core and more
cores) should linearly increase application
parallelism, and more advanced correlating
data prefetchers could accurately prefetch
complex access data patterns and increase the
performance of all cores.

Figure 6. Percentage of LLC data references accessing cache blocks modified by a remote core (y-axis: read-write shared LLC hits normalized to LLC data references, split into Application and OS components). In scale-out workloads, the majority of the remotely accessed cache blocks are from the operating system code.

Bandwidth inefficiencies
The major bandwidth inefficiencies are

• Lack of data sharing deprecates coherence and connectivity.
• Off-chip bandwidth exceeds needs by an order of magnitude.

Increasing core counts have brought parallel programming into the mainstream,
highlighting the need for fast and high-bandwidth inter-core communication. Multithreaded applications comprise a collection
of threads that work in tandem to scale up
the application performance. To enable effective scale-up, each subsequent generation of
processors offers a larger core count and
improves the on-chip connectivity to support
faster and higher-bandwidth core-to-core
communication.
We investigate the utility of the on-chip
interconnect for scale-out workloads in
Figure 6. To measure the frequency of read-write sharing, we execute the workloads on
cores split across two physical processors in
separate sockets. When reading a recently
modified block, this configuration forces
accesses to actively shared read-write blocks
to appear as off-chip accesses to a remote processor cache. We plot the fraction of L2
misses that access data most recently written
by another thread running on a remote core,
breaking down each bar into Application and


[Figure 7 plot: bars per workload; y-axis, off-chip memory bandwidth utilization (percent); segments: Application, OS.]

Figure 7. Average off-chip memory bandwidth utilization as a percentage of available off-chip bandwidth. Even at peak system utilization, all workloads exercise only a small fraction of the available memory bandwidth.


OS components to offer insight into the


source of the data sharing.
In general, we observe limited read-write
sharing across the scale-out applications. We
find that the OS-level data sharing is dominated by the network subsystem, seen most
prominently in the Data Caching workload,
which spends the majority of its time in the
OS. This observation highlights the need to
optimize the OS to reduce the amount of false
sharing and data movement in the scheduler
and network-related data structures. Multithreaded Java-based applications (Data Serving and Web Search) exhibit a small degree of
sharing due to the use of a parallel garbage
collector that may run a collection thread on
a remote core, artificially inducing application-level communication. Additionally, we
found that the Media Streaming server
updates global counters to track the total
number of packets sent; reducing the amount
of communication by keeping per-thread statistics is trivial and would eliminate the mutex lock and shared-object scalability bottleneck (sketched below), an optimization that is already present in the Data Caching server we use. The on-chip application-level communication in
scale-out workloads is distinctly different
from traditional database server workloads
(TPC-C, TPC-E, and Web Backend), which
experience frequent interaction between threads

on actively shared data structures that are used


to service client requests.
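The per-thread statistics optimization mentioned above can be sketched as follows (our illustration, not the Media Streaming server's code): instead of one mutex-protected global packet counter that every sender thread updates, each thread increments its own padded slot, and an infrequent reader sums the slots.

    /* Illustrative sketch (not the actual server code). */
    #include <pthread.h>
    #include <stdint.h>

    #define MAX_THREADS 64

    /* Contended version: every packet send takes a global lock and writes
     * the same cache line, creating the shared-object bottleneck. */
    static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;
    static uint64_t packets_sent_global;

    void count_packet_shared(void)
    {
        pthread_mutex_lock(&stats_lock);
        packets_sent_global++;
        pthread_mutex_unlock(&stats_lock);
    }

    /* Per-thread version: each thread touches only its own padded slot,
     * so there is no lock and no actively shared written line. */
    struct padded_counter { uint64_t n; char pad[64 - sizeof(uint64_t)]; };
    static struct padded_counter per_thread[MAX_THREADS];

    void count_packet_private(int tid)
    {
        per_thread[tid].n++;
    }

    uint64_t packets_sent_total(void)   /* infrequent reader sums the slots; */
    {                                   /* use per-slot atomics if an exact  */
        uint64_t sum = 0;               /* instantaneous count is required   */
        for (int t = 0; t < MAX_THREADS; t++)
            sum += per_thread[t].n;
        return sum;
    }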
The low degree of active sharing indicates
that wide and low-latency interconnects
available in modern processors are overprovisioned for scale-out workloads. Although the
overhead with a small number of cores is limited, as the number of cores on chip increases,
the area and energy overhead of enforcing
coherence becomes significant. Likewise, the area overheads and power consumption of an overprovisioned high-bandwidth interconnect further increase processor inefficiency.
Beyond the on-chip interconnect, we also find off-chip bandwidth inefficiency. While
the off-chip memory latency has improved
slowly, off-chip bandwidth has been improving at a rapid pace. Over the course of two
decades, the memory bus speeds have increased from 66 MHz to dual-data-rate at
over 1 GHz, raising the peak theoretical bandwidth from 544 Mbytes/second to 17 Gbytes/
second per channel, with the latest server processors having four independent memory
channels. In Figure 7, we plot the per-core
off-chip bandwidth utilization of our workloads as a fraction of the available per-core off-chip bandwidth. Scale-out workloads experience nonnegligible off-chip miss rates, but the
MLP of the applications is low, owing to the
complex data structure dependencies. The
combination of low MLP and the small number of hardware threads on the chip leads to
low aggregate off-chip bandwidth utilization
even when all cores have outstanding off-chip
memory accesses. Among the scale-out workloads we examine, Media Streaming is the
only application that uses up to 15 percent of
the available off-chip bandwidth. However,
our applications are configured to stress the processor, actually demonstrating the worst-case behavior. Overall, modern processors are not able to utilize the available memory bandwidth, which is significantly over-provisioned
for scale-out workloads.
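As a sanity check on those peak figures (our arithmetic, assuming a 64-bit, that is 8-byte, channel; the exact DDR3 transfer rate, taken here as 2,133 MT/s, is an assumption, so the results land near rather than exactly on the quoted values):

$$\text{peak bandwidth per channel} = \text{bus width} \times \text{transfer rate}$$
$$8~\text{bytes} \times 66~\text{MT/s} \approx 0.53~\text{Gbytes/s}, \qquad 8~\text{bytes} \times 2{,}133~\text{MT/s} \approx 17~\text{Gbytes/s}.$$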
The on-chip interconnect and off-chip
memory buses can be scaled back to improve
processor efficiency. Because the scale-out
workloads perform only infrequent communication via the network, there is typically no
read-write sharing in the applications; processors can therefore be designed as a collection of core islands using a low-bandwidth


interconnect that does not enforce coherence


between the islands, eliminating the power
associated with the high-bandwidth interconnect as well as the power and area overheads of
fine-grained coherence tracking.4 Off-chip
memory buses can be optimized for scale-out
workloads by scaling back unnecessary bandwidth for systems with an insufficient number
of cores. Memory controllers consume a large
fraction of the chip area, and memory buses are
responsible for a large fraction of the system
power. Reducing the number of memory channels and the power draw of the memory buses
should improve scale-out workload efficiency
without affecting application performance.
However, instead of taking a step backward and
scaling back the memory bandwidth to match
the requirements and throughput of conventional processors, a more effective solution
would be to increase the processor throughput
through specialization and thus utilize the available bandwidth resources.4

The impending plateau of voltage levels and a continued increase in chip density are forcing efficiency to be the primary driver of future processor designs. Our analysis shows that efficiently executing scale-out workloads requires optimizing the instruction-fetch path for multi-megabyte instruction working sets; reducing the core aggressiveness and LLC capacity to free area and power resources in favor of more cores, each with more hardware threads; and scaling back the overprovisioned on-chip and off-chip bandwidth. We demonstrate that modern processors, built to accommodate a broad range of workloads, sacrifice efficiency, and that current processor trends serve to further exacerbate the problem. On the other hand, we outline steps that can be taken to specialize processors for the key workloads of the future, enabling efficient execution by closely aligning the processor microarchitecture with the microarchitectural needs of scale-out workloads. Following these steps can result in up to an order of magnitude improvement in throughput per processor chip, and in the overall datacenter efficiency.5 MICRO

Acknowledgments
We thank the reviewers and readers for their feedback and suggestions on all earlier versions of this work. We thank the PARSA lab for continual support and feedback, in particular Pejman Lotfi-Kamran and Javier Picorel for their assistance with the SPECweb09 and SAT Solver benchmarks. We thank the DSLab for their assistance with SAT Solver, and Aamer Jaleel and Carole Jean-Wu for their assistance with understanding the Intel prefetchers and configuration. We thank the EuroCloud project partners for advocating and inspiring the CloudSuite benchmark suite. This work was partially supported by EuroCloud, project no. 247779 of the European Commission 7th RTD Framework Programme (Information and Communication Technologies: Computing Systems).

....................................................................
References

1. M. Horowitz et al., "Scaling, Power, and the Future of CMOS," Proc. Electron Devices Meeting, 2005, pp. 7-15.
2. N. Hardavellas et al., "Toward Dark Silicon in Servers," IEEE Micro, vol. 31, no. 4, 2011, pp. 6-15.
3. M. Ferdman et al., "Clearing the Clouds: A Study of Emerging Scale-Out Workloads on Modern Hardware," Proc. 17th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2012, pp. 37-48.
4. P. Lotfi-Kamran et al., "Scale-Out Processors," Proc. 39th Int'l Symp. Computer Architecture, 2012, pp. 500-511.
5. B. Grot et al., "Optimizing Data-Center TCO with Scale-Out Processors," IEEE Micro, vol. 32, no. 5, 2011, pp. 52-63.
6. H. Esmaeilzadeh et al., "Dark Silicon and the End of Multicore Scaling," Proc. 38th Int'l Symp. Computer Architecture, 2011, pp. 365-376.
7. G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations," Proc. 15th Conf. Architectural Support for Programming Languages and Operating Systems, 2010, pp. 205-218.
8. N. Hardavellas et al., "Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches," Proc. 36th Int'l Symp. Computer Architecture, 2009, pp. 184-195.



Michael Ferdman is an assistant professor


in the Department of Computer Science at
Stony Brook University. His research focuses
on computer architecture, particularly on
server system design. Ferdman has a PhD in
electrical and computer engineering from
Carnegie Mellon University.
Almutaz Adileh is a PhD candidate in the
Department of Computer Science at Ghent
University. His research focuses on computer
architecture, particularly on improving performance in power-limited chips. Adileh has
an MSc in computer engineering from the
University of Southern California.
Onur Kocberber is a PhD candidate in the
School of Computer and Communication

Sciences at École Polytechnique Fédérale de Lausanne. His research focuses on specialized architectures for server systems. Kocberber has an MSc in computer engineering
from TOBB University of Economics and
Technology.
Stavros Volos is a PhD candidate in the
School of Computer and Communication

Sciences at École Polytechnique Fédérale de Lausanne. His research focuses on computer architecture, particularly on memory
systems for high-throughput and energyaware computing. Volos has a Dipl-Ing in
electrical and computer engineering from
the National Technical University of Athens.
Mohammad Alisafaee performed the work
for this article while he was a researcher in
the School of Computer and Communication Sciences at École Polytechnique Fédérale de Lausanne. His research interests
include multiprocessor cache coherence and
memory system design for commercial
workloads. Alisafaee has an MSc in electrical
and computer engineering from the University of Tehran.

............................................................

42

micro
IEEE

Djordje Jevdjic is a PhD candidate in the


School of Computer and Communication

Sciences at École Polytechnique Fédérale de Lausanne. His research focuses on high-performance memory systems for servers,
including on-chip DRAM caches and

3D-die stacking, with an emphasis on locality and energy efficiency. Jevdjic has an MSc
in electrical and computer engineering from
the University of Belgrade.
Cansu Kaynak is a PhD candidate in the
School of Computer and Communication

Sciences at École Polytechnique Fédérale de
Lausanne. Her research focuses on server
systems, especially memory system design.
Kaynak has a BSc in computer engineering
from TOBB University of Economics and
Technology.
Adrian Daniel Popescu is a PhD candidate
in the School of Computer and Communication Sciences at École Polytechnique Fédérale de Lausanne. His research focuses
on the intersection of database management
systems with distributed systems, specifically
query performance prediction. Popescu has
an MSc in electrical and computer engineering from the University of Toronto.
Anastasia Ailamaki is a professor in the
School of Computer and Communication

Sciences at École Polytechnique Fédérale de
Lausanne. Her research interests include
optimizing database software for emerging
hardware and I/O devices and automating
database management to support scientific
applications. Ailamaki has a PhD in computer science from the University of
Wisconsin-Madison.
Babak Falsafi is a professor in the School of
Computer and Communication Sciences at

École Polytechnique Fédérale de Lausanne
and the founding director of EcoCloud, an
interdisciplinary research center targeting robust, economic, and environmentally friendly
cloud technologies. Falsafi has a PhD in computer science from the University of Wisconsin-Madison.
Direct questions and comments about this article to Michael Ferdman, Stony Brook University, 1419 Computer Science, Stony Brook, NY 11794; mferdman@cs.stonybrook.edu.



..................................................................................................................................................................................................................

SMART: SINGLE-CYCLE MULTIHOP


TRAVERSALS OVER A SHARED
NETWORK ON CHIP

..................................................................................................................................................................................................................

SMART (SINGLE-CYCLE MULTIHOP ASYNCHRONOUS REPEATED TRAVERSAL) AIMS TO


DYNAMICALLY SET UP SINGLE-CYCLE PATHS (WITH TURNS) FROM THE SOURCE TO THE
DESTINATION FOR MESSAGE FLOWS SHARING A NETWORK ON CHIP. A FLOW-CONTROL
TECHNIQUE ARBITRATES FOR AND RESERVES MULTIPLE LINKS WITHIN A CYCLE. A ROUTER
AND LINK MICROARCHITECTURE ENABLES A MULTIHOP (9 TO 11 HOPS AT 1 GHZ IN 45 NM)
TRAVERSAL WITHIN A CYCLE.

......

Increasing the number of on-chip


cores continues to be the de facto strategy to
scale performance in the presence of two
trends: technology scaling (that is, more transistors) due to Moore's law, and single-core
frequency plateauing due to the Power Wall.
These cores are often connected by a shared
interconnect fabric, such as a ring or a mesh,
over which multiple communication flows
multiplex. As core counts go up, the average
number of hops between communicating
cores goes up as well, linearly with k for a k-node ring or a k × k mesh. (We define a hop to be the physical distance between neighboring tiles. In this paper, 1 hop = 1
mm, based on place-and-route of a Freescale
PowerPC e200z7 core in 45 nm.)
The number of hops directly impacts the
communication latency $T_N$ for a flit, which
is the smallest unit of a network packet, and
the granularity at which network resources


(links and buffers) are allocated. This can be expressed as:

$$T_N = H \cdot (t_r + t_w) + \sum_{h=1}^{H} t_c(h),$$

where $H$ is the number of hops, $t_r$ is the router's intrinsic delay, $t_w$ is the wire (between two routers) delay, and $T_c = \sum_{h=1}^{H} t_c(h)$ is the network contention delay (the number of cycles spent waiting to get access to the switch and output link). Multi-flit packets incur an additional component $T_s$, or serialization delay, which is set by the number of cycles that a packet of length $L$ takes to cross a channel with bandwidth $b$ (that is, the number of flits in the packet). The on-chip latency in
turn affects the completion time of cache
coherence transactions, because cache misses
often must be serviced by remote caches or
memory controllers. Slower requests and

Tushar Krishna
Chia-Hsin Owen Chen
Woo-Cheol Kwon
Li-Shiuan Peh
Massachusetts Institute of
Technology

Slower requests and responses lead to a slower injection of new requests due to dependencies, which leads to poorer throughput and overall system slowdown. The increased on-chip latency due to the increased number of hops is a worrisome trend, especially with ambitious design goals of adding hundreds of cores to a single chip for the exascale era.

In this work, we present a solution to achieve close to an ideal one-cycle network (TN = 1) traversal on a mesh for any source-destination pair. Our proposed network on chip (NoC) is called Smart (Single-cycle Multihop Asynchronous Repeated Traversal). As the name suggests, we embed asynchronous repeaters within each router's crossbar and size them to drive signals up to multiple hops (11 in this work) within a single clock cycle before getting latched. We present a network flow-control mechanism to set up arbitrary multihop paths with turns (that is, repeated wires on demand) within a cycle, and then traverse them within a cycle. We optimize network latency as follows:

TN = ⌈H/HPC⌉(tr + tw) + Σh=1..H tc(h)    (2)

where HPC stands for number of hops per cycle, the maximum value for which (HPCmax) depends on the underlying technology. We reduce the effective number of hops to ⌈H/HPC⌉, without adding any additional physical wires in the datapath between distant nodes.

On a 64-core mesh, synthetic traffic shows a 5- to 8-times reduction in average network latency, whereas full-system Splash-2 and Parsec traffic shows a 27 and 52 percent reduction in average runtime for private and shared level-2 (L2) designs, respectively, compared to a state-of-the-art NoC with one-cycle routers. Smart is a more scalable and less expensive solution than alternate approaches to reduce network latency, which are discussed in the "High-Radix Routers and Asynchronous NoCs" sidebar.
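As a rough illustration of Equations 1 and 2, the following Python sketch (ours, not from the original article; the hop count and HPC value are illustrative assumptions) compares a contention-free baseline traversal with a Smart traversal.

    import math

    def baseline_latency(hops, t_r=1, t_w=1, contention_cycles=0):
        # Equation 1: T_N = H * (t_r + t_w) + sum of per-hop contention delays.
        return hops * (t_r + t_w) + contention_cycles

    def smart_latency(hops, hpc, t_r=1, t_w=1, contention_cycles=0):
        # Equation 2: T_N = ceil(H / HPC) * (t_r + t_w) + sum of per-hop contention delays.
        return math.ceil(hops / hpc) * (t_r + t_w) + contention_cycles

    # Illustrative 14-hop route on an 8 x 8 mesh, no contention:
    print(baseline_latency(14))        # 28 cycles at two cycles per hop
    print(smart_latency(14, hpc=8))    # 4 cycles when up to 8 hops fit in one cycle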

Background

NoCs consist of shared links, with routers at crosspoints. Routers perform multiplexing of flits on the links, and buffer flits in case of contention. Each hop consists of a router-and-link traversal. A router performs the following actions:1

- Buffer Write (BW): The incoming flit is buffered.
- Route Compute (RC): The incoming head flit chooses an output port to depart from.
- Switch Allocation (SA): Buffered flits arbitrate among themselves for the crossbar switch. At the end of this stage, there is at most one winner for every input and output port of the crossbar.
- VC Selection (VS): Head flits that win SA reserve a Virtual Channel (VC) for the next router, from a pool of free VCs.2
- The winners of SA proceed to Switch (crossbar) Traversal (ST) and Link Traversal (LT) to reach the next routers.

A plethora of research in NoCs over the past decade coupled with technology scaling has allowed the actions within a router to move from serial execution to parallel execution via look-ahead routing,1 simplified VC selection,2 speculative switch arbitration,3,4 nonspeculative switch arbitration via look-aheads5,6 to bypass buffering, and so on. This has allowed the router delay tr (Equation 1) to drop from three to five cycles in industry prototypes7,8 to one cycle in academic NoC-only prototypes.6 We use this state-of-the-art one-cycle router as our baseline. ST and LT can be done together within a cycle,6,8 giving us tw = 1. Thus, our baseline incurs two cycles per hop (see Figure 1). In case of contention, flits have to be buffered and could wait multiple cycles before they win SA and VS, increasing Tc, as shown at Router_n+i.

The Smart interconnect


Adding asynchronous repeaters (that is, a
pair of inverters) at regular intervals on a long
wire is a standard way to reduce wire
delay.9,10 We perform a design-space exploration of repeated wires in a commercial
45-nm silicon on insulator (SOI) technology
using the place-and-route tool Cadence

Encounter. We fix the repeater spacing to 1 mm (our tile size), wire spacing to 3 times the minimum allowed by the technology (to lower the coupling capacitance), and keep increasing the length of the wire, letting the tool size the repeaters appropriately, until it fails timing closure at our target cycle time of 1 nanosecond (ns), that is, 1 GHz. We translate the maximum distance that a signal can be transmitted within 1 ns into a network microarchitectural parameter, hops per cycle max (HPCmax):

HPCmax = (maximum mm per ns × clock period in ns) / (tile width in mm)

Figure 2 shows that HPCmax for repeated wires at 45 nm is 16 (assuming 1-mm tiles and a 1-GHz clock). (The place-and-route tool was found to zigzag wires to fit a fixed global grid, adding unnecessary wire length and limiting HPCmax. A custom design can potentially go further and with a flatter energy profile, as projected by the timing-driven NoC power-modeling tool DSENT11). We observe a similar trend for HPCmax at 32 nm and 22 nm, with energy going down by 19 and 42 percent. At smaller technology nodes, global wires are not expected to become faster (unlike transistors); however, smaller tile sizes and fairly constant frequencies should translate to a higher HPCmax.

High-Radix Routers and Asynchronous NoCs

High-radix router designs such as Fat Tree,1 Flattened Butterfly,2 and Clos3 are topology solutions to reduce average hop counts. They advocate adding physical express links between distant routers. Each router now has more than five ports, and channel bandwidth (b) is often reduced proportionally to have similar buffer and crossbar area and power as a mesh (radix-5) router, increasing the total number of flits and adding serialization delay Ts to each packet. Moreover, more ports complicate the routing, switch allocation, and virtual channel allocation mechanisms, often requiring a hierarchical switch allocator and crossbar,4 increasing router delay tr to 4 to 5 at the routers where flits must stop. The pipeline optimizations described in the main article are hard to implement here. These designs also complicate layout because multiple point-to-point global wires must span the chip. Moreover, a topology solution works only for certain traffic, and incurs higher latencies for adversarial traffic (such as near neighbor) because of higher Ts. In contrast, Smart provides the illusion of dedicated physical express channels, embedded within a regular mesh network, without having to lower the channel bandwidth or increase the number of router ports. Given the same number of wires, it can virtually create the same high-radix topology with lower tr and no additional Ts.

Asynchronous networks on chip (NoCs) have been proposed for the system-on-a-chip (SoC) domain for deterministic traffic.5 Such a network is programmed statically to preset contention-free routes for quality of service, with messages then transmitted across a fully asynchronous NoC (routers and links). Instead, Smart couples clocked routers with asynchronous links, so the routers can perform fast cycle-by-cycle reconfiguration of the links, and thus handle general-purpose chip multiprocessors with nondeterministic traffic and variable contention scenarios. Asynchronous bypass channels target chips with multiple clock domains across a die,6 where each hop can incur significant synchronization delay. They aim to remove this synchronization delay. This leads them to propose sending a clock signal with the data so that the data can be latched correctly at the destination router. However, unlike Smart, bypass and buffer modes cannot be switched cycle by cycle, and flits must be speculatively latched at every hop.

References

1. W. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann, 2003.
2. J. Kim et al., "Flattened Butterfly Topology for On-Chip Networks," Proc. 40th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2007, pp. 172-182.
3. Y.-H. Kao et al., "CNoC: High-Radix Clos Network-on-Chip," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 30, no. 12, 2011, pp. 1897-1910.
4. J. Kim et al., "Microarchitecture of a High-Radix Router," Proc. 32nd Ann. Int'l Symp. Computer Architecture (ISCA 05), 2005, pp. 420-431.
5. T. Bjerregaard and J. Sparso, "A Router Architecture for Connection-Oriented Service Guarantees in the MANGO Clockless Network-on-Chip," Proc. Conf. Design, Automation and Test in Europe (DATE 05), 2005, pp. 1226-1231.
6. T.N.K. Jain et al., "Asynchronous Bypass Channels: Improving Performance for Multisynchronous NoCs," Proc. 4th ACM/IEEE Int'l Symp. Networks-on-Chip (NOCS 10), 2010, pp. 51-58.
Router logic delay limits the network frequency to 1 to 2 GHz at 45 nm.6,8 Link drivers are accordingly sized to drive only 1 mm
(1 hop) in 0.5 to 1 ns, before the signal is
latched at the next router. Smart removes this

Figure 1. Microarchitecture and pipeline of a state-of-the-art baseline (tr = 1) router. Each network traversal takes two cycles per hop (tr = RC + SA + VS; tw = ST + LT). RC and VS are only required for head flits.

Figure 2. Transmission energy (fJ/bit/mm) as a function of transmission distance (length/period, mm/ns) for repeated links at 1 GHz, comparing a clocked driver and 45-nm placed-and-routed wires with projected 45-, 32-, and 22-nm wires. A placed-and-routed repeated wire in 45 nm can go up to 16 mm in 1 ns. (Wire width: DRCmin; wire spacing: 3 × DRCmin; metal layer: M6; repeater spacing: 1 mm.)

constraint of latching signals at every hop. We exploit the positive slack in the link-traversal stage by replacing clocked link drivers with asynchronous repeaters at every hop, thus driving signals HPCmax hops within a cycle. HPCmax is a design-time parameter, which can be inferred from Figure 2. If we choose a 2-mm tile size, or a 2-GHz frequency, HPCmax will go down by half. Asynchronous repeaters also consume 14.3 percent lower energy per bit per mm than conventional clocked drivers, as Figure 2 shows, giving us a win-win. Smart is a better solution for exploiting the slack than deeper pipelining of the router with a higher clock frequency (such as Intel's 80-core 5-GHz five-stage router7), which, even if it were possible to do, does not reduce traversal latency (it only improves throughput) and adds huge power overheads owing to pipeline registers.
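A small helper (ours, not from the article) restates the HPCmax relation above; the 16-mm/ns figure is the 45-nm place-and-route number from Figure 2, and the variations reproduce the halving examples in the text.

    def hpc_max(wire_reach_mm_per_ns, clock_period_ns, tile_width_mm):
        # HPC_max = (maximum mm per ns x clock period in ns) / (tile width in mm)
        return int(wire_reach_mm_per_ns * clock_period_ns / tile_width_mm)

    print(hpc_max(16.0, 1.0, 1.0))   # 16 for 1-mm tiles at 1 GHz
    print(hpc_max(16.0, 1.0, 2.0))   # 8: a 2-mm tile halves HPC_max
    print(hpc_max(16.0, 0.5, 1.0))   # 8: a 2-GHz clock (0.5-ns period) also halves it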
Figure 3a shows a Smart router. For simplicity, we only show the Core_in (Cin), West_in (Win), and East_out (Eout) ports. (Cin does not have a bypass path like the other ports because all flits from the network interface controller [NIC] must be buffered at the first router before they can create Smart paths.) All other input ports are identical to Win, and all other output ports are identical to Eout. Each repeater must be sized to drive not just the link, but also the muxes (2:1 bypass and 4:1 crossbar) at the next router, before a new repeater is encountered. Using the same methodology with Cadence
Encounter, this reduces HPCmax to 11 at 1 GHz.

Figure 3. Changes to the router and pipeline to support single-cycle multihop traversals: Smart router microarchitecture (a) and pipeline (b). BWena, BMsel, and XBsel are set up during the control path (SSR + SA-G). During the datapath (ST + LT), the flit can cross multiple routers in a cycle if BWena is 0 and BMsel is set to bypass, and gets latched at the router where BWena is 1.
Figure 3a shows the three primary components of the design:

- Buffer Write enable (BWena) at the input flip-flop, which determines whether the input signal is latched;
- Bypass Mux select (BMsel) at the input of the crossbar, which chooses between the local buffered flit and the bypassing flit on the link; and
- Crossbar select (XBsel).

In the next section, we describe the flow control to preset these signals.

Smart in a k-ary 1-mesh


We start by demonstrating how Smart works in a k-ary 1-mesh (see Figure 4). Each router has three ports: West, East, and Core. (For illustration purposes, we only show Cin, Win, and Eout in the figures.) As Figure 3a shows, Eout_xb can be connected either to Cin_xb or Win_xb. The latter can be driven either by bypass, local, or 0, depending on BMsel.

The design is called Smart_1D (because routers can be bypassed only along one dimension). Bypassing routers at turns in a k-ary 2-mesh will be described later. For purposes of illustration, we will assume HPCmax to be 3.

Smart-hop setup request


Figure 3b shows the Smart router pipeline. A Smart-hop (single-cycle multihop path) begins from a start router (where flits are buffered). Unlike the baseline router,

Figure 4. k-ary 1-mesh with dedicated Smart-hop setup request (SSR) links going up to HPCmax (3 in this example) hops in each direction. SSRs are log2(1 + HPCmax) bits wide. The switch allocation local (SA-L) grant sends SSRs. The switch allocation global (SA-G) unit sets BWena, BMsel, and XBsel based on the SSRs from 0-hop, 1-hop, 2-hop, and 3-hop neighbors.

switch allocation in Smart occurs over two stages: switch allocation local (SA-L) and switch allocation global (SA-G). SA-L is identical to the SA stage in the conventional pipeline (described earlier): every start router chooses a winner for each output port from among its buffered (local) flits. In the next cycle, each output port winner first broadcasts a Smart-hop setup request (SSR) up to HPCmax hops from that output port. These SSRs are dedicated repeated wires (which are inherently multidrop) on the control path that connect every router to a neighborhood of up to HPCmax (see Figure 4). SSRs are log2(1 + HPCmax) bits wide, and carry the length (in hops) up to which the winning flit wishes to go. For instance, SSR = 2 indicates a 2-hop path request. Each flit tries to go as close as possible to its destination router; hence, SSR = min(HPCmax, Hremaining).

During SA-G, all intermediate routers arbitrate among the SSRs they receive, to set the BWena, BMsel, and XBsel signals. The arbiters guarantee that only one flit will be allowed access to any particular I/O port of the crossbar. In the next cycle (ST + LT), SA-L winners that also won SA-G at their start routers traverse the crossbar and links up to multiple hops till they are stopped by BWena at some router. Thus, flits spend at least two cycles (SA-L and SA-G) at a start router before they can use the switch. SSR traversal and SA-G occur serially within the same cycle.
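The SSR encoding is compact enough to state directly in code; the sketch below (ours, with illustrative names) shows the value an SA-L winner broadcasts and the resulting SSR width.

    import math

    HPC_MAX = 8   # design-time parameter; 8 is the value used in the article's evaluation

    def ssr_value(hops_remaining):
        # A flit asks to travel as far toward its destination as one cycle allows.
        return min(HPC_MAX, hops_remaining)

    def ssr_width_bits(hpc_max=HPC_MAX):
        # SSRs encode a length of 0..HPC_max, hence log2(1 + HPC_max) bits.
        return math.ceil(math.log2(1 + hpc_max))

    print(ssr_value(2))       # 2: request a 2-hop path (stop two hops away)
    print(ssr_value(14))      # 8: capped at HPC_MAX
    print(ssr_width_bits())   # 4 bits when HPC_MAX = 8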
Single-cycle multihop paths are opportunistic, not guaranteed; flits can end up getting prematurely stopped (that is, before their SSR length) depending on the SA-G results at different routers, which depend on contention.
We illustrate all this with examples. In Figure 5, router R2 has FlitA and FlitB buffered at Cin, and FlitC and FlitD buffered at Win, all requesting Eout. Suppose FlitD wins SA-L during Cycle 0. In Cycle 1, it sends SSRD = 2 (that is, a request to stop at R4) out of Eout to routers R3, R4, and R5. SA-G is performed at each router. At R2, which is 0 hops away (< SSRD), BMsel = local and XBsel = Win_xb → Eout_xb. At R3, which is 1 hop away (< SSRD), BMsel = bypass and XBsel = Win_xb → Eout_xb. At R4, which is 2 hops away (= SSRD), BWena = high. At R5, which is 3 hops away (> SSRD), SSRD is ignored. In Cycle 2, FlitD traverses the crossbars and links at R2 and R3, and is stopped and buffered at R4.

What happens if there are competing SSRs? In the same example, suppose R0 also wants to send FlitE 3 hops away to R3, as shown in Figure 6. In Cycle 1, R2 sends out SSRD as before; in addition, R0 sends SSRE = 3 out of Eout to R1, R2, and R3. Now, at R2, there is a conflict between SSRD and SSRE for the Win_xb and Eout_xb ports of the crossbar. SA-G priority decides which SSR wins the crossbar. For example, the PrioLocal scheme gives highest priority to the local (buffered) flit, followed by the flit from the neighboring router, followed by the flit from the router two hops away, and so on; so FlitE loses to FlitD. Figure 6 shows the

Figure 5. Smart example: no SSR conflict. Cycle 1: FlitD (assumed to have won SA-L in Cycle 0) sends SSRD = 2, that is, a request to bypass R3 and stop at R4. The SA-G units at R2, R3, and R4 set up a single-cycle multihop bypass path. Cycle 2: FlitD starts at R2 and gets latched at R4.

Figure 6. Smart example: SSR conflict with PrioLocal. The PrioLocal scheme gives highest priority to the local (buffered) flit, then the flit from the neighboring router, followed by the flit from the router two hops away, and so on. FlitE from R0 is prematurely stopped at R2, before its intended destination R3, to allow R2 to send its own local FlitD on its East output link.

values of BWena, BMsel, and XBsel at each router for this priority. In Cycle 2, FlitE traverses the crossbar and link at R0 and R1, but is stopped and buffered at R2. FlitD traverses the crossbars and links at R2 and R3 and is stopped and buffered at R4. FlitE now goes through BW and SA-L at R2 before it can send a new SSR and continue its network traversal. A free VC with an empty buffer slot is guaranteed to exist whenever a flit is made to stop, as we will explain later. An alternate priority, PrioBypass, prioritizes flits from the furthest router over the flits from the nearer ones. Here, in Cycle 2, FlitE would traverse all the way from R0 to R3, and FlitD would be stalled.
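The SA-G decision for one output port reduces to picking the highest-priority SSR; the sketch below (ours, not the authors' implementation) encodes PrioLocal as "nearest requester wins", which reproduces the Figure 6 outcome.

    def sa_g_winner_priolocal(requests):
        # requests: list of (hops_away, ssr_length) tuples competing for one output port.
        # Under PrioLocal, the request from the nearest router wins (0 = this router's
        # own buffered flit). A losing flit that was trying to pass through this router
        # is prematurely stopped here (its input port raises BWena).
        return min(requests, key=lambda r: r[0])

    # Figure 6, at router R2's East output port:
    # FlitD is local (0 hops away, SSR = 2); FlitE's SSR arrives from R0, 2 hops away, SSR = 3.
    print(sa_g_winner_priolocal([(0, 2), (2, 3)]))   # (0, 2): FlitD wins, FlitE buffers at R2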

False positives and false negatives


Can a flit arrive at a router, even though the router isn't expecting it (that is, a false positive)? The answer is no. For correctness, all routers must enforce the same SA-G priority (PrioLocal or PrioBypass), thus ensuring the same relative priority between the SSRs. All flits that arrive at a router are expected and will stop or bypass on the basis of their SSRs' success in the previous cycle. Different routers choosing different SA-G priorities could result in misrouting beyond the allowed HPCmax hops.

Can a flit not arrive at a router, even though the router is expecting it (that is, a false negative)? Yes. It is possible for the router to be set up for stop or bypass for some flit, but no flit arrives. This can happen if that flit was forced to prematurely stop at some prior router, owing to some SSR interaction at that router that the current router is not aware of. For example, suppose a local flit at Win at R1 wants to eject out of Cout. A flit from R0 will prematurely stop at R1's Win port if PrioLocal is implemented. However, R2 will still be expecting the flit from R0 to arrive (the valid-bit from the flit

is thus used in addition to BWena when deciding whether to buffer). Unlike false positives, this is not a correctness issue but rather a performance (throughput) issue, because some links go idle when they could have been used by other flits if more global information were available.

Ordering
In Smart, any flit can be prematurely stopped on the basis of the interaction of SSRs that cycle. We must ensure that this does not result in reordering between flits of the same packet, or between flits from the same source (if point-to-point ordering is required in the coherence protocol).

The first constraint is in routing (relevant to 2D topologies). Multiflit packets and point-to-point ordered virtual networks should use only deterministic routes to ensure that prematurely buffered flits do not end up choosing alternate routes while bypassing flits continue on the old route.

The second constraint is in SA-G priority. Every input port has a bit to track if there is a prematurely stopped flit among its buffered flits. When an SSR is received at an input port, and there is either a prematurely buffered head/body flit or a prematurely buffered flit within a point-to-point ordered virtual network, the incoming flit is stopped.
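The second constraint is essentially a one-line check per input port; the sketch below (ours) spells it out, reading the ordered-vnet case as applying to flits of that virtual network.

    def incoming_flit_must_stop(premature_head_or_body_here,
                                premature_ordered_flit_here,
                                incoming_in_ordered_vnet):
        # An SSR's flit is stopped at this input port if letting it bypass could
        # overtake a prematurely stopped flit whose order must be preserved.
        if premature_head_or_body_here:
            return True
        return incoming_in_ordered_vnet and premature_ordered_flit_here

    print(incoming_flit_must_stop(True, False, False))   # True: premature head/body buffered here
    print(incoming_flit_must_stop(False, True, True))    # True: ordered flit is waiting here
    print(incoming_flit_must_stop(False, False, True))   # False: safe to bypass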

Guaranteeing free VC with buffers at stop routers

In a conventional network, a router's output port tracks the IDs of all free VCs at the neighbor's input port. A buffered head flit chooses a free VC ID for its next router (neighbor) before it leaves the router. The neighbor signals back when that VC ID becomes free. In a Smart network, the challenge is that the next router could be any router that can be reached within a cycle. A flit at a start router choosing the VC ID before it leaves will not work because it is not guaranteed to reach its presumed next router, and multiple flits at different start routers might end up choosing the same VC ID. Instead, we let the VC selection occur at the stop router. Every Smart router receives 1 bit from each neighbor to signal if at least one VC is free. (If the router has multiple virtual networks, or vnets, for the coherence protocol, we need a 1-bit free VC signal from the neighbors for each vnet. The SSR also needs to carry the vnet number, so that the intermediate routers will know which vnet's free VC signal to look at.)

During SA-G, if an SSR requests an output port without a free VC, BWena is made high and the corresponding flit is buffered. This solution does not add any extra multihop wires for VC signaling. The signaling is still between neighbors. Moreover, it ensures that a head flit comes into a router's input port only if that input port has free VCs; otherwise, the flit is stopped at the previous router.

This solution is conservative because a flit will be stopped prematurely if the neighbor's input port does not have free VCs, even if there was no competing SSR at the neighbor and the flit would have bypassed it without having to stop.
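The extra state this requires is just one free-VC bit per neighbor per vnet; a minimal sketch (ours, with illustrative names) of the SA-G check follows.

    def can_forward_to_neighbor(ssr_vnet, neighbor_free_vc_bits):
        # neighbor_free_vc_bits: dict {vnet: bool}, driven by the 1-bit "at least one
        # free VC" signals from the downstream neighbor. If the bit for the SSR's vnet
        # is low, the flit is stopped at this router instead (BWena is raised).
        return neighbor_free_vc_bits.get(ssr_vnet, False)

    print(can_forward_to_neighbor(1, {0: True, 1: False}))   # False: stop and buffer the flit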
Body/tail flits identify which VC to go to at the stop router by using their injection_router ID. Every input port maintains a table to map a VC ID to an injection router ID (the table size equals the number of multiflit VCs at that input port). Whenever the head flit is allocated a VC, this table is updated. The injection_router ID entry is cleared when the Tail arrives. The VC is freed when the Tail leaves. We implement private buffers per VC, with depth equal to the maximum number of flits in the packet (that is, virtual cut-through) to ensure that the body/tail will always have a free buffer in its VC.

What if two body/tail flits with the same injection_router ID arrive at a router? We guarantee that this will never occur by forcing all flits of a packet to leave from a router's output port before flits from another packet can leave from that output port. This guarantees a unique mapping from injection_router ID to VC ID in the table at every router's input port.

What if a head bypasses, but the body/tail is prematurely stopped? The body/tail still must identify a VC ID to get buffered in. To ensure that it does have a VC, we make the head flit reserve a VC not just at its stop router, but also at all of its intermediate routers, even though it does not stop there. This is done from the bypassing flit's valid, type, and injection_router fields. The tail flit frees the VCs at all the intermediate routers. Thus, for multiflit packets, VCs are reserved at all routers, just like the baseline. But the

advantage of Smart is that VCs are reserved and freed at multiple routers within the same cycle, reducing the buffer turnaround time.
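The per-input-port bookkeeping described above amounts to a small table keyed by injection_router ID; the class below is our illustrative sketch (not the authors' code) of its head/body/tail updates.

    class VcMapTable:
        # Maps injection_router ID -> VC ID at one input port. The mapping is unique
        # because all flits of a packet drain from an output port before another
        # packet may use that port.
        def __init__(self):
            self.vc_of = {}

        def on_head(self, injection_router, vc_id):
            self.vc_of[injection_router] = vc_id          # head flit reserves the VC

        def vc_for_body_or_tail(self, injection_router):
            return self.vc_of[injection_router]           # body/tail flits look up their VC

        def on_tail(self, injection_router):
            return self.vc_of.pop(injection_router)       # entry cleared; VC freed when the tail leaves

    table = VcMapTable()
    table.on_head(injection_router=5, vc_id=2)
    print(table.vc_for_body_or_tail(5))   # 2
    print(table.on_tail(5))               # 2: entry removed as the tail departs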

Additional optimizations
We optimize Smart further to push it toward the ideal (TN = 1) NoC.

Bypassing the destination router. So far, we have assumed that a flit starting at an injection router traverses one (or more) Smart-hops until it reaches the destination router, where it gets buffered and requests the Cout port. We add an extra ejection-bit in the SSR to indicate whether the requested stop router corresponds to the destination router for the packet, and not any intermediate router on the route. If a router receives an SSR from H hops away with value H (that is, a request to stop there), H < HPCmax, and the ejection-bit is high, it arbitrates for the Cout port during SA-G. If it loses, BWena is made high.
Bypassing SA-L at low load. We add no-load bypassing1 to the Smart router. If a flit comes into a router with an empty input port and no SA-L winner for its output port for that cycle, it sends SSRs directly, in parallel to getting buffered, without having to go through SA-L. This reduces tr at lightly loaded start routers to 2, instead of 3, as shown in Figure 3b for Router_n+i.

With both ejection and no-load bypass enabled, if HPCmax is larger than the maximum hops in any route, a flit will only spend two cycles in the entire Smart network in the best case (one cycle for SSR and one for ST + LT all the way to the destination NIC).

Smart in a k-ary 2-mesh


We demonstrate how Smart works in a k-ary 2-mesh. Each router has five ports: West, East, North, South, and Core.

Bypassing routers along a dimension


We start with a design in which we do not allow bypass at turns; that is, all flits must stop at their turn routers. We reuse Smart_1D, described for a k-ary 1-mesh, in a k-ary 2-mesh. The extra router ports only increase the complexity of the SA-L stage, since there are multiple local contenders for each output port. Once each router chooses SA-L winners, SA-G remains identical to our earlier description. Each output port has multidrop SSR wires spanning up to HPCmax routers along that dimension. Each input port of a router receives HPCmax sets of SSR wires, one from each router. The SSR requests a stop or a bypass along that dimension. Flits with turning routes perform their traversal one dimension at a time, trying to bypass as many routers as possible, and stopping at the turn routers.

Bypassing routers at turns


In a k-ary 2-mesh topology, all routers within an HPCmax neighborhood can be reached within a cycle, as shown in Figure 7a by the shaded diamond. We now describe Smart_2D, which lets flits bypass both the routers along a dimension and the turn routers. We add dedicated SSR links for each possible XY/YX path from every router to its HPCmax neighbors. Figure 7a shows that the Eout port has five SSR links, in comparison to only one in the Smart_1D design. During the routing stage, the flit chooses one of these possible paths. During the SA-G stage, the router broadcasts one SSR out of each output port, on one of these possible paths. We allow only one turn within each HPCmax quadrant to simplify the SSR signaling.

In the Smart_2D design, there can be more than one SSR from H hops away, as shown in the example in Figure 7b for router Rj; it receives SSRs from routers Rm and Rn, which are both one hop away. Router Rk receives the same SSRs. Rj and Rk both need to prioritize the same SSRs to not create false positives (for example, if Rj prioritizes the SSR from Rn and Rk prioritizes the SSR from Rm, the flit from Rn will get misrouted). To arbitrate between SSRs from routers that are the same distance away, we add a second level of priority based on direction. We arbitrarily choose straight-hops > left-hops > right-hops, where straight, left, and right are relative to the I/O port. Figures 7c and 7d plot contours through routers that are the same number of hops away and highlight each router's relative priority. For the intermediate router Rj in Figure 7b, the SSR from Rm will have higher priority (1₀) over the one from Rn (1₁) for the Nout port, as it is going straight, based on Figure 7c. Similarly, at Rk, the SSR from Rm will have higher priority (2₀) over the one
.............................................................

MAY/JUNE 2014

micro
IEEE

M
q
M
q

M
q

M
q
MQmags
q

51

Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page

M
q
M
q

M
q

M
q
MQmags
q

THE WORLDS NEWSSTAND

micro
IEEE

Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page

M
q
M
q

M
q

M
q
MQmags
q

THE WORLDS NEWSSTAND

..............................................................................................................................................................................................
TOP PICKS

Figure 7. Smart_2D: SSRs and their SA-G priorities. k-ary 2-mesh with SSR wires from the shaded start router (a). Conflict between two SSRs for the Nout port (b). Fixed priority at the Nout port of an intermediate router (c). Fixed priority at the Sin port of an intermediate router (d). SSR priority = distance, then direction (0 > 1 > 2 ...).

from Rn (2₁) for the Sin port, based on Figure 7d. Thus, both routers Rj and Rk will unambiguously prioritize the flit from Rm to use the links, whereas the flit from Rn will stop at router Rj. We can also infer from Figures 7c and 7d that every router sees the same relative priority for SSRs based on distance and direction, thus guaranteeing no false positives.
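Because every router applies the same two-level key, the arbitration can be summarized as a comparison on (distance, direction); the sketch below (ours) reproduces the Figure 7b decision, with the turning direction chosen only for illustration.

    # Lower tuples win: first by distance from the requester (0 = local flit),
    # then by direction relative to the output port (straight beats left beats right).
    DIRECTION_RANK = {"straight": 0, "left": 1, "right": 2}

    def smart2d_priority(hops_away, direction):
        return (hops_away, DIRECTION_RANK[direction])

    # At intermediate router Rj, for its North output: Rm's SSR (1 hop, going straight)
    # beats Rn's SSR (1 hop, turning), so Rn's flit stops at Rj.
    requests = [("Rm", smart2d_priority(1, "straight")),
                ("Rn", smart2d_priority(1, "left"))]
    print(min(requests, key=lambda r: r[1]))   # ('Rm', (1, 0))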

Smart implementation

The Smart control path (see Figure 4) consists of HPCmax hops of repeated-wire delay (SSR traversal), followed by logic gate delay (SA-G). This gave an HPCmax of 13 for Smart_1D and 9 for Smart_2D, following the methodology described earlier. The Smart datapath (see Figure 5) is modeled as a series of a 128-bit 2:1 mux (for bypass), followed by a 4:1 mux (crossbar), followed by a 128-bit 1-mm link. This gave an HPCmax of 11. Picking the lowest of the two gave us an HPCmax of 11 for Smart_1D and 9 for Smart_2D. In our evaluations, we set HPCmax = 8, which allows bypass of all routers along the dimension and the ejection, in our target 8-ary 2-mesh.

Evaluation

We use the GEMS12 and Garnet13 infrastructure for all our evaluations, which provides a cycle-accurate timing model. All evaluations are for an 8 × 8 mesh. We assume 1-GHz frequency and 45-nm technology. The baseline (tr = 1) in all our runs is a state-of-the-art NoC with one-cycle routers. We also model ideal (TN = 1); this is an ideal but impractical fully connected NoC, in which every flit is magically sent from the source NIC to its destination NIC in one cycle with zero contention. All Smart designs are called Smart-HPCmax_1D/2D, and PrioLocal is assumed.

Synthetic traffic

We start by running Smart with synthetic traffic patterns. We inject 1-flit packets to first understand the benefits of Smart without secondary effects due to flit serialization and VC allocation across multiple routers. For the same reason, we also give enough VCs (12, derived empirically) to allow both the baseline and Smart to be limited by links, rather than VCs, for throughput.

Smart across different traffic patterns. Figure 8 compares the performance of three Smart designs: Smart-8_1D and Smart-8_2D (which are both achievable designs), and Smart-15_2D, which reflects the best that Smart can do in an 8 × 8 mesh (with maximum possible hops = 15), against the baseline and ideal. The striking feature about Smart is that it pushes low-load latency to four and two cycles, for Smart_1D and Smart_2D, respectively, across all traffic patterns, unlike the baseline, for which low-load latency is a function of the average hops. Thus, Smart truly breaks the locality barrier. Smart-8_2D achieves most of the benefit of Smart-15_2D for all patterns, except Bit Complement (BC), since average hop counts are ≤ 8 for an 8 × 8 mesh.

Figure 8. Smart with synthetic traffic: average flit latency (cycles) versus injection rate (flits/node/cycle) for the baseline (tr = 1), Smart-8_1D, Smart-8_2D, Smart-15_2D, and ideal (TN = 1). Uniform random (average hops = 5.33) (a). Bit complement (average hops = 8) (b). Transpose (average hops = 6) (c).

Impact of HPCmax. Next, we study the impact of HPCmax on performance. We plot the average flit latency for BC traffic (which has high across-chip communication) for HPCmax from 1 to 12, across 1D and 2D, in Figure 9. Smart-1_1D is identical to the baseline (tr = 1) network (as it does not need SA-G). We make two key observations. First, at an HPCmax of 8, Smart shows a 5.4 times reduction in latency. This means that a 1-GHz Smart NoC can be beaten by an NoC with one-cycle routers only if the latter can run at a speed greater than 5.4 GHz, making a strong case for Smart from both a performance and power perspective. A high HPCmax would be available in many-core chips with small low-frequency cores for the low-power embedded domain. Second, at a low HPCmax of 2 and 4, Smart gives a 1.8 to 3 times reduction in low-load latency compared to HPCmax of 1. This makes a Smart-like design a better choice than an NoC with one-cycle routers even in multicores with large high-frequency cores that will limit the value of HPCmax. Note also that as we scale to smaller feature sizes, cores shrink while die sizes remain unchanged, so the same interconnect length will translate to a larger HPCmax.

Figure 9. Impact of HPCmax with Bit Complement traffic: average flit latency (cycles) versus injection rate (flits/node/cycle) for Smart designs with HPCmax from 1 to 12 (1D and 2D). HPCmax of 2 and 4 gives a 1.8 to 3 times reduction in low-load latency compared to HPCmax of 1 (that is, baseline tr = 1).

Full-system traffic

Full-system simulations use Wind River Simics within GEMS,12 with 64 in-order Sparc cores. We model 32-Kbyte private instruction and data L1 caches, and a 1-Mbyte L2 cache slice per tile. We evaluate the parallel sections of Splash-214 and Parsec15 for both private and shared L2, over a MOESI directory protocol. Each run consists of 64 threads of the application running on our chip multiprocessor.

Figure 10 shows that Smart-8_1D and Smart-8_2D lower application runtime by 26 and 27 percent, respectively, on average, for a private L2 where only L2 misses traverse the network; this is only 8 percent away from an ideal (TN = 1) network. The runtime reduction goes up to 49 and 52 percent, respectively, with a shared L2 design, where both L1 and L2 misses traverse the network (making network latency even more critical), which is 9 percent off from an ideal (TN = 1) network. Smart-15_2D does not give any significant runtime benefit over Smart-8_2D.

Figure 10. Full-system application runtime with Smart for the Splash-2 and Parsec benchmarks, normalized to the runtime with the baseline (tr = 1). Private L2 cache per tile (a). Shared L2 cache slice per tile (b). In Shared L2, L1 and L2 misses traverse the network to a remote node, making on-chip network latency more critical than in Private L2.

Aggressive NoC pipeline optimizations can lower router delays to just one cycle. However, this is not good enough for large networks with multihop paths. The solution of adding explicit, fast physical channels to bypass routers comes with its own set of problems in terms of layout complexity, area, and power. We present Smart, a solution to traverse multihop paths within a single cycle by virtually bypassing all routers along the route, without adding any physical channels on the datapath. In the best case, the network latency with Smart, from Equation 2, is just 2 cycles if there is no contention (that is, for all h, tc(h) = 0) and H is less than HPCmax. In the worst case of tc(h) > 0 at every hop h, the achieved HPC will be 1, which is the same as the baseline.

Although transistors become faster with technology scaling, wires do not. This trend of communication becoming slower relative to logic has been projected in the past as a motivation to keep global chip-wide communication to a minimum and heavily optimize for locality. This work rebuts this conclusion. We project that communication (wire) delay in cycles will actually remain relatively constant as technology scales, as chip dimensions are not increasing due to yield, and clock frequencies have also plateaued owing to the power wall. In addition, tile sizes tend to go down as technology scales. The same wire delay will thus translate to a higher HPCmax as technology scales, making Smart even more attractive. Locality will no longer be that critical with Smart NoCs.

This work opens up a plethora of research opportunities in circuits, NoC architectures, and locality-oblivious many-core architectures to optimize and leverage Smart NoCs. MICRO

Acknowledgments
We thank Sunghyun Park from the Massachusetts Institute of Technology and Michael
Pellauer from Intel for useful insights on the
interconnect and pipeline. We acknowledge
the support of DARPA UHPC, SMART
LEES, and MARCO C-FAR.

References

1. W. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann, 2003.
2. A. Kumar et al., "A 4.6Tbits/s 3.6GHz Single-Cycle NoC Router with a Novel Switch Allocator in 65nm CMOS," Proc. 25th Int'l Conf. Computer Design, 2007, pp. 63-70.
3. R. Mullins et al., "Low-Latency Virtual-Channel Routers for On-Chip Networks," Proc. 31st Ann. Int'l Symp. Computer Architecture, 2004, pp. 188-197.
4. H. Matsutani et al., "Prediction Router: Yet Another Low Latency On-Chip Router Architecture," Proc. IEEE 15th Int'l Symp. High Performance Computer Architecture (HPCA 09), 2009, pp. 367-378.
5. A. Kumar et al., "Token Flow Control," Proc. 41st IEEE/ACM Int'l Symp. Microarchitecture, 2008, pp. 342-353.
6. S. Park et al., "Approaching the Theoretical Limits of a Mesh NoC with a 16-Node Chip Prototype in 45nm SOI," Proc. 49th Ann. Design Automation Conf. (DAC 12), 2012, pp. 398-405.
7. Y. Hoskote et al., "A 5-GHz Mesh Interconnect for a Teraflops Processor," IEEE Micro, vol. 27, no. 5, 2007, pp. 51-61.
8. J. Howard et al., "A 48-core IA-32 Message-Passing Processor with DVFS in 45nm CMOS," Proc. IEEE Int'l Solid-State Circuits Conf., 2010, pp. 108-109.
9. B. Kim and V. Stojanovic, "Equalized Interconnects for On-Chip Networks: Modeling and Optimization Framework," Proc. IEEE/ACM Int'l Conf. Computer-Aided Design (ICCAD 07), 2007, pp. 552-559.
10. J. Rabaey and A. Chandrakasan, Digital Integrated Circuits: A Design Perspective, Prentice Hall, 2002.
11. C. Sun et al., "DSENT - A Tool Connecting Emerging Photonics with Electronics for Opto-Electronic Networks-on-Chip Modeling," Proc. IEEE/ACM 6th Int'l Symp. Networks-on-Chip, 2012, pp. 201-210.
12. M.M.K. Martin et al., "Multifacet's General Execution-Driven Multiprocessor Simulator (GEMS) Toolset," ACM SIGARCH Computer Architecture News, vol. 33, no. 4, 2005, pp. 92-99.
13. N. Agarwal et al., "GARNET: A Detailed On-Chip Network Model inside a Full-System Simulator," Proc. IEEE Int'l Symp. Performance Analysis of Systems and Software, 2009, pp. 33-42.
14. S.C. Woo et al., "The SPLASH-2 Programs: Characterization and Methodological Considerations," Proc. 22nd Ann. Int'l Symp. Computer Architecture, 1995, pp. 24-36.
15. C. Bienia et al., "The PARSEC Benchmark Suite: Characterization and Architectural Implications," Proc. 17th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 08), 2008, pp. 72-81.

Tushar Krishna is a researcher in the VSSAD group at Intel. His research focuses on on-chip interconnection networks for homogeneous and heterogeneous many-core systems. Krishna has a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology, where he performed the work for this article.

Chia-Hsin Owen Chen is a doctoral candidate in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology. His research interests include system-level power and performance modeling and analysis, and on-chip networks. Chen has an SM in electrical engineering and computer science from the Massachusetts Institute of Technology.

Woo-Cheol Kwon is a doctoral candidate in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology. His research interests include multicore processor architectures and networks on chip. Kwon has an MS in computer science from the Korea Advanced Institute of Science & Technology (KAIST).

Li-Shiuan Peh is a professor in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology. Her research focuses on networked computing in many-core chips and mobile wireless systems. Peh has a PhD in computer science from Stanford University. She is a member of IEEE and the ACM.

Direct questions and comments about this article to Tushar Krishna, 77 Reed Road, HD2-330, Hudson, MA 01749; tushar.krishna@intel.com.


NETWORKS ON CHIP WITH PROVABLE SECURITY PROPERTIES

IN SYSTEMS WHERE A LACK OF SAFETY OR SECURITY GUARANTEES CAN BE CATASTROPHIC OR EVEN FATAL, NONINTERFERENCE IS USED TO SEPARATE DOMAINS HANDLING CRITICAL (OR CONFIDENTIAL) INFORMATION FROM THOSE PROCESSING NORMAL (OR UNCLASSIFIED) DATA FOR FAULT CONTAINMENT AND EASE OF VERIFICATION. SURFNOC SIGNIFICANTLY REDUCES THE LATENCY INCURRED BY STRICT TEMPORAL PARTITIONING.


As multicore processors find increasing adoption in domains such as aerospace and medical devices, where failures have the potential to be catastrophic, strong performance isolation and security become first-class design constraints. When cores are used to run separate pieces of the system, strong time and space partitioning can help provide such guarantees. However, as the number of partitions or the asymmetry in partition bandwidth allocations grows, the additional latency incurred by time multiplexing the network can significantly impact performance.

The difficulty in designing such strong separation functionality into typical networks on chip (NoCs) is that they have many internal resources that are shared between packets from different domains, which we would otherwise wish to keep separate. These resources include the buffers holding the packets, the crossbar switches, and the individual ports and channels. Such resource contention introduces interference between these different domains, which can create a performance impact on some flows, pose a security threat by creating an opportunity for timing channels,1 and generally complicate the final verification and certification process

of the system because all of the ways in which that interaction might occur must be accounted for. Noninterference means that the injection of packets from one domain can never have any effect on the delivery of packets from other domains, even in their timing.

These concerns are similar to, but distinct from, the problem of providing quality-of-service guarantees. Although QoS can minimize the performance impact of sharing between domains by providing a minimum guaranteed level of service for each domain (or class),2-5 as Wang and Suh show, QoS techniques still allow some degree of timing variations and thus do not truly support noninterference.1 The only way to be certain that the domains are noninterfering is to statically schedule them on the network over time. However, a straightforward application of time multiplexing leads to significant increases in latencies because each link in the network is now time-multiplexed between many domains.

The core idea behind our approach, for meshes and tori, is that if a strictly time-multiplexed link is seen as an oscillating behavior, we can stagger the phases of these oscillations across the network such that a set of waves

Hassan M.G. Wassel


Google
Ying Gao
University of California,
Santa Barbara
Jason K. Oberg
University of California,
San Diego
Ted Huffmire
Naval Postgraduate School
Ryan Kastner
University of California,
San Diego
Frederic T. Chong
Timothy Sherwood
University of California,
Santa Barbara


is created. As these waves traverse the network, they provide an opportunity for packets of the corresponding domain to travel unimpeded along with these waves (thus avoiding excessive latency), while still requiring no dynamic scheduling between domains (thus preventing timing corruption or information leakage). Channels in the same dimension and direction appear to propagate different domains such that after passing through the pipeline of the router, the channel can forward a packet coming from the same dimension and domain without any additional wait (unless there is contention from packets of the same domain). In this way, packets surf the waves in each dimension. We identify the many potential challenges of achieving noninterference in a modern NoC router microarchitecture using gate-level analysis, discuss the details and ramifications of our surf scheduling methodology, and demonstrate that our approach truly does not allow even cycle-level cross-domain interference. (For information on previous research, see the "Related Work in Noninterference" sidebar.)

SurfNoC scheduling
The straightforward way to support time-division multiplexing (TDM) is to operate the whole network in time slices that are divided between application domains. That is, a packet waits at each hop until the network begins forwarding packets from its domain. This approach leads to a zero-load latency T0 that is proportional to the number of application domains D, pipeline depth P, and the number of hops H, as shown in Equation 1:

T0 = HP + H(D - 1)    (1)

This solution might work efficiently for two to four domains, but in high-assurance applications, as many as tens or hundreds of domains can be found.6
The most basic routing algorithm in
meshes and tori (k-ary n-cube networks) is
dimension-ordered routing. That is, a packet
walks through a dimension until it cannot
move further without going farther from the
destination, and then transfers to another
dimension. Thus, routing is linear in each
dimension, which provides an opportunity to
reduce wait time between hops. In SurfNoC

scheduling, different routers (in fact, different ports of the same router) can forward
packets from different domains in the same
cycle. In this schedule, a packet waits until it
can be forwarded in one dimension (that is,
its output channel is forwarding packets from
its domain in this cycle) and then does not
experience any wait at any downstream
router in this dimension (assuming there is
no contention from packets from the same
domain). After nishing the rst dimension,
the packet might experience another wait
until it can be forwarded in the next dimension. We call this schedule surf scheduling
because a packet is like a surfer who waits to
ride a wave to some location and then waits
to ride another wave. Equation 2 shows the maximum zero-load latency and clearly shows that the overhead is additive, not multiplicative as in the straightforward approach:

T0max = HP + (n - 1 + 2)(D - 1)    (2)

The term (n - 1 + 2) comes from the n - 1 transitions between dimensions and the two waits during injection and ejection. Note that this is the maximum wait, not the typical one, because the schedule might require less wait.
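To make the contrast concrete, the following minimal sketch (not from the original article; the parameter values are arbitrary assumptions) evaluates Equations 1 and 2 for an example path.

```python
# Minimal sketch of Equations 1 and 2: zero-load latency under straightforward
# TDMA versus the maximum zero-load latency under surf scheduling.
# H = number of hops, P = router pipeline depth, D = number of domains,
# n = number of network dimensions (2 for a 2D mesh or torus).

def tdma_zero_load_latency(H, P, D):
    # Equation 1: every hop can wait up to D - 1 slots for its domain's turn.
    return H * P + H * (D - 1)

def surf_max_zero_load_latency(H, P, D, n=2):
    # Equation 2: waits occur only at injection, ejection, and the n - 1
    # dimension transitions, so the domain overhead is additive, not per hop.
    return H * P + (n - 1 + 2) * (D - 1)

# Example (assumed values): a 14-hop path, 4-cycle routers, 16 domains.
print(tdma_zero_load_latency(14, 4, 16))      # 56 + 14*15 = 266
print(surf_max_zero_load_latency(14, 4, 16))  # 56 + 3*15  = 101
```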
The way to implement these different waves is by scheduling different directions in a router independently, an idea inspired by the dimension-slicing used in dimension-ordered routing in meshes and tori. We used what we call direction-slicing of the pipelines, such that each direction has its own pipeline. This pipeline is a virtual one going through different routers (not in the same router). We will describe this idea in the case of a 2D mesh or torus.
In a 2D mesh or torus, each dimension has two directions (east and west for the x-dimension; north and south for the y-dimension). The pipelines of directions of the same dimension (that is, north, south, east, and west) run in opposite directions, as Figure 1 shows. In this technique, each port of a router is scheduled independently of all other ports in a pipelined way such that the downstream router in the same direction will forward packets from the same domain after P cycles, where P is the router's pipeline depth. These schedules are imposed on each router's

output channels to avoid timing channels based on contention in the allocator (as detailed in the next section).
Figure 1 shows a 16-node 2D mesh schedule of three domains (colored white, gray, and black). There are two waves, southeast (SE) and northwest (NW), running in the mesh. Each channel propagates packets according to the schedule white, white, gray, and black, and repeats. Using such a schedule results in half of the bandwidth being allocated to the white domain, whereas the black and gray domains are each guaranteed only a quarter of the bandwidth. This illustrates the benefit of our schedule in statically allocating nonuniform bandwidth to domains.
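As a small illustration (assumed code, not part of the SurfNoC design), each domain's guaranteed bandwidth share falls directly out of the static slot schedule:

```python
# Tiny sketch: a channel's static slot schedule determines each domain's
# guaranteed share of its bandwidth (schedule as in the Figure 1 example).
from collections import Counter

schedule = ["white", "white", "gray", "black"]
shares = {dom: count / len(schedule) for dom, count in Counter(schedule).items()}
print(shares)  # {'white': 0.5, 'gray': 0.25, 'black': 0.25}
```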

Related Work in Noninterference

Our proposed solution to noninterference in networks on chip


(NoCs) touches on many problems that have been addressed in previous research, such as timing channels in microarchitectures, quality
of service (QoS) in networks on chip, and fault containment in systems
on chip.

Timing channels and noninterference in microarchitecture


Recently there has been renewed interest in the analysis of timing
channel attacks and mitigations through microarchitecture state such
as cache interference1-3 and branch predictors.4,5 One approach to
these problems is a technique that can verify noninterference of hardware/software systems (including high-performance features such as
pipelining and caching) using gate-level information flow tracking.6-8
More recently, researchers proposed a NoC timing-channel protection
scheme for a system with security lattices.9 This priority-based arbitration scheme ensures that information cannot flow from the domain
with a high label to the domain with a low label, but allows for
information flow in the other direction. It can be extended to multiple
security labels as long as they form a lattice. Our proposed technique
enables multiway noninterference in NoCs with low latency (allowing
even packets with noncomparable labels to share the network). However, the techniques working in tandem would still provide the greatest possible benefit.

QoS in networks on chip


Techniques for achieving NoC quality-of-service guarantees have
been proposed based on solutions to analogous problems in macroscale
networks. These approaches attempt to limit the rates of each flow.10-13
However, quality-of-service guarantees are insufficient for timing-channel protection.9 Optimizations that allow flows to go over their designated rate when uncontended and the lack of fault containment are
problematic for high-assurance systems14 because of the high cost of
any unaccounted variation in such systems. Our time-division approach
provides for both fault containment and timing channel elimination.

Noninterference in NoCs
Noninterference in NoCs has been studied in the system-on-chip
domain to provide composability and fault containment as well as predictability of latency for real-time performance guarantees.15,16 Composability means that the system can be analyzed as a set of
independent components, which allows for easier verification of the


overall system without having to verify all possible interleavings


of events in the system. This has been especially critical in high-assurance systems that require a very high level of verification because of the system's safety ramifications. The Æthereal NoC is a
time-division multiplexed (TDM) virtual circuit-switching network that
provides guaranteed services for performance-critical applications
with real-time deadlines and a packet-switched best-effort network
for applications with fewer requirements.17 A lighter version that only
provides guaranteed service was proposed to further simplify
routers.18,19 More recently, Stefan and Goossens proposed a modification to Æthereal that enables multipath routing, both static and dynamic (based on a true random number generator), to enhance security by using a nondeterministic path instead of the source routing used in Æthereal.20 In addition, the need for real-time worst-case
execution time (WCET) analysis inspired a set of works, such as the
T-Crest project (www.t-crest.org), which tries to build a time-predictable multicore for real-time applications. T-Crest researchers proposed an integer programming technique to minimize the static
schedule length of all-to-all circuit switching connections in a TDM
way.21 Dai Bui and his colleagues proposed an on-time NoC using
real-time packet scheduling, admission control, and runtime path configuration.22 We believe that such an admission-control technique is
orthogonal to SurfNoC. SurfNoC can be augmented by an admission
control mechanism to provide time-predictable packet delivery.
Availability is handled in the Tile64 iMesh networks by separating
(and in fact physically separating) the network accessible by user
applications from the network used by the OS and I/O device traffic.23
Our scheme can protect against denial-of-service (DoS) and bandwidth depletion attacks between domains because of the static time
allocation to different domains.
To the best of our knowledge, our scheme is the first to provide a
packet-switched general-purpose network that can guarantee two-way (or multiway) noninterference and timing-channel protection in a
way that both is provable down to the gate-level implementation and
provides low-latency overhead.

References
1. O. Aciicmez, Yet Another MicroArchitectural Attack:
Exploiting I-Cache, Proc. 2007 ACM Workshop Computer
Security Architecture (CSAW 07), 2007, pp. 11-18.



2. Z. Wang and R.B. Lee, New Cache Designs for Thwarting Software Cache-Based Side Channel Attacks, Proc. 34th Ann. Intl Symp. Computer Architecture (ISCA 07), 2007, pp. 494-505.
3. Z. Wang and R.B. Lee, A Novel Cache Architecture with Enhanced Performance and Security, Proc. 41st Ann. IEEE/ACM Intl Symp. Microarchitecture, 2008, pp. 83-93.
4. O. Aciicmez, C.K. Koc, and J.-P. Seifert, Predicting Secret Keys via Branch Prediction, Proc. 7th Cryptographers Track at the RSA Conf. Topics in Cryptology (CT-RSA 07), 2007, pp. 225-242.
5. O. Aciicmez, C.K. Koc, and J.-P. Seifert, On the Power of Simple Branch Prediction Analysis, Proc. 2nd ACM Symp. Information, Computer, and Comm. Security (ASIACCS 07), 2007, pp. 312-320.
6. M. Tiwari et al., Execution Leases: A Hardware-Supported Mechanism for Enforcing Strong Non-interference, Proc. 42nd Ann. IEEE/ACM Intl Symp. Microarchitecture, 2009, pp. 493-504.
7. M. Tiwari et al., Crafting a Usable Microkernel, Processor, and I/O System with Strict and Provable Information Flow Security, Proc. 38th Ann. Intl Symp. Computer Architecture (ISCA 11), 2011, pp. 189-200.
8. M. Tiwari et al., Complete Information Flow Tracking from the Gates Up, Proc. 14th Intl Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 09), 2009, pp. 109-120.
9. Y. Wang and G. Suh, Efficient Timing Channel Protection for On-Chip Networks, Proc. 6th IEEE/ACM Intl Symp. Networks on Chip (NoCS 12), 2012, pp. 142-151.
10. B. Grot et al., Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees, Proc. 38th Ann. Intl Symp. Computer Architecture (ISCA 11), 2011, pp. 401-412.
11. B. Grot, S.W. Keckler, and O. Mutlu, Preemptive Virtual Clock: A Flexible, Efficient, and Cost-Effective QoS Scheme for Networks-on-Chip, Proc. 42nd Ann. IEEE/ACM Intl Symp. Microarchitecture, 2009, pp. 268-279.
12. B. Grot, S.W. Keckler, and O. Mutlu, Topology-Aware Quality-of-Service Support in Highly Integrated Chip Multiprocessors, Proc. Intl Conf. Computer Architecture (ISCA 10), 2010, pp. 357-375.
13. J.W. Lee, M.C. Ng, and K. Asanovic, Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks, Proc. 35th Ann. Intl Symp. Computer Architecture (ISCA 08), 2008, pp. 89-100.
14. J. Rushby, Partitioning in Avionics Architectures: Requirements, Mechanisms, and Assurance, NASA Contractor Report CR-1999-209347, NASA Langley Research Center, 1999.
15. A. Hansson et al., CoMPSoC: A Template for Composable and Predictable Multi-Processor System on Chips, ACM Trans. Design Automation of Electronic Systems, vol. 14, no. 1, 2009, pp. 1-24.
16. R. Obermaisser and O. Hoftberger, Fault Containment in a Reconfigurable Multi-Processor System-on-a-Chip, Proc. IEEE Intl Symp. Industrial Electronics (ISIE 11), 2011, pp. 1561-1568.
17. K. Goossens, J. Dielissen, and A. Radulescu, AEthereal Network on Chip: Concepts, Architectures, and Implementations, Design & Test of Computers, vol. 22, no. 5, 2005, pp. 414-421.
18. A. Hansson, M. Subburaman, and K. Goossens, Aelite: A Flit-Synchronous Network on Chip with Composable and Predictable Services, Proc. Design, Automation & Test in Europe Conf. & Exhibition (DATE 09), 2009, pp. 250-255.
19. R. Stefan et al., A TDM NoC Supporting QoS, Multicast, and Fast Connection Set-Up, Proc. Design, Automation & Test in Europe Conf. & Exhibition (DATE 12), 2012, pp. 1283-1288.
20. R. Stefan and K. Goossens, Enhancing the Security of Time-Division-Multiplexing Networks-on-Chip through the Use of Multipath Routing, Proc. 4th Intl Workshop Network on Chip Architectures (NoCArc 11), 2011, pp. 57-62.
21. M. Schoeberl et al., A Statically Scheduled Time-Division-Multiplexed Network-on-Chip for Real-Time Systems, Proc. IEEE/ACM 6th Intl Symp. Networks-on-Chip (NOCS 12), 2012, pp. 152-160.
22. D. Bui, A. Pinto, and E.A. Lee, On-Time Network On-Chip: Analysis and Architecture, tech. report UCB/EECS-2009-59, Electrical Engineering and Computer Science Dept., Univ. of Calif., Berkeley, 2009.
23. D. Wentzlaff et al., On-Chip Interconnection Architecture of the Tile Processor, IEEE Micro, vol. 27, no. 5, 2007, pp. 15-31.


Router microarchitecture


The microarchitecture of the SurfNoC router has two main goals:

• Ensuring timing-channel-free contention between packets; that is, contention can occur between packets from the same domain but not between packets from different domains.
• Scheduling the output channels of each router in a way that maintains the surf schedule across the whole network.


Figure 1. Surf scheduling in a 16-node 2D mesh with three application domains (denoted by white, gray, and black), assuming single-cycle routers for illustration purposes. The schedule runs as white, white, gray, and black and repeats, giving the white domain half the bandwidth. A packet (the white box under the node S) belongs to the white domain and is sent from the node marked S to the node marked R. The figure contains six consecutive cycles. At T = 1, the packet is forwarded on the S port in the y-dimension (which is scheduled to forward white packets). It keeps moving in the y-dimension until T = 3, when it needs to move in the x-dimension on the W port. The packet waits two cycles (T = 4 and T = 5) until it is the white domain's turn on the W port, and finally it is forwarded to its destination on T = 6. Another wait may happen in the destination router (R) to forward the packet on the ejection port, waiting for the white domain's turn.

To achieve these goals, we used a static partitioning of virtual channels (VCs) and carefully designed the VC and switch allocators to
be timing-channel free. In essence, the VC allocator is divided into several allocators that allocate VCs belonging to the same domain
because VCs are statically divided between
domains. The resources (switch I/O ports) in
the switch allocator, on the other hand, can be
requested from multiple domains. Moreover,
switch inputs are shared between VCs belonging to different domains. To solve this problem,

we use input speedup of D to remove this


contention. This observation was discovered through gate-level information flow analysis of the allocator microarchitecture. The scheduling of output channels is done through masking requests from packets to the switch allocator until its turn to use the output channel arrives in the wave pipeline. Traversing the switch does not require any router modification because all resources have been arbitrated for.
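The request masking can be pictured with a short sketch; the slot schedule, the per-port phase offsets, and the data layout below are illustrative assumptions rather than the article's RTL.

```python
# Sketch of surf-style request masking in front of the switch allocator:
# a request is visible to the allocator only during its domain's slot on the
# requested output port, so arbitration never mixes domains.

def slot_owner(schedule, port_phase, cycle):
    # Domain that owns this output port in this cycle of the wave pipeline.
    return schedule[(cycle + port_phase) % len(schedule)]

def mask_requests(requests, schedule, port_phases, cycle):
    # requests: dict mapping (domain, output_port) -> True for raised requests.
    return {
        (dom, port): True
        for (dom, port), raised in requests.items()
        if raised and slot_owner(schedule, port_phases[port], cycle) == dom
    }

schedule = ["white", "white", "gray", "black"]
port_phases = {"east": 0, "south": 1}      # downstream ports are phase-shifted
requests = {("white", "east"): True, ("gray", "east"): True,
            ("white", "south"): True}
# In cycle 2 the east port serves "gray", so only the gray request survives.
print(mask_requests(requests, schedule, port_phases, cycle=2))
```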

Pipelining and separation


We have so far discussed separation with
respect to each pipeline stage separately, but


the question remains whether pipelining and pipeline stalls can cause interference. We will discuss each pipeline stage, with the basic idea being to ensure that stalls do not induce interference between separate domains.

Buffer write and route computation. BW/RC is the first stage of the pipeline, and because we are assuming credit-based flow control, flits do not enter the router unless there is guaranteed space in the buffer for them. Spatial separation is ensured because VC allocation is done in the upstream router. Route computation can be done in parallel for all flits at the front of all VCs (waiting for RC). No interference can be caused in this stage.

Virtual channel allocation. At the VA stage, all flits send requests to the VC allocator. Using our design, interference can happen between VCs from the same domain but not between channels from distinct domains. Stalled flits resulting from a lack of free VCs (in the downstream router) prevent only flits from the same domain from making progress. This can be ensured by recording the state in the pipeline for each VC; that is, stalls due to VC allocation must be per VC (not per input port).

Switch allocation. SA can fail owing to contending flits for switch ports (limited to VCs from the same domain), which cause stalls in the pipeline. We avoid stalling the whole port (which leads to interference between domains) by having a separate state in the pipeline stage for each VC. SA can also be stalled because of a lack of buffering in the downstream router, that is, waiting for a credit. The effect of this stall is limited to a single VC and can be handled the same way we addressed a stall resulting from a failed SW allocation.

The key idea here is that stalls can affect flits in the stalled stage and all previous stages only from the same VC. Thus, we can guarantee separation because we statically assign VCs to domains.
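A toy model (assumptions only, far simpler than real router state) of the per-VC stall bookkeeping described above:

```python
# Per-VC stall state: a stall recorded for one VC blocks only that VC's flits,
# so flits of other VCs (and hence other domains) keep moving through the port.
from collections import deque

class InputPort:
    def __init__(self, num_vcs):
        self.vcs = [deque() for _ in range(num_vcs)]  # per-VC flit queues
        self.stalled = [False] * num_vcs              # per-VC stall bits

    def advance(self):
        moved = []
        for vc, queue in enumerate(self.vcs):
            if queue and not self.stalled[vc]:
                moved.append((vc, queue.popleft()))
        return moved

port = InputPort(num_vcs=2)
port.vcs[0].append("flit A (domain 0)")
port.vcs[1].append("flit B (domain 1)")
port.stalled[0] = True       # domain 0 waits for a credit or a free VC...
print(port.advance())        # ...but domain 1's flit still advances
```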

Noninterference verification


To prove noninterference between domains of our arbitration scheme, we used gate-level information-flow tracking (GLIFT) logic.7,8 GLIFT logic captures all digital information flows, including implicit and timing-channel flows, because all information flows represent themselves in decision-making circuit constructs, such as multiplexers and arbiters. For example, an arbitration operation leaks information if the control bits of the multiplexers depend on one of the two domains, but it will not leak information (or cause interference) if arbitration is based on a static schedule. GLIFT tracking logic can accurately capture this fact because it is precise (that is, not conservative in the primitive shadow gates but conservative in the compositional shadow circuit) and sound (that is, it will definitely capture illegal information flows). For example, a shadow-AND gate propagates a label of "high" only if the output of the AND gate depends on the "high" input (that is, if one input of a two-input AND gate is a "low" zero, the output is guaranteed to be zero and thus does not depend on the "high" input). GLIFT automatically generates conservative shadow logic that can be used to prove noninterference between domains for a given circuit. Shadow logic is tracking logic used as a verification technique (it is not intended to be part of the final system and thus does not cost any area or power). Using gate-level analysis, we discovered interference in the switch allocator during initial designs of the system. Moreover, contention between domains on the crossbar switch input ports was discovered using the same analysis technique (hence, our input-speedup idea). In essence, we used GLIFT analysis to design the architecture in addition to verifying the final design.
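The shadow-AND rule mentioned above can be written out directly. The snippet below is a minimal functional sketch of that tracking rule (0/1 integers stand in for wires and for low/high labels); it is not the generated shadow netlist.

```python
# Precise GLIFT shadow logic for a 2-input AND gate: the output label is high
# only if a high-labeled input can actually affect the output value.

def glift_and(a, a_label, b, b_label):
    out = a & b
    out_label = (a & b_label) | (b & a_label) | (a_label & b_label)
    return out, out_label

# An untainted 0 on input b forces the output to 0, so the high label of a
# does not propagate:
print(glift_and(a=1, a_label=1, b=0, b_label=0))  # -> (0, 0)
# With b an untainted 1, the output reveals a, so the label propagates:
print(glift_and(a=1, a_label=1, b=1, b_label=0))  # -> (1, 1)
```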
We integrated the scheduler circuit, enforcing the surf schedule, into a Verilog implementation of a switch allocator.9 We used a two-domain allocator that allocates requests of different VCs to output ports. We modified the allocator to have a request per VC rather than per input port (as in the original design9). We synthesized the allocator using the Synopsys Design Compiler, then generated its shadow logic and verified the separation property using simulation of the resulting circuit. We assigned a low label for VC 0 requests and a high label for



VC 1. We tested inputs for VCs sharing the


same input port requesting different and
same output ports. In all cases, grant signals
had the same label as their respective VC, which proves that grants are independent of requests from the other domain. We also reversed the labels of VC 0 (high) and VC 1 (low) to verify that separation holds for the other direction of information flow (domain 0 to domain 1). This proves that the crossbar arbitration, and consequently the sharing of the physical channel, are timing-channel free, which (in addition to static VC allocation) ensures network noninterference. Freedom of two-way information flow, or complete noninterference, was verified.

Evaluation
We evaluate the performance of our SurfNoC scheme and compare the area and power
overhead to a mesh network without noninterference support. A more detailed evaluation can be found in the original paper.10

Experimental setup
We implemented a model of the SurfNoC
router in BookSim 2.0,11 a cycle-level interconnection network simulator. The simulator
is warmed up until steady state is reached and

statistics are reset, then a sample of the packets is measured from the time it enters the
source queue until it is received. For latency
measurements, the simulation runs until all
packets under measurement leave the network. Table 1 lists the simulation parameters
used for different schemes. We evaluated four
schemes, two that do not provide separation
guarantees, and two that support strong separation. The nonseparation baselines are an
input-queued router with minimal resources,
which achieves almost 40 percent saturation
throughput (baseline-small), and a similar
router that has many more resources (buffers
and input-speedup in the crossbar switch),
which we call baseline-fast. We used two
baselines because the separation-supporting
router includes more resources and would
achieve more throughput than a baseline
with minimal area, which will hide the lost
throughput due to the static scheduling. The
noninterference-supporting schemes are a
straightforward time-division multiple access
(TDMA), where the whole network forwards
packets from the same domain, and an
input-queued router, which enforces the surf
schedule (Surf). Table 2 shows the different configurations used for different numbers of domains for Surf and TDMA.

Table 1. Simulation parameters.

Parameter                 Baseline-small            Baseline-fast    Surf and TDMA
Virtual channels (VCs)    12                        32               See Table 2
Buffers per VC            4                         4                See Table 2
Input speedup             1                         32               See Table 2
Flits per packet
Router delay              4 cycles
SW and VC allocators      Separable (input-first)
Routing                   Dimension-ordered routing

Table 2. Different configurations of partitioned schemes.

Parameter               16 domains    32 domains
No. of VCs per port     16            32
No. of flits per VC     8             4
Input speedup           16            32


Figure 2. Zero-load latency (cycles) for different network sizes and a different number of security domains: 64 nodes (a) and 256 nodes (b). The two baselines overlap because zero-load latency does not depend on buffers and crossbar input speedup.

Impact on latency


We first examine the impact of our noninterference support on latency with a different number of domains, and a different number of nodes under the uniform random traffic pattern. To understand the effect of TDM of channels, we measure zero-load latency (latency at an offered load of 0.1 percent of capacity for only one domain) and plot it for different numbers of domains in Figure 2. In this figure, we plot latency in cycles (y-axis) versus number of domains on the x-axis for network sizes of 64 nodes (Figure 2a) and

256 nodes (Figure 2b). It is clear that the


latency overhead of Surf scales much better
than TDMA for the same network size (for example, the overhead is reduced from 66 (19.1) to 19 (4.6) cycles, by 71.3 percent (75.8 percent), for network sizes of 64 nodes with 16 (4) domains). The savings are even greater (up to 84.7 percent) for a 256-node network.
We can see that there is one exception to
this reduction in latency. It is a subtle case
that occurs only for five domains, because the
packet leaves the router after one cycle of
switch traversal (ST), spends one cycle for
link traversal (LT), and after two cycles of
buffer write (BW) and VA in the upstream
router (a total of four cycles during which the
upstream router propagates packets from
other domains), it becomes ready for SA
without any wait using TDMA, leading to
the same latency overhead of surf scheduling.
One would also notice that the benefits are
higher for larger networks because of the
increased average number of hops.
To clearly understand how the overhead
scales with network size or average number
of hops, we replotted zero-load latency of
2D mesh networks of sizes varying from 16
to 256 nodes with 16 domains under the
uniform random traffic pattern in Figure 3.
The latency of both baselines increases with
network size due to a higher average number
of hops. The overhead of surf scheduling is
almost independent of network size (average
number of hops), leading to a line parallel
to the baseline with a constant overhead of
19 cycles (except for 16 nodes) because the
packet wait time depends only on the number of dimensions and domains. On the
other hand, the larger the network, the
higher the overhead for TDMA scheduling
because a packet must wait for its turn at
each hop in the path to its destination. This
clearly shows that our scheme is scalable
with network size and proves our intuition
of latency overhead independent of the
number of hops. We can conclude that, in
general, the savings of surf scheduling are
more scalable with larger networks and a
higher number of domains.
Zero-load latency is just one latency metric. Thus, we now study latency as a function
of network offered load. Figure 4 shows


average latency measured after convergence


as a function of offered load for a 2D mesh
network of 64 nodes under uniform random
traffic patterns. We vary aggregate offered load on the x-axis; that is, if we have D domains, the value of the x-axis is the sum of the offered load of all D domains. We used two domains in this experiment. We can see that surf scheduling maintains its latency savings at all offered load values lower than the saturation point of the network.

Figure 3. Zero-load latency (cycles) versus network size with 16 domains. The two baselines overlap because zero-load latency does not depend on buffers and crossbar input speedup.

Figure 4. Average latency as a function of aggregate offered load (flits/cycle) of all domains for a 2D mesh network of 64 nodes. Latency is stable below the network saturation point.

Throughput
In Figure 4, although we can see that saturation throughput is reduced by about 11.7 percent, aggregate throughput loss is limited to 4.9 percent for two domains. Noninterference configurations have a higher saturation throughput than the small baseline because they use more resources, and lower than the fast baseline that includes the same resources because of unused time slots due to schedule enforcement.
To verify the benefits of assigning bandwidth nonuniformly, we performed an experiment on a 2D mesh network with 64 nodes and three domains. Bandwidth (VCs and time slots in the schedule) is assigned as follows: a quarter of the bandwidth is assigned to domain 0 and domain 1, each, and half of the bandwidth is assigned to domain 2. This nonuniform allocation is done by devising a schedule with four slots and assigning domain 3's time slots to domain 2. Saturation throughput, as expected, is 0.09 for both domains 0 and 1, and 0.21 for domain 2. Latency at a 5 percent injection rate is 36 (53) cycles for domain 2 and 39 (53) cycles for domains 0 and 1 using surf scheduling (straightforward TDMA). This shows that our scheme can have both latency and throughput benefits by designing a nonuniform surf schedule.

Area and power overhead
The main source of power and area overhead is the increased size of the crossbar with an input speedup of D. This increases both the crossbar and switch allocator area and power consumption linearly with D. Having an input speedup of D might be prohibitive in cases of large D. However, there is a performance/resources trade-off between wait time at the switch allocator and input speedup of the crossbar switch. Keeping our surf schedule in place while arbitrating the crossbar input port between VCs from different domains in a static, deterministic round-robin manner (regardless of requests) is the most straightforward approach. For example, in the case of 32 domains, we can use an input speedup of 4 instead of 32, and a flit will wait up to seven cycles before entering the crossbar. In general, if input speedup is S


and D is the number of domains (where 1 ≤ S ≤ D), flits can wait up to an extra D/S - 1 cycles to enter the crossbar and would wait longer than D - 1 cycles in turns. This is one way to avoid excessively large crossbars (and the slower clock rates they incur) as well. We use Equation 3 to find the maximum zero-load latency of such a scheme:

T0max = HP + (n - 1 + 2)(D - 1) + H(D/S - 1)    (3)

This essentially creates a continuum of designs between a strict TDMA (in fact, slightly worse for S = 1) and a full surf schedule (S = D).
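A small sketch of this trade-off (with assumed parameter values, and assuming S divides D) shows how Equation 3 interpolates between the TDMA-like and full-surf extremes:

```python
# Equation 3: with crossbar input speedup S (1 <= S <= D), each of the H hops
# can add up to D/S - 1 extra cycles of waiting to enter the crossbar.

def surf_max_zero_load_latency_with_speedup(H, P, D, S, n=2):
    return H * P + (n - 1 + 2) * (D - 1) + H * (D // S - 1)

# Example (assumed values): 32 domains, a 14-hop path, 4-cycle routers.
for S in (1, 4, 32):
    print(S, surf_max_zero_load_latency_with_speedup(14, 4, 32, S))
# S=1 is slightly worse than strict TDMA; S=32 recovers the full surf schedule.
```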


Programmers are increasingly asked to manage a complex collection of computing elements, including a variety of cores, accelerators, and special-purpose functions. Although these many-core architectures can be a boon for common-case performance and power efficiency, when an application demands a high degree of reliability or security, the advantages become a little less clear. On one hand, the ability to spatially separate computations means that critical operations can be physically isolated from malicious or untrustworthy components. There are many advantages to providing physical separation, which have been explored in the literature. On the other hand, real systems are likely to use different subsets of cores and accelerators based on an application's needs and thus will require a shared communication network. When a general-purpose interconnect is used, analyzing all the ways in which an attacker might influence the system becomes far more complicated. The problem is hard enough if we restrict ourselves to considering only average-case performance or packet ordering, but it becomes even more difficult if we attempt to prevent even cycle-level variations.
High-assurance systems are often divided
into a set of domains, which are kept separate. These domains should have no effect on
one another. For example, the Mars Curiosity
rover software runs on a RAD750 processor,

a single-core radiation-hardened version of


the Power architecture with a special-purpose
separation kernel. The kernel partitions the
tasks such as guidance, navigation, and the
various science packages from one another to
help prevent cascading failures. Future space
missions are looking to use multicore systems
that add another layer of communication,12,13 but there are serious concerns about
the introduction of opportunities for interference between system components.14,15 Such
interference can be catastrophic or even fatal.
The SurfNoC scheduling technique for meshes and tori reduces latency overhead while maintaining noninterference. Perhaps more importantly, it shows that security properties of on-chip networks can be reasoned about rigorously and that doing so has important ramifications on the network microarchitecture. Of course, our design is not the end-all design point: there are certainly inefficiencies remaining in our design, and how this work can extend to more general and modern NoC techniques remains to be seen. However, by coupling our design efforts with gate-level information flow analysis, we show that verifying noninterference properties at the level of gates and wires for a full router microarchitecture is indeed possible.
We do not foresee the trend toward
larger and more diverse NoCs on systems
reversing anytime soon. Thus, we believe
the techniques we've described will continue to grow in importance. SurfNoC might be extended to support more general routing functions, bandwidth sharing through time donation between domains in a security lattice, a reconfigurable router microarchitecture design, and predictable latency through throttling of source nodes. Similar design and analysis techniques could be extended beyond regular packet-switched networks to include virtual circuit-switching and irregular SoC-style interconnects. Although it is not clear where the limits are between security and performance in this space, if we are to avoid the continuing loop of patch-and-pray that inevitably falls out of the ad hoc approach we traditionally have taken to computer systems security, we will increasingly have to involve the architecture, hardware design, and verification teams in


providing a formally sound foundation for integration. MICRO
Acknowledgments
This work was funded in part by grants
CNS-1239567, CNS-1162187, and CCF-1117165. Jason K. Oberg is funded by a US
National Science Foundation graduate
research fellowship. The views and conclusions contained herein are those of the
authors and should not be interpreted as
necessarily representing the official policies
or endorsements, either expressed or implied, of the sponsoring agencies.

....................................................................
References

1. Y. Wang and G. Suh, Efficient Timing Channel Protection for On-Chip Networks, Proc. 6th IEEE/ACM Intl Symp. Networks on Chip (NoCS 12), 2012, pp. 142-151.
2. B. Grot et al., Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees, Proc. 38th Ann. Intl Symp. Computer Architecture (ISCA 11), 2011, pp. 401-412.
3. B. Grot, S.W. Keckler, and O. Mutlu, Preemptive Virtual Clock: A Flexible, Efficient, and Cost-Effective QoS Scheme for Networks-on-Chip, Proc. 42nd Ann. IEEE/ACM Intl Symp. Microarchitecture, 2009, pp. 268-279.
4. B. Grot, S.W. Keckler, and O. Mutlu, Topology-Aware Quality-of-Service Support in Highly Integrated Chip Multiprocessors, Proc. Intl Conf. Computer Architecture (ISCA 10), 2010, pp. 357-375.
5. J.W. Lee, M.C. Ng, and K. Asanovic, Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks, Proc. 35th Ann. Intl Symp. Computer Architecture (ISCA 08), 2008, pp. 89-100.
6. J. Rushby, Partitioning for Avionics Architectures: Requirements, Mechanisms, and Assurance, NASA contractor report CR-1999-209347, NASA Langley Research Center, 1999.
7. M. Tiwari et al., Crafting a Usable Microkernel, Processor, and I/O System with Strict and Provable Information Flow Security, Proc. 38th Ann. Intl Symp. Computer Architecture (ISCA 11), 2011, pp. 189-200.
8. M. Tiwari et al., Complete Information Flow Tracking from the Gates Up, Proc. 14th Intl Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 09), 2009, pp. 109-120.
9. M. Kinsy and M. Pellauer, Heracles: Fully Synthesizable Parameterized MIPS-based Multicore System, tech. report MIT-CSAIL-TR-2010-058, MIT Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 2010.
10. H.M.G. Wassel et al., SurfNoC: A Low Latency and Provably Non-Interfering Approach to Secure Networks-on-Chip, Proc. 40th Ann. Intl Symp. Computer Architecture (ISCA 13), 2013, pp. 583-594.
11. W. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann, 2003.
12. M. Malone, talk on OPERA RHBD Multicore, Military and Aerospace Programmable Logic Devices (MAPLD) Workshop, Greenbelt, MD, Aug. 2009; https://nepp.nasa.gov/mapld_2009/talks/083109_Monday/03_Malone_Michael_mapld09_pres_1.pdf.
13. J.-L. Terraillon, Multicore Processors: The Next Generation Computer for ESA Space Missions, keynote address, 17th Intl Conf. Reliable Software Technologies, 14 June 2012; www.cister.isep.ipp.pt/ae2012/keynote#MP.
14. E. Ong, O. Brown, and M.J. Losinski, System F6: Progress to Date, Proc. Small Satellite Conf.: Enhancing Global Awareness through Small Satellites, 2012, p. 7; http://digitalcommons.usu.edu/smallsat/2012/all2012/10/.
15. W.R. Otte et al., F6com: A Component Model for Resource-Constrained and Dynamic Space-Based Computing Environments, Proc. 16th IEEE Intl Symp. Object/Component/Service-Oriented Real-Time Distributed Computing, 2013; www.isis.vanderbilt.edu/node/4552.

Hassan M.G. Wassel is a software engineer


at Google, working in the Platforms group.


His research includes computer architecture


and its interaction with software and novel
hardware technologies in order to build
energy-efficient, high-performance, and reliable systems. Wassel has a PhD in computer
science from the University of California,
Santa Barbara, where he performed the work
for this article. He is a member of IEEE and
the ACM.
Ying Gao is a PhD student in computer
engineering at the University of California,
Santa Barbara. Her research focuses on
security computer architectures and improving program security by using intelligent
hardware. Gao has a BS in optoelectronics
from Tianjin University, China.
Jason K. Oberg is a PhD candidate in the
Computer Science and Engineering Department at the University of California, San
Diego. His research interests include testing
and verification methods for secure hardware
design and methodologies for identifying timing-based side channels in hardware. Oberg
has an MS in computer engineering from the
University of California, San Diego.

Ted Huffmire is an assistant professor of


computer science at the Naval Postgraduate
School. His research focuses on the intersection of computer architecture and computer
security. Huffmire has a PhD in computer
science from the University of California,
Santa Barbara.
Ryan Kastner is a professor in the Department of Computer Science and Engineering
at the University of California, San Diego.
His research focuses on embedded system
design, particularly the use of reconfigurable
computing devices for digital signal processing as well as hardware security. Kastner has
a PhD in computer science from the University of California, Los Angeles.
Frederic T. Chong is the director of the
Greenscale Center for Energy-Efficient Computing, director of the Computer Engineering
Program, and a professor of computer science
at the University of California, Santa Barbara.
His research interests include emerging technologies, multicore and embedded architectures, computer security, and sustainable
computing. Chong has a PhD in electrical
engineering and computer science from the
Massachusetts Institute of Technology.
Timothy Sherwood is a professor in the
Department of Computer Science at the University of California, Santa Barbara. His
research focuses on the development of novel
computer architectures for security, introspection, and embedded applications. Sherwood
has a PhD in computer science from the University of California, San Diego. He is a member of IEEE and the ACM.


Direct questions and comments about this


article to Hassan Wassel, 1600 Amphitheater Pkwy, Mountain View, CA 94043;
hwassel@gmail.com.

CACHE COHERENCE FOR GPU ARCHITECTURES

GPUS HAVE BECOME AN ATTRACTIVE TARGET FOR ACCELERATING PARALLEL APPLICATIONS AND DELIVERING SIGNIFICANT SPEEDUPS AND ENERGY-EFFICIENCY GAINS OVER MULTICORE CPUS. PROGRAMMING GPUS, HOWEVER, REMAINS CHALLENGING BECAUSE EXISTING GPUS LACK THE WELL-DEFINED MEMORY MODEL REQUIRED TO SUPPORT HIGH-LEVEL LANGUAGES SUCH AS C++ AND JAVA. TEMPORAL COHERENCE, A SIMPLE AND INTUITIVE TIMER-BASED COHERENCE FRAMEWORK OPTIMIZED FOR GPU, TACKLES THIS CHALLENGE.

Inderpreet Singh
Qualcomm

Arrvindh Shriraman
Simon Fraser University

Wilson W.L. Fung
University of British Columbia

Mike O'Connor
AMD Research

Tor M. Aamodt
University of British Columbia

......

Recent research has demonstrated GPU versatility in accelerating even


highly irregular parallel algorithms containing nonuniform communication patterns.
As the true potential of these massively multithreaded architectures is realized, the question becomes: Can GPU programming be
made as accessible to the average software
developer as CPU programming is now?
CPU programmers rely heavily on synchronization primitives provided by high-level
languages, such as C and Java, to make
it easier to manage parallel tasks. Current
GPU programming models do not provide
such semantics and force programmers to be aware of and work around the memory system's intricacies. Our work proposes to ease GPU programming by introducing a well-defined memory consistency model, and hence support for C++ and Java, to GPUs. Specifically, we address the challenge of an efficient GPU cache-coherence mechanism that can support such a consistency model.
Multicore processors regularly employ
hardware cache coherence to enforce strict


memory consistency models. In this article,


we explore the challenges of introducing
these CPU-optimized protocols to GPU
hardware. We show how the key differences
between CPU and GPU architectures
namely, the latters massive multithreading
and throughput-oriented memory system
present challenges that are unaddressable by
traditional coherence. These challenges stem
from the introduction of coherence messages
on a GPU. We propose a unique cache-coherence framework, called Temporal Coherence
(TC), which addresses all of these challenges
by completely eliminating coherence messages. In doing so, we improve application
performance and reduce power consumption
and hardware complexity, all the while easing
GPU programming.
The observations resulting from this work
have important implications for CPU-GPU
heterogeneous computing. The Heterogeneous System Architecture (HSA) Foundation
has identified that cache coherence will enable mainstream languages such as C++ and Java to efficiently distribute fine-grained tasks between the on-chip CPU and GPU.1 CPUs


and GPUs impose conflicting requirements and challenges on hardware coherence that might not be addressable by a one-size-fits-all solution. For example, we find that while
CPU coherence optimizes to capture store
locality, such optimizations on GPUs hinder
rather than improve performance. Effectively
managing these trade-offs will be the key to
realizing coherency across heterogeneous
processors.

GPU architecture


Figure 1 shows the organization of our baseline noncoherent GPU architecture. An OpenCL or CUDA application begins execution on a CPU and launches compute kernels onto a GPU. Each kernel launches a hierarchy of threads (workgroups of wavefronts of scalar threads) onto the set of heavily multithreaded GPU cores. The scalar threads are managed as a single-instruction, multiple-data (SIMD) execution group consisting of 32 or 64 threads, or a wavefront (or, in Nvidia terminology, a warp).

Figure 1. High-level overview of the baseline noncoherent GPU architecture. The application executes on a CPU and launches compute kernels onto a GPU; each kernel then launches threads onto the multithreaded GPU cores.
A GPU kernel commonly accesses the
local, thread-private, and global memory
spaces. Threads within a workgroup can
communicate via an on-chip, per-core, software-managed local memory. Both thread-private and global memory are stored in
off-chip graphics double-data-rate (GDDR)
DRAM and cached in the multilevel cache

hierarchy, but only global memory requires


coherence.
The GPU cache hierarchy consists of per-core private L1 data caches and a shared, write-back L2 cache. Similar to existing GPUs, our baseline GPU's L1 caches are not coherent. The off-chip DRAM memory and the L2 cache are distributed among several memory partitions that connect to the GPU cores through an interconnection network. Point-to-point ordering in the interconnection network, L2 cache controllers, and off-chip
DRAM channels ensures that multiple outstanding writes from the same wavefront to the
same address complete in program order. All
cache controllers service one memory request
per cycle in order. L2 misses are handled by
allocating a Miss Status Holding Register
(MSHR) entry and removing the request from
the request queue; this prevents stalling.

GPU coherence
In recent GPUs, such as Nvidia's Fermi series and the Advanced Micro Devices (AMD) Southern Islands series, the noncoherent private L1 caches can exploit memory access locality in GPU applications without requiring the programmer to explicitly manage data transfer to/from software-managed scratchpad memory. However, these noncoherent caches might contain stale versions of global data; this can introduce errors in applications that expect threads to communicate updated data via a coherent memory system. Disabling these L1 caches provides trivial coherence for these applications, but at a cost of performance and energy efficiency. Figure 2a shows the significant performance drawback of this naive solution. The performance of our trivial cache-disabled GPU (NO-L1) is 88 percent worse than a GPU with an idealized coherence protocol (IDEAL-COH). In other words, by enabling private caching for this set of GPU applications, hardware coherence lets us improve performance by 88 percent.

Implementation challenges
Because coherent L1 caches give GPUs a
clear performance benefit, the next question is: Can existing CPU cache-coherence protocols be equally effective on GPUs? We now discuss the three significant overheads that


conventional coherence mechanisms introduce on GPU architectures.

Coherence traffic
As Figure 2b shows, traditional CPU protocols implemented on a GPU architecture


introduce a new overhead to a set of GPU
applications that do not require coherent
memory. For example, a commonly used
directory-based write-back MESI coherence
protocol introduces a 127 percent interconnect traffic overhead to our baseline noncoherent GPU. Write-back protocols such as MESI trade off interconnect traffic to capture store locality. Although store locality is common in CPU applications, it's rare in GPU applications.
We also implemented a directory-based write-through protocol, called GPU-VI (valid/invalid), which we specifically optimized for GPU architectures. GPU-VI introduces a 31 percent traffic overhead to our noncoherent GPU. Both MESI and GPU-VI are inclusive protocols and append the directory information on the L2 cache lines. The traffic overhead results from recall messages sent to invalidate any cached data that the L2 cache, acting as the directory, can no longer track due to its limited capacity. The GPU processes significantly more data in parallel than a CPU, leading to high cache contention and hence frequent L2 cache evictions that generate recall messages.
Another alternative is a noninclusive protocol, in which the directory can be sized independently of the last-level cache. In our more detailed paper about this work,2 we show that a noninclusive protocol not only increases complexity, but even very large and highly associative directories can't solve the traffic overhead on GPUs.
We believe these protocols are representative of CPU coherence in general and highlight the incompatibility between CPU and GPU coherence. Our proposed GPU coherence protocol doesn't use coherence messages and thus eliminates this overhead entirely.

Figure 2. GPU coherence opportunity and challenges. Although hardware coherence can provide a significant performance gain, existing coherence protocols are ill-suited for GPUs. An idealized coherence protocol can improve performance by 88 percent over a GPU without hardware coherence that must disable L1 caches (a). Traditional coherence protocols, such as write-back MESI and write-through valid/invalid (VI), introduce an interconnect traffic overhead (and hence a power overhead) to a baseline GPU that lacks coherence (b).

Storage requirements
The massively parallel GPU memory system can also introduce significant storage overhead for buffering in-flight coherence requests. Coherence implementations leverage MSHRs to buffer these requests, which can't be processed immediately at the L2 cache. MSHR tables are sized according to the worst-case number of in-flight accesses to avoid potential protocol deadlocks. With only tens of threads, CPU coherence implementations can easily dedicate enough on-chip storage resources to buffer the worst-case number of coherence requests.3
GPUs, however, process tens of thousands of memory requests in parallel. Relying on traditional CPU coherence designs means dedicating hardware storage as large as 28 percent of the GPU last-level cache's size for buffering the many potential in-flight accesses. By not utilizing coherence messages, our proposed TC protocol doesn't need worst-case buffering and eliminates the storage overhead.

Protocol complexity
Coherence protocols are highly complex,
requiring numerous states and transitions.4
Many of these states are transient states added
to the protocols to guard against a potential
protocol race between accesses to the same
cache block. For example, the MESI protocol
must handle two cores requesting exclusive
state to the same cache block. GPUs



implement far simpler noncoherent memory systems that are highly optimized for massively parallel workloads. Adding hardware coherence to GPUs presents a massive design, implementation, and verification challenge for GPU manufacturers. A GPU coherence protocol that minimizes the added complexity is therefore a key challenge.
Table 1 quantifies the complexity by enumerating the coherence states and shows that the noncoherent implementation is the least complex. Our proposed TC protocol introduces the fewest coherence states on top of the noncoherent baseline. Furthermore, traditional coherence protocols require additional virtual networks or deadlock detection mechanisms to ensure forward progress.4 Our protocol doesn't require any additional virtual networks and is deadlock-free.

Table 1. Number of protocol states.

Cache      State type            Noncoherent   GPU-VI   GPU-VIni   MESI   TC-Weak
L1 cache   Stable                2             2        2          4      2
           Transient cache       2             2        2          2      2
           Transient coherent    0             1        1          4      1
           Total L1 states       4             5        5          10     5
L2 cache   Stable                2             3        5          4      4
           Transient cache       2             2        2          3      2
           Transient coherent
           Total L2 states

Temporal coherence


In essence, the task of an invalidation-based coherence protocol is to communicate, among a set of nodes, the transitions between a memory location's single-writer epochs and multiple-reader epochs.4 A memory location in one of its single-writer epochs can only be exclusively written by a single core. The single-writer epoch ends, transitioning to a multiple-reader epoch, when another core has acquired a shared copy of the memory location. The location returns to a single-writer epoch when one of the cores has acquired exclusive access to the location by invalidating all stale copies cached in other cores.
Here, we present TC, our timer-based cache-coherence framework designed to address the needs of high-throughput GPU architectures. Time-based coherence uses the insight that single-chip systems can implement synchronized counters to enable low-cost transfer of coherence information. Specifically, if the lifetime of a memory address's current epoch can be predicted and shared among all readers when the location is read, then these counters allow the readers to self-invalidate synchronously, eliminating the need for end-of-epoch invalidation messages. Compared to traditional CPU cache coherence, TC requires fewer modifications to GPU hardware and enables greater memory-level parallelism.
Figure 3 compares the invalidation handling of the GPU-VI directory protocol and TC. The figure shows a read by processors C1 and C2, followed by a store from C1, all to the same memory location. Figure 3a shows the sequence of events for the write-through GPU-VI directory protocol. Processor C1 issues a load request to the directory (1) and receives data. Processor C2 issues a load request (2) and receives the data as well. C1 then issues a store request (3). The directory, which stores an exact list of sharers, sees that C2 must be invalidated before the write can complete and sends an invalidation request to C2 (4). C2 receives the invalidation request, invalidates the block in its private cache, and sends an acknowledgment back (5). The directory receives the invalidation acknowledgment from C2 (6), completes C1's store request, and sends C1 an acknowledgment (7).
Figure 3b shows how TC handles the invalidation for this example. When C1 issues a load request to the L2 cache, it predicts that the read-only epoch for this address will end at time T = 15 (1').


Figure 3. Coherence invalidation mechanisms. The sequence of events for the write-through GPU-VI directory protocol (a) and Temporal Coherence (TC) invalidation handling for the same example (b). (R = read, D = data, W = write, Inv = invalidation.)

The L2 receives C1's load request and epoch lifetime prediction, records it, and replies with the data and a timestamp of T = 15 (2'). The timestamp indicates to C1 that it must self-invalidate this address in its private cache by T = 15. When C2 issues a load request, it predicts the epoch to end at time T = 20 (3'). The L2 receives C2's request, checks the timestamp stored for this address, extends it to T = 20 to accommodate C2's request, and replies with the data and a timestamp of T = 20 (4'). At time T = 15 (5'), C1's private cache self-invalidates the local copy of the address. At time T = 20 (6'), C2 self-invalidates its local copy. When C1 issues a store request to the L2 (7'), the L2 finds the global timestamp T = 20 to be less than the current time T = 25. The expired timestamp indicates that none of the L1 caches contain a valid copy of this address. The L2 completes the write instantly and sends an acknowledgment to C1 (8').
Unlike GPU-VI, TC doesn't use invalidation messages. Globally synchronized counters allow the L2 to make coherence decisions locally and without indirection.
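The bookkeeping this example requires at the L2 amounts to a few timestamp comparisons against a synchronized counter. The following sketch is our own illustration of that logic in Python, not the authors' hardware; names such as predicted_lifetime and the dictionary-based cache are assumptions made only to keep the example self-contained.

```python
class L2Line:
    def __init__(self):
        self.data = None
        self.global_ts = 0   # time by which every L1 copy will have self-invalidated


class TCL2Bank:
    """Toy model of Temporal Coherence bookkeeping at one L2 bank."""

    def __init__(self):
        self.lines = {}      # addr -> L2Line
        self.now = 0         # synchronized counter shared with the GPU cores

    def tick(self, cycles=1):
        self.now += cycles

    def load(self, addr, predicted_lifetime):
        # Extend the read-only epoch to cover this reader's predicted lifetime.
        line = self.lines.setdefault(addr, L2Line())
        line.global_ts = max(line.global_ts, self.now + predicted_lifetime)
        # The reply carries data plus the timestamp; the L1 stores it as a local
        # timestamp and treats the copy as invalid once the counter passes it.
        return line.data, line.global_ts

    def store(self, addr, value):
        line = self.lines.setdefault(addr, L2Line())
        if line.global_ts <= self.now:
            # Expired epoch: no L1 can hold a valid copy, so the write
            # completes immediately, with no invalidation messages.
            line.data = value
            return "ack"
        # Unexpired epoch: TC-Strong stalls the write here, whereas TC-Weak
        # (described below) acks immediately and reports line.global_ts as the
        # write's global write completion time.
        return "pending", line.global_ts
```

Either way, the only per-line state the L2 keeps is a single timestamp, which is what lets the protocol avoid sharer lists and invalidation acknowledgments.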

This example shows how a TC framework can achieve our desired goals for GPU coherence: all coherence traffic has been eliminated and, because there are no invalidation messages, the transient states recording the state of outstanding invalidation requests are no longer necessary. Lifetime prediction is important in timer-based coherence because it affects cache utilization and application performance. Our full paper describes our simple predictor, which adjusts the requested lifetime based on application behavior.2
Here, we propose two implementations of the TC framework: TC-Strong and TC-Weak. TC-Strong stalls writes to unexpired L2 cache lines and is similar to the Library Cache Coherence protocol,5 a timer-based CPU coherence scheme. These stalled writes cause TC-Strong to perform poorly on a GPU. TC-Weak uses a novel timer-based memory fence mechanism to eliminate stalling of writes.
Figure 4 compares the high-level operation of TC-Strong and TC-Weak. Two cores, C2 and C3, have addresses A and B, respectively, cached in their private L1 caches.


Figure 4. TC operation. The high-level operation of the TC-Strong and TC-Weak implementations. (F = fence; WA = write to address A; an X indicates self-invalidation of a cache block.)

Figure 5. Hardware extensions for the TC-Weak implementation. GPU cores and memory partitions with synchronized counters, and a Global Write Completion Time (GWCT) table added to each GPU core (a). L1 and L2 cache lines extended with timestamp fields (b).

In TC-Strong, C1's write to A stalls completion until C2 self-invalidates its locally cached copy of A. Similarly, C1's write to B stalls completion until C3 self-invalidates its copy of B. In TC-Weak, C1's writes to A and B do not stall waiting for other copies to be self-invalidated. Instead, the fence operation ensures that all previously written addresses have been self-invalidated in other local caches. This ensures that all previous writes from this core will be globally visible after the fence completes.

TC-Strong coherence


TC-Strong implements release consistency with write atomicity.6 With TC-Strong, each GPU core has a private, write-through L1 cache, and the cores share a write-back L2 cache. It requires the synchronized timestamp counters at the GPU cores and L2 controllers shown in Figure 5a to provide these components with the current system time. A small timestamp field is added to each cache line in the L1 and L2 caches, as Figure 5b shows. The local timestamp value in an L1 cache line indicates the time until which that cache line is valid; an L1 cache line with a local timestamp less than the current system time is invalid. The global timestamp value in the L2 indicates a time by which all L1 caches will have self-invalidated this cache line.
Every load request checks both the tag and the local timestamp of the L1 line. It treats a valid tag match that has an expired local timestamp as a miss, so self-invalidating an L1 block doesn't require explicit events. A load miss at the L1 generates a request to the L2 with a lifetime prediction. The L2 controller updates the global timestamp to the maximum of the current global timestamp and the requested local timestamp to accommodate the requested time period. The L2 responds to the L1 with the data and the global timestamp. The L1 updates its data and local timestamp with the values in the response message before completing the load. A store request writes through to the L2, where its completion is delayed until the global timestamp has expired.
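On the core side, the L1 lookup adds only a timestamp comparison to the usual tag check. The sketch below continues the illustration above (same assumptions, not the authors' implementation): an expired local timestamp simply behaves like a miss, so no explicit invalidation event is ever needed.

```python
class L1Line:
    def __init__(self, tag, data, local_ts):
        self.tag = tag
        self.data = data
        self.local_ts = local_ts   # line is valid only while now <= local_ts


class TCL1Cache:
    """Toy model of a TC L1 cache: expired lines behave exactly like misses."""

    def __init__(self, l2_bank, lifetime_predictor, num_sets=1024):
        self.lines = {}                 # set index -> L1Line
        self.l2 = l2_bank               # e.g., the TCL2Bank sketch above
        self.predict = lifetime_predictor
        self.num_sets = num_sets

    def load(self, addr):
        index, tag = addr % self.num_sets, addr // self.num_sets
        line = self.lines.get(index)
        now = self.l2.now               # synchronized counter
        if line is not None and line.tag == tag and line.local_ts >= now:
            return line.data            # hit on an unexpired line
        # Miss (no line, tag mismatch, or expired timestamp): request the block
        # from the L2 together with a predicted lifetime for this epoch.
        data, ts = self.l2.load(addr, self.predict(addr))
        self.lines[index] = L1Line(tag, data, ts)
        return data
```

A fixed predictor such as lambda addr: 100 is enough to exercise the sketch; the full paper's predictor adapts this value to application behavior.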


Core C1                          Core C2
S1: data = NEW                   L1: r1 = flag
F1: FENCE                        B1: if (r1 != SET) goto L1
S2: flag = SET                   L2: r2 = data

Figure 6. Comparing two implementations of the TC framework. An example code snippet (a). Sequence of events for C1 (left) that occur due to the code in (a) and the state of C2's blocks (right) for TC-Strong (b). Sequence of events with TC-Weak (c). In panels (b) and (c), shading distinguishes write stalling at the L2 (TC-Strong only), the fence waiting for pending requests (both), and the fence waiting for the GWCT (TC-Weak only).

Figure 6b shows how TC-Strong maintains coherence. The example code snippet4 in Figure 6a represents a common programming idiom used to implement nonblocking queues in pipeline-parallel applications.7 Figure 6b shows the memory requests generated by core C1 on the left, and the state of the two memory locations, flag and data, in C2's L1 on the right. Initially, C2 has flag and data cached with local timestamps of 60 and 30, respectively. For simplicity, we assume that C2's operations are delayed.
C1 executes instruction S1 and generates a write request to the L2 for data (1), and subsequently issues the memory fence instruction F1 (2). Because the wavefront has an outstanding store request, F1 defers the wavefront's scheduling. When S1's store request reaches the L2

.............................................................

MAY/JUNE 2014

micro
IEEE

75

Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page

M
q
M
q

M
q

M
q
MQmags
q

THE WORLDS NEWSSTAND

micro
IEEE

Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page

M
q
M
q

M
q

M
q
MQmags
q

THE WORLDS NEWSSTAND

..............................................................................................................................................................................................
TOP PICKS

(3), the L2 stalls it because data's global timestamp will not expire until time T = 30. At T = 30, C2 self-invalidates data (4), and the L2 processes S1's store (5). The fence instruction completes when C1 receives the acknowledgment for S1's request (6). The same sequence of events occurs for the store to flag by S2: the L2 stalls S2's write request (7) until flag self-invalidates in C2 (8).

TC-Weak coherence


TC-Strong enforces coherence across all data by stalling writes. TC-Weak uses the insight that GPU applications might contain significant amounts of data that does not require coherence and is unnecessarily penalized by write-stalling. By relaxing write atomicity, TC-Weak eliminates write-stalling and shifts any potential stalling to explicit memory fence operations. This provides two main benefits. First, it eliminates expensive stalling at the shared L2 cache controllers, which affects all cores and wavefronts, and shifts it to the scheduling of individual wavefronts at memory fences. A wavefront descheduled due to a memory fence doesn't affect the performance of other wavefronts. Second, it enforces coherence only when required and specified by the program through memory fences. It implements the release consistency with special accesses processor consistent (RCpc) consistency model;6 a detailed discussion is available elsewhere.8
In TC-Weak, writes to unexpired global timestamps at the L2 do not stall. The write response returns with the L2 cache line's global timestamp at the time of the write. The returned global timestamp is the guaranteed time by which the write will become visible to all system cores, because by this time all cores will have invalidated their privately cached stale copies. TC-Weak tracks the global timestamps, called Global Write Completion Times (GWCTs), returned by writes for each wavefront. A memory fence operation uses this information to deschedule the wavefront long enough to guarantee that all previous writes from the wavefront have become globally visible.
As Figure 5a shows, TC-Weak adds a small GWCT table to each GPU core. The GWCT table contains 48 entries, one for each wavefront in a GPU core. Each entry holds a timestamp value that corresponds to the maximum of all GWCTs observed for that wavefront. A memory fence in TC-Weak deschedules a wavefront until all pending write requests from the wavefront have returned acknowledgments and until the wavefront's timestamp in the GWCT table has expired. The latter ensures that all previous writes have become visible to the system by fence completion.
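The fence condition itself is small enough to state as code. The sketch below is our own rendering of the GWCT bookkeeping described above (the method names are assumptions); it tracks the maximum GWCT returned by write acknowledgments and lets a fence complete only once the wavefront has no outstanding writes and that timestamp has expired.

```python
class GWCTTable:
    """Per-core table with one entry per wavefront (48 in the design above)."""

    def __init__(self, num_wavefronts=48):
        self.gwct = [0] * num_wavefronts            # max GWCT seen per wavefront
        self.pending_writes = [0] * num_wavefronts  # outstanding store count

    def on_store_issued(self, wid):
        self.pending_writes[wid] += 1

    def on_store_ack(self, wid, gwct_from_l2):
        # The L2 acks the write immediately; the returned timestamp is the time
        # by which the write is guaranteed to be visible to every core.
        self.pending_writes[wid] -= 1
        self.gwct[wid] = max(self.gwct[wid], gwct_from_l2)

    def fence_can_complete(self, wid, now):
        # TC-Weak keeps the wavefront descheduled until both conditions hold;
        # other wavefronts keep running in the meantime.
        return self.pending_writes[wid] == 0 and self.gwct[wid] <= now
```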
Figure 6c shows how TC-Weak maintains coherence for the execution of C1's memory instructions from Figure 6a. C1 executes S1 and sends a store request to the L2 for data (1'). Subsequently, C1 issues a memory fence operation (2') that defers scheduling of the wavefront because S1 has an outstanding memory request. The L2 receives the store request (3') and returns the current global timestamp stored in the L2 for data. In this case, the value returned is 30 and corresponds to C2's initially cached copy. The L2 doesn't stall the write and sends back an acknowledgment with the GWCT, which updates C1's GWCT entry for this wavefront. After C1 receives the acknowledgment (4'), no memory requests are outstanding. The wavefront's scheduling is still deferred, however, because its GWCT entry, containing a timestamp of 30, has not yet expired. As data self-invalidates in C2's cache (5'), the wavefront's GWCT expires and the fence is allowed to complete (6'). The next store instruction, S2, sends a store request (6') to the L2 for flag. The L2 returns a GWCT of 60 (7'), corresponding to C2's cached copy.
Comparing Figures 6b and 6c shows that TC-Weak performs better than TC-Strong because it stalls only at explicit memory fence operations. This ensures that writes to data that doesn't require coherence have minimal impact.

Evaluation
We modeled a cache-coherent GPU architecture by extending GPGPU-Sim (version 3.1.2)9 with the Ruby memory system model from the General Execution-Driven Multiprocessor Simulator (GEMS).10


Table 2. Evaluated GPU applications requiring coherent caches.

Application   Description
BH            Barnes-Hut algorithm for N-body simulation
CC            Graph cuts for image segmentation
CL            Cloth physics simulation
DLB           Octree partitioning with dynamic load balancing
STN           Stencil computation for 3D wave propagation
VPR           Versatile place and route

Figure 7. Performance of GPU memory systems. Coherent protocols compared with a baseline GPU with L1 caches disabled (NO-L1) (a). The same protocols compared against a noncoherent baseline with L1 caches enabled (NO-COH) (b). (HM = harmonic mean.)

The baseline noncoherent memory system and all coherence protocols are implemented in the Specification Language including Cache Coherence (SLICC). We acquired the MESI cache-coherence protocol from gem5.11 Our GPGPU-Sim extended with Ruby is configured to model a generic Nvidia Fermi GPU.12 We used the Orion 2.0 model13 to estimate interconnect power consumption, and we evaluated two sets of applications. One set contains emerging GPU applications with inter-workgroup communication that require coherent caches for correctness (see Table 2). The other set contains more mainstream GPU applications featuring only intra-workgroup communication; these run correctly with noncoherent private L1 caches. Our other paper offers details on our simulation configuration and benchmarks.2
Figure 7a compares the performance of coherence protocols against a baseline GPU with L1 caches disabled (NO-L1) for applications with inter-workgroup communication. Figure 7b compares them against the noncoherent baseline protocol with L1 caches enabled (NO-COH) for applications with intra-workgroup communication. Figures 8a and 8b show the breakdown of interconnect traffic between different coherence protocols.


Figure 8. Breakdown of interconnect traffic for coherent and noncoherent GPU memory systems. Inter-workgroup communication (a); intra-workgroup communication (b).


LD, ST, and ATO are the data traffic from load, store, and atomic requests, respectively. MESI performs atomic operations at the L1 cache, so its atomic traffic is included in ST. REQ refers to control traffic for all protocols. INV and RCL are invalidation and recall traffic, respectively.
For applications with inter-workgroup communication, TC-Weak (TCW) performs 85 percent better than the baseline noncoherent GPU. MESI and GPU-VI achieve similar speedups for these applications. However, MESI's write-allocate policy at the L1 cache significantly increases store traffic due to unnecessary refills of write-once data. As a result, MESI performs significantly worse than the write-through protocols on applications that feature only intra-workgroup communication.
On average, MESI increases interconnect traffic over the baseline noncoherent GPU by 75 percent across all applications. The write-through GPU-VI introduces unnecessary invalidation and recall traffic, averaging a traffic overhead of 31 percent for applications with only intra-workgroup communication. TC-Weak removes all invalidations and recalls; as a result, it reduces interconnect traffic by 56 percent over MESI and 23 percent over GPU-VI for this set of applications. This traffic reduction leads to energy savings in the interconnect and allows TC-Weak to perform almost identically to the baseline noncoherent GPU for these applications.

Our evaluation also shows that TC-Weak performs 28 percent better on average than TC-Strong and reduces interconnect traffic by 26 percent. Our full paper contains further analysis of the impact of lifetime prediction on TC-Weak's performance.2

Our work motivates a need to rethink our approach to implementing cache coherence on future heterogeneous systems. We believe that TC's simplicity will ease the task of providing coherence for these future systems that integrate a diverse set of accelerators tailored for different applications.

Acknowledgments
We thank Mark Hill, Hadi Jooybar,
Timothy Rogers, and the anonymous
reviewers for their invaluable comments.
This work was partly supported by funding
from the Natural Sciences and Engineering
Research Council of Canada and Advanced
Micro Devices.

....................................................................
References

1. P. Rogers, "Heterogeneous Systems Architecture Overview," presentation, Hot Chips 2013; http://hsafoundation.com/hot-chips2013-hsa-foundation-presented-deeper-detail-hsa-hsail.
2. I. Singh et al., "Cache Coherence for GPU Architectures," Proc. 19th Int'l Symp. High Performance Computer Architecture, 2013, pp. 578-590.

IEEE MICRO

Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page

M
q
M
q

M
q

M
q
MQmags
q

THE WORLDS NEWSSTAND

micro
IEEE

Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page

M
q
M
q

M
q

M
q
MQmags
q

THE WORLDS NEWSSTAND

3. J. Feehrer et al., "Coherency Hub Design for Multisocket Sun Servers with CoolThreads Technology," IEEE Micro, vol. 29, no. 4, 2009, pp. 36-47.
4. D.J. Sorin et al., A Primer on Memory Consistency and Cache Coherence, Morgan and Claypool Publishers, 2011.
5. K.S. Shim et al., "Library Cache Coherence," tech. report MIT-CSAIL-TR-2011-027, Computer Science and Artificial Intelligence Lab, MIT, 2011.
6. K. Gharachorloo et al., "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors," Proc. 17th Int'l Symp. Computer Architecture, 1990, pp. 15-26.
7. J. Giacomoni, T. Moseley, and M. Vachharajani, "FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue," Proc. 13th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, 2008, pp. 43-52.
8. I. Singh et al., "Temporal Coherence: Hardware Cache Coherence for GPU Architectures," tech. report, Electrical and Computer Eng., Univ. of British Columbia, 2013.
9. A. Bakhoda et al., "Analyzing CUDA Workloads Using a Detailed GPU Simulator," Proc. Int'l Symp. Performance Analysis of Systems and Software, 2009, pp. 163-174.
10. M. Martin et al., "Multifacet's General Execution-Driven Multiprocessor Simulator (GEMS) Toolset," SIGARCH Computer Architecture News, vol. 33, no. 4, 2005, pp. 92-99.
11. N. Binkert et al., "The Gem5 Simulator," SIGARCH Computer Architecture News, vol. 39, no. 2, 2011, pp. 1-7.
12. Nvidia's Next Generation CUDA Compute Architecture: Fermi, white paper, Nvidia, 2009.
13. A.B. Kahng et al., "Orion 2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration," Proc. Design, Automation, and Test, 2009, pp. 423-428.

Inderpreet Singh is an engineer at Qualcomm. His research interests include memory models and GPU computing. Singh has an MASc in computer engineering from the University of British Columbia.

Arrvindh Shriraman is an assistant professor in the School of Computer Science at Simon Fraser University, where he is the faculty coleader of the Systems Networking and Architecture Research (SYNAR) group. His research interests include multiprocessor systems design, the hardware-software interface, and memory systems. Shriraman has a PhD in computer science from the University of Rochester.

Wilson W.L. Fung is a PhD student in the Department of Electrical and Computer Engineering at the University of British Columbia. His research interests include parallel programming models for GPU-like accelerators. Fung has an MASc in computer engineering from the University of British Columbia. He is a member of IEEE and the ACM.

Mike O'Connor is a senior research scientist at Nvidia; he performed this work while he was a researcher at AMD Research. O'Connor has an MS in electrical and computer engineering from the University of Texas at Austin. He is a senior member of IEEE and a member of the ACM.

Tor M. Aamodt is an associate professor in the Department of Electrical and Computer Engineering at the University of British Columbia. His research interests include the architecture of programmable accelerators and energy-efficient computing. Aamodt has a PhD in electrical and computer engineering from the University of Toronto. He is a member of IEEE and the ACM.

Direct questions and comments about this article to Inderpreet Singh, 2332 Main Mall, Vancouver, BC, Canada V6T 1Z4; isingh@ece.ubc.ca.


A CONFIGURABLE AND STRONG


RAS SOLUTION FOR DIE-STACKED
DRAM CACHES

................................................................................................................................................................................................................

DIE-STACKED MEMORY'S RESILIENCY PROBLEM WILL BECOME IMPORTANT BECAUSE OF ITS LACK OF SERVICEABILITY. THIS ARTICLE DETAILS HOW TO PROVIDE PRACTICAL AND COST-EFFECTIVE RELIABILITY, AVAILABILITY, AND SERVICEABILITY SUPPORT FOR DIE-STACKED DRAM CACHE ARCHITECTURES. THE PROPOSED APPROACH CAN PROVIDE VARYING LEVELS OF PROTECTION, FROM FINE-GRAINED SINGLE-BIT UPSETS TO COARSER-GRAINED FAULTS, WITHIN THE CONSTRAINTS OF COMMODITY NON-ERROR-CORRECTING CODE DRAM STACKS.

......

Jaewoong Sim
Georgia Institute of Technology
Gabriel H. Loh
Vilas Sridharan
Mike O'Connor
Advanced Micro Devices


Integrating die-stacked DRAM to provide large amounts of in-package cache storage can improve performance and reduce energy consumption by avoiding costly off-chip accesses to conventional main memory.1-3 Die-stacking technology is just beginning to be commercialized, but it might be limited to certain niche and other market segments that can afford the higher costs of incorporating this new technology.
Many high-end markets require superior reliability, availability, and serviceability (RAS). Die-stacked memories may need to be more reliable than external memories because they are not serviceable. Compared with conventional dual-inline memory modules (DIMMs), which can be easily replaced, a die-stacked memory failure might require discarding the entire package, including the processor.
Traditionally, RAS for memory has been provided by using DIMMs enabled with error-correcting codes (ECCs), where each rank's memory chips are augmented with one or more additional chips to store the ECC or parity information needed to protect the data. Such ECC DIMMs can provide basic single-error correction, double-error detection (SECDED) capabilities, or more complex ECCs (such as Reed-Solomon codes) can be used to provide chip-kill protection that allows an entire memory chip to fail without compromising any data.
For die-stacked DRAM, it might not be practical to add extra chips to provide the additional capacity for storing ECC information. For example, if a DRAM stack has only four chips to begin with, it might not be physically practical to add another half chip to provide the extra storage (assuming 12.5 percent overhead for SECDED). Other complications arise in trying to extend a conventional ECC organization to stacked DRAM, such as storage and power efficiency and economic


feasibility, which we describe in our paper for the 40th International Symposium on Computer Architecture.4 In this article, we propose a series of modifications to a stacked-DRAM cache that provide both fine-grained protection (for example, against single-bit faults) and coarse-grained protection (for example, against row, bank, and channel faults) while using only commodity non-ECC stacked DRAM chips. Furthermore, these RAS capabilities can be selectively enabled to tailor the level of reliability to different market needs.

Die-stacked DRAM organization


Die-stacked DRAM consists of one or more layers of DRAM with a very wide data interface connecting the DRAM stack to whatever it is stacked with (such as a processor die). Whereas a conventional memory chip might provide only a four- or eight-bit data interface (the reason multiple chips are ganged together on a DIMM), a single layer of die-stacked memory can provide a much larger interface, such as 128 bits.5 Given this wider interface, all bits for a read request can be transferred across a single interface; thus, all bits are sourced from a single chip in the stack.
In theory, stacked DRAM could be organized to be more like a DIMM, in that each of the N chips across the layers of a stack provides 1/Nth of the bits. This organization is undesirable for several reasons. Requiring parallel access of N chips means activating banks on all chips. This reduces peak bank-level parallelism by a factor of N, which can reduce performance. Accessing all chips in parallel also requires switching N row and column decoders and associated muxes on each access, increasing both the power and the number of points of possible failure. Finally, timing skew between different bits coming from different layers for the same request might make the die-stacked I/O design more challenging.

DRAM failure modes


Conventional DRAM exhibits various failure modes, including single-bit, column, row, bank, and full-chip faults. These faults can affect one or more DRAM subarrays and can be permanent or transient. Recent field studies on double-data-rate 2 and 3 (DDR-2 and DDR-3) DRAM indicate that more than 50 percent of DRAM faults can be large multibit (row, column, bank, and so on) faults, and that DRAM device failure rates can be between 10 and 100 failures in time (FITs) per device, where 1 FIT is one failure per billion hours of operation.6
The internal organization of a die-stacked DRAM bank is similar to an external DRAM's; thus, failure modes that occur in external DRAM devices are likely to occur in die-stacked DRAM. Die-stacked DRAM could also experience other failure modes, such as broken through-silicon vias (TSVs) and accelerated failure rates from causes such as negative-bias temperature instability (NBTI)7 and electromigration due to elevated temperatures from being in a stack. Some of these new failure modes (such as broken TSVs) will manifest as a single failing bit per row, whereas others (such as electromigration) could cause multiple bits to fail simultaneously.
The cost of an irreparable DRAM failure in a die-stacked context can be significantly larger than for conventional DIMMs. In a conventional system, a failed DIMM could result in costly downtime, but the hardware replacement cost is limited to a single DIMM. For die-stacked memory, the failed memory cannot easily be removed from the package: the package would be destroyed by opening it, and the process of detaching the stacked memory would likely irreparably damage the processor as well. Therefore, the entire package (including the expensive processor silicon) would have to be replaced.
So, die-stacked DRAM RAS must provide robust detection and correction for all existing DRAM failure modes. It should be robust enough to handle potential new failure modes as well, and it likely will need to endure higher failure rates owing to the reduced serviceability of 3D-integrated packaging.

Objective and requirements


This article aims to provide a high and configurable level of reliability for die-stacked DRAM caches in a practical and cost-effective manner.


Figure 1. A DRAM bank with a 2-Kbyte row size (32 64-byte blocks). When used as a cache, the row can be organized as a single 29-way set-associative set, with three 64-byte blocks holding 29 tag entries and 29 blocks holding data (top), or as 28 individual direct-mapped sets, each holding a tag plus its data (bottom).

From a performance perspective, memory requests usually require data only from a single chip and channel (that is, bits are not spread across multiple layers of the stack). From a cost perspective, regular-width (non-ECC) DRAM chips are preferred. From a reliability perspective, we must account for single-bit faults as well as large multibit (row, column, and bank) faults.

Isolated fault types in DRAM caches


We begin with the underlying idea of how to support single-bit error correction and multibit error detection. Figure 1 shows an example DRAM bank, a row from the bank, and the row's contents. In some DRAM cache designs, tags and data are placed together in the same row to exploit row-buffer hits, which makes it relatively easy to reallocate storage between different organizations. For example, Figure 1 shows how the same physical 2-Kbyte row can be reorganized to provide either a set-associative cache organization (top) or a direct-mapped cache organization (bottom). Only the control logic that accesses the cache needs to change; the DRAM itself is oblivious to the specific cache organization.

Supporting single-bit error correction


This observation that data and tags are fungible leads us to a simple way to provide error correction for a DRAM cache. Figure 2a shows the components of a basic tag entry and a 64-byte cache block. This example tag entry consists of a 28-bit tag, a 4-bit coherence state, and an 8-bit sharer vector (used to track inclusion in eight cores); the example does not include replacement information because we assume a direct-mapped cache. We provide one SECDED ECC code to cover each tag entry, and a separate SECDED ECC code to cover the corresponding 64-byte data block. In general, a block of n bits requires an SEC code that is about log2(n) bits wide to support single-bit correction, plus a parity bit to provide double-bit error detection. The tag entry consists of 40 bits in total, thereby requiring a 7-bit SECDED code; the 512 data bits use an 11-bit code.
Placement of tags and data in a single 2-Kbyte row requires some reorganization to keep the blocks aligned. We pack the original tag entry, its ECC, and the ECC for the data block into a single combined tag entry. These elements, indicated by the dashed outline in Figure 2a, collectively add up to 58 bits. We store eight such tag entries in a 64-byte block, shown by the tag blocks (T) in Figure 2b. Following each tag block are the eight corresponding data blocks. Overall, a single 2-Kbyte row can store 28 (64-byte) blocks plus the tags.
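Those code widths follow from the standard Hamming SECDED bound; the authors do not spell out their exact code, so the following is only a consistency check on our part:

\[
2^{r} \ge n + r + 1:\qquad
n = 40 \Rightarrow r = 6 \ (2^{6} = 64 \ge 47),\qquad
n = 512 \Rightarrow r = 10 \ (2^{10} = 1024 \ge 523),
\]

so the SECDED codes are $6+1 = 7$ and $10+1 = 11$ bits wide, as stated. The combined tag entry is then $40 + 7 + 11 = 58$ bits, eight entries occupy 464 of a tag block's 512 bits, and a 2-Kbyte row holds four such tag blocks plus 28 data blocks (32 blocks in total), matching Figure 2b.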
Inclusion of ECCs requires slightly different timing for DRAM cache accesses. Figure 2c shows the timing and command sequence to read a 64-byte block of data from the cache. After the initial row activation, back-to-back read commands are sent to read both the tags and the data.8 The ECC check of the tag entry occurs first to correct any single-bit errors; the corrected tag is then checked for a tag hit.

Figure 2. Providing simple error correction for a DRAM cache. Contents of one tag entry and one 64-byte data block, along with single-error correction, double-error detection (SECDED) error-correcting codes (ECCs) for each (a). Contents of a 2-Kbyte DRAM row, with eight tag entries packed into a 64-byte block and the corresponding eight data blocks following (b). Timing diagram for reading a 64-byte cache block (c).

The 64-byte data block is read in parallel with the tag operations (speculatively assuming a hit), and the data's ECC check is pipelined out over the two 32-byte chunks. At the end of the ECC check (assuming a cache hit and no errors), the data can be returned to the processor. If the tag must be updated (for example, transitioning to a new coherence state), the tag entry and the corresponding ECC must be updated and then written back to the cache (marked with asterisks in the figure). The timing for a write is similar.

Supporting multibit error detection


In mission-critical environments, error correction is critical to maintain system uptime. However, many such environments also require robust error detection (even without hardware correction). It is bad to have a system crash after months of simulation time, but it's even worse for that system to suffer an undetected error and silently produce erroneous results. Therefore, some of these environments require more robust detection than that provided by SECDED ECC.
For such mission-critical systems, we replace the DED parity bit in each ECC with a very strong cyclic redundancy check (CRC). Although CRCs are traditionally used to detect bursts of errors, they can detect much broader classes of errors beyond those that occur in bursts. For example, a 16-bit CRC can detect all errors of up to 5 bits in 46-bit data (a 40-bit tag plus a 6-bit SEC code), and all errors of up to 3 bits in 265-bit data (256 data bits plus a 9-bit SEC code), regardless of whether these errors are clustered in a burst. Furthermore, these CRCs can detect all burst errors up to 16 bits in length, all errors with an odd number of erroneous bits, and most errors with an even number of erroneous bits.


Figure 3. Providing strong, multibit error detection for a DRAM cache. Contents of one tag entry and one 64-byte data block (treated as two 32-byte chunks for protection purposes), along with SEC ECC and cyclic redundancy check (CRC) codes (a). Contents of a 2-Kbyte DRAM row, with four tag entries (tag+SEC+CRC) packed into a 64-byte block and followed by the corresponding four data blocks (b). Timing diagram for reading a 64-byte cache block (c).


Although these CRCs do not increase the DRAM cache's error-correction capabilities, they greatly increase its error-detection capability, thereby drastically reducing silent data corruption (SDC) rates.
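Concretely, each protection region stores a 16-bit CRC computed over the region's data bits together with its SEC code. As an illustration only (the article does not name the polynomial, so the CCITT choice below is an assumption), a bitwise CRC-16 check looks like this:

```python
def crc16(data: bytes, poly: int = 0x1021, crc: int = 0xFFFF) -> int:
    """MSB-first bitwise CRC-16 over a byte string (CCITT polynomial assumed)."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


# Protect one 32-byte data chunk together with its SEC bits (toy values);
# on a read, a mismatch flags a multibit error that SEC alone would miss.
protected = bytes(32) + b"\x01\x23"
stored_crc = crc16(protected)
assert crc16(protected) == stored_crc
```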
Figure 3a shows the layout of the tag blocks including CRCs. Here, we use only SEC ECCs (not SECDED); the CRCs provide multibit error detection, so the parity bit for double-error detection is not needed. We divide the 64-byte data into two 32-byte protection regions, each covered by its own SEC and CRC codes, which allows up to two errors to be corrected if each error occurs in a separate 32-byte region.
The storage requirement for the original tag entry plus the SEC and CRC codes is 112 bits. Therefore, tag information for four cache blocks can be placed in a 64-byte block. Figure 3b shows the overall layout for a 2-Kbyte DRAM row, with a 64-byte tag block containing four tags (including SEC/CRC), followed by the four respective data blocks, and then repeated. The increased overhead reduces the total number of data blocks per 2-Kbyte row to 25.
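The 112-bit figure is just the sum of the field widths in Figure 3a (our own tally):

\[
\underbrace{(28 + 4 + 8) + 6 + 16}_{\text{tag entry} + \text{SEC} + \text{CRC} \,=\, 62}
\;+\;
\underbrace{2 \times (9 + 16)}_{\text{SEC} + \text{CRC for two 32-byte chunks} \,=\, 50}
\;=\; 112 \ \text{bits},
\]

so four blocks' worth of metadata (448 bits) fits in one 64-byte tag block, and a 2-Kbyte row holds seven tag blocks plus 25 data blocks (32 blocks in total), matching Figure 3b.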
Figure 3c shows the timing for reads, which is similar to the SECDED-only case from Figure 2, apart from a few minor differences. First, the tag block now contains both ECC and CRC information, so when the tag block must be updated, the final tag writeback is delayed by two extra cycles for the additional latency to compute the new CRC. Second, both the tag and data SEC ECC checks are followed by the corresponding CRC checks. We can return data to the CPU as soon as the ECC check finishes; that is, data can be sent back before the CRC check completes (or even starts). Even if a multibit error were detected, the hardware could do nothing directly to correct the situation. We assume the hardware simply raises an

exception and relies on higher-level software resiliency support (such as checkpoint restore) to handle recovery.

Coarse-grained failures in DRAM caches
DRAM failure modes are not limited to soft errors from corrupted bit-cells. Coarse-grained failures also occur with nontrivial frequency in real systems,6 affecting entire columns, rows, and banks. This section details an approach to dealing with coarse-grained DRAM cache failures.

Identifying coarse-grained failures
Before a failure can be handled, it must be detected. Here, we cover how failures are detected in different scenarios.
Row decoder failures. The failure of an entire row can occur because of a fault in the row decoder logic. If the row decoder has a fault in which the wrong wordline is asserted, the data from the wrong row will be sensed (see Figure 4a). The DRAM should have returned row 110001₂'s contents: the data Y and the ECC for Y, namely E(Y). In this case, however, the adjacent row's contents, X and E(X), are returned instead; because the data and ECC are self-consistent, the system is unable to detect that the wrong row has been accessed.
To detect this scenario, we fold the row index into the data. Figure 4b shows example data and ECC fields. We first compute the ECC on the data. Then, instead of storing the raw data, we store the exclusive-OR (XOR) of the data and several copies of the row index. When reading the data, we first XOR out the row index, which returns the original data; from here, we perform the ECC check.
If the row decoder has a fault, the wrong row will be read. For example, Figure 4c shows a case in which reading row 110001₂ returns the previous row, 110000₂, instead. We XOR out the row index for the row that we requested (that is, 110001₂), but this leaves the data in a state with multiple wrong bits with respect to the stored ECC. The multibit error is detected, and the system can then attempt to do something about it (for example, raise a fault).

E(X)
E(Y)

X
Y

Row 101110
Row 101111
Row 110000
accessed row
Row 110001 requested row
Row 110010

ECC

Identifying coarse-grained failures

No error!
(a)
Data
10111010101001001001010110110101
ECC
01001011

ECC
n copies of
row index

11 0000 11 0000 110000 110000 110000 110000 1100

Bitwise XOR

011110 011010 10 0010 10 0101 01 1101 10 0100 01 11


(b)

Requested row 110001, received row 110000 instead

0111100110101000101001010111011001000111
1100011100011100011100011100011100011100

n copies of
row index

Bitwise XOR
10111110101101001101010010110001 01011011
ECC
Multibit error!
(c)

Figure 4. Handling row decoder faults. Row decoder error that selects the
incorrect row, which is undetectable using within-row ECC (a). Process for
folding in the row index (row 1100002) (b), and usage of the folded row
index to detect a row-decoder error (c).


A similar approach was proposed in the Argus microarchitecture,9 but our usage covers more fault scenarios because of our CRCs.
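The folding step is a bitwise XOR with a tiled copy of the row index, applied after the ECC is computed and removed before it is checked. The sketch below is our own illustration of why an access that lands on the wrong row then fails its ECC/CRC check; the 512-bit width, 6-bit index, and helper names are assumptions.

```python
def index_mask(row_index, index_bits=6, data_bits=512):
    """Tile copies of the row index across the data width (as in Figure 4b)."""
    mask, copies = 0, (data_bits + index_bits - 1) // index_bits
    for _ in range(copies):
        mask = (mask << index_bits) | row_index
    return mask & ((1 << data_bits) - 1)


def write_row(raw_data, row_index, ecc):
    # The ECC (and CRC) protect the raw data, but the stored bits are the
    # raw data XORed with the tiled row index.
    return raw_data ^ index_mask(row_index), ecc(raw_data)


def read_row(stored, code, requested_index, ecc):
    # XOR out the *requested* row index. If the decoder actually returned a
    # different row, the recovered value differs from the protected data in
    # many positions, so the ECC/CRC check reports a multibit error.
    recovered = stored ^ index_mask(requested_index)
    return recovered, ecc(recovered) == code
```

With ecc standing in for whatever SEC/CRC code protects the row, requesting row 110001₂ but receiving row 110000₂ leaves the recovered value wrong in every position where the two tiled indexes differ, which the CRC catches with high probability; inverting alternate copies of the index (as described below for bank failures) additionally catches all-zero and all-one patterns.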
Column failures. An entire column could fail because of problems in the peripheral circuitry. For example, the bitline might not be precharged to the necessary reference voltage, or variations in transistor sizing might make the sense amplifier for a particular column unable to operate reliably. These faults could be permanent failures (for example, caused by manufacturing defects or wear) or intermittent failures (for example, temperature dependent). For a bank with a single column failure, reading any row from this bank will result in a corresponding single-bit error. This failure can be caught and corrected by the baseline ECC. If multiple columns have failed, the baseline CRC checksum can detect the failure in most cases.
Column failures could also occur if the column decoder selects the wrong column (for example, if the column index was incorrectly latched due to noise). Similar to hashing in the row index, we can easily extend the scheme to XOR in the column index. Before bits are read from the DRAM cache, both the row and column indexes are XORed out. An error in either case will cause a CRC mismatch with high probability.
Bank or channel failures. If an entire bank fails, reading any row from that bank will likely return random garbage bit values, all zeros, or all ones. For random bit values, the probability of the CRC fields being consistent with the data portion is very low, so this would manifest as an uncorrectable multibit error. For all zeros or all ones, instead of just XORing in multiple copies of the row index, some copies (for example, every other one) are bitwise inverted. Similar to a row-decoder failure, it is possible that the bank decoder fails and sends a request to the wrong bank; the row-index XOR scheme can be extended to include the bank index. The failure of an entire channel, or channel decoder faults, can be treated in a similar manner.

Duplicate on write


To tolerate multibit errors, row failures, and even bank failures, we use a duplicate-on-write (DOW) approach, which has some similarities to RAID-1 used for disk systems but does not incur the same storage overhead. RAID-1 duplicates every disk block (so the file system can tolerate the failure of a complete disk); therefore, a storage system must sacrifice 50 percent of its capacity to provide this level of protection (and 50 percent of its bandwidth for writes).
The key observation for the DRAM cache is that for unmodified data, it is sufficient to detect that an error occurred; the correct data can always be refetched from main memory. For dirty data, however, the modified copy in the cache is the only valid copy, so there is nowhere to turn if this sole copy gets corrupted beyond repair. This observation has been leveraged to optimize the protection levels of physically distinct caches (such as parity in the L1 instruction cache and SECDED ECC in the L1 data cache). We extend this concept to vary protection within the same shared, unified cache structure.
DOW stores a duplicate copy of data only when the data are modified. This way, the duplication overhead (capacity reduction) is limited to dirty cache blocks. Figure 5 shows a few example cache blocks; blocks A, B, C, and D are all clean, so the cache stores just one copy of each. If any of these blocks is corrupted beyond repair (for example, C), the clean copy in memory can provide the correct value. Blocks X and Y are dirty, so we create duplicate copies X' and Y' in other banks. If the rows (or entire banks) holding X or Y fail, we can still recover their values from X' or Y'.
In this example, we use a simple mapping function for placing the duplicate copy. For N banks, a cache line mapped to bank i has its duplicate placed in bank (i + N/2) mod N. To support channel-kill, the duplicate from channel j is instead mapped to channel (j + M/2) mod M, assuming M total DRAM cache channels. More sophisticated mappings could reduce pathological conflict cases.
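A minimal sketch of that placement rule, combining the bank and channel forms in one helper (our own illustration; the eight-bank, eight-channel values come from the Methodology section below, and applying both rules together is our simplification of the text's channel-kill variant):

```python
def duplicate_location(bank, channel, num_banks=8, num_channels=8):
    """Place a dirty block's duplicate 'half-way around' the banks and channels
    so that a single bank or channel failure cannot take out both copies."""
    dup_bank = (bank + num_banks // 2) % num_banks
    dup_channel = (channel + num_channels // 2) % num_channels
    return dup_bank, dup_channel


# A dirty block in bank 1 of channel 6 is duplicated in bank 5 of channel 2.
assert duplicate_location(1, 6) == (5, 2)
```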

Experimental results
The inclusion of additional error-detecting and -correcting codes, and of duplicates of modified blocks, reduces the DRAM cache's effective capacity. Checking the ECC, computing new codes, and writing extra duplicate blocks add latency, consume more bandwidth, and occupy banks for longer periods. These effects can have a detrimental impact on overall performance, which we assess in this section.


Figure 5. Example DRAM cache contents in which clean data are backed up by main memory, but dirty data are duplicated into other banks. If corrupted, blocks A through D can be refetched from main memory, whereas modified blocks X and Y rely on in-cache duplicates X' and Y' for error recovery.
Methodology
We use a cycle-level x86 simulator for our evaluations. We model a quad-core processor with two-level SRAM caches and an L3 DRAM cache. The DRAM cache has eight channels, each with 128-bit buses and eight banks, whereas the off-chip DRAM has two channels, each with eight banks and a 64-bit bus. We employ several memory-intensive multiprogrammed workloads (Table 3 in our ISCA paper4).
We assume DRAM failure modes and rates similar to those observed in real field measurements. We report results using both the observed failure rates (Table 4 in our ISCA paper4) and failure rates of 10× the observed rates, to account for potential increases in failures caused by die stacking and for inter-device variations in failure rates.

Fine-grained and coarse-grained protections
We first evaluate the performance impact of the proposed fine-grained protection schemes on DRAM caches. Figure 6 shows the speedup over no DRAM cache for the no-RAS, ECC, and ECC+CRC configurations. The results show that the performance impact of our proposed schemes is small. On average, the ECC and ECC+CRC schemes degrade performance by only 0.50 percent and 1.68 percent, respectively, compared to a DRAM cache with no RAS support. Unlike ECC, ECC+CRC reduces the number of data blocks per row compared to no RAS, and it also slightly increases bandwidth consumption.
Figure 7 shows the performance of our coarse-grained protection scheme (DOW) when applied on top of ECC+CRC. We compare the results with ECC+CRC and with ECC+CRC+RAID-1 to show the effectiveness of DOW. With the RAID-1-style approach (that is, full duplication), not only is cache capacity reduced by half, but effective write bandwidth is reduced by half as well, leading to a non-negligible performance degradation of as much as 13.1 percent compared to ECC+CRC (6.5 percent on average). The DOW scheme, however, retains much of the overall performance benefit of having a DRAM cache (costing, on average, only 2.5 percent and 0.8 percent compared to no RAS and ECC+CRC, respectively) while providing substantial RAS improvements.

Figure 6. Performance comparison among the no-RAS, ECC, and ECC+CRC configurations (normalized to no DRAM cache). The additional levels of resiliency provided by ECC and ECC+CRC come at only a modest cost in overall performance.


Figure 7. Performance comparison between the fine-grained (ECC+CRC) and coarse-grained (ECC+CRC+RAID-1 and ECC+CRC+DOW) schemes (normalized to no DRAM cache). The overhead of duplicating every cache block (RAID-1) causes significant performance reductions, while the more selective approach of DOW recovers most of the performance loss.

Table 1. Detection coverage for each technique.

                       Detection (%)
Failure mode    No RAS    ECC-only    ECC+CRC    DOW
Single bit      0         100         100        100
Column          0         85          99.9993    99.9993
Row             0         50          99.9993    99.9993
Bank            0         50          99.9993    99.9993

Table 2. Correction coverage for each technique. Cases where correction coverage differs from detection coverage (Table 1) are marked with an asterisk.

                       Correction (%)
Failure mode    No RAS    ECC-only    ECC+CRC    DOW
Single bit      0         100         100        100
Column          0         85          85*        99.9993
Row             0         0*          0*         99.9993
Bank            0         0*          0*         99.9993

Table 3. Results using observed and 10× DRAM failure rates.

Failure rates                                No RAS          ECC-only     ECC+CRC             DOW
Silent data corruption (SDC)
  failures in time (FITs)                    234 to 2,335    41 to 410    0.0008 to 0.0075    0.0008 to 0.0075
Detectable unrecoverable error (DUE) FITs    0               37 to 368    52 to 518           0

Fault coverage and failure rates

Table 1 shows the percentage of faults detected in each failure mode by each of our schemes, assuming a single four-layer stack of DRAM. ECC-only detects all single-bit faults and most column faults, because most of these faults affect only one bit per row. ECC-only also detects 50 percent of all row and bank faults, which look to the ECC like double-bit errors. ECC+CRC substantially improves the detection coverage of column, row, and bank faults, detecting 99.9993 percent of all faults. The detection percentage of the CRC depends on the fault model.

Table 2 shows the fraction of faults, by failure mode, that our schemes can correct. ECC-only and ECC+CRC correct all single-bit faults and 85 percent of column faults, but they cannot correct any row or bank faults. DOW, on the other hand, corrects all detected faults.

Table 3 shows the overall SDC and detectable unrecoverable error (DUE) FIT rates for our techniques, using both observed and 10× DRAM failure rates. We assume that all undetected failures will cause a silent data corruption. No RAS leads to an SDC FIT of 234 to 2,335, or an SDC mean time to failure (MTTF) of 49 to 488 years. ECC-only reduces the SDC FIT by 5.7× but increases the DUE FIT to 37 to 368. ECC+CRC reduces the SDC FIT to just 0.0008 to 0.0075, but this benefit comes at the expense of an increase in DUE FIT to 52 to 518 (220 to 2,200 years MTTF). Finally, DOW adds the ability to correct all detected errors while maintaining the same SDC FIT as ECC+CRC. Overall, DOW provides a more than 54,000× improvement in SDC MTTF compared to the ECC-only configuration.

The bandwidth requirements of future HPC systems will likely compel the use of significant amounts of die-stacked DRAM. The impact of DRAM failures on these systems is significantly worse than for single-socket systems because FIT rates are additive


across nodes. For example, using the baseline (1×) FIT rates from Table 4 in our ISCA paper,4 a 100,000-node HPC system with four DRAM stacks per node would have an SDC MTTF of only 10 hours from the die-stacked DRAM alone with no RAS support. Our ECC-only technique would increase the SDC MTTF to only 60 hours. By contrast, ECC+CRC and DOW have an SDC MTTF of 350 years for the entire 100,000-node system. Inclusion of DOW is likely necessary, because ECC+CRC (without DOW) has a 48-hour DUE MTTF on such a system. Although software techniques might handle DUEs, the performance overheads of restoring the system to a checkpoint every two days might not be acceptable. This analysis optimistically assumes the baseline DRAM FIT rates, which are likely lower than what will be observed with die-stacked DRAM (for example, if we assume a 10× FIT rate, then without DOW the system would have to roll back to a checkpoint about every five hours).
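To make the scaling argument concrete, the short sketch below redoes this arithmetic from the per-stack FIT values in Table 3 (observed rates). FIT counts failures per billion device-hours and adds across independent devices, so the node count and stacks-per-node figures from the example above drive the result; the helper names and structure are ours, not part of the paper.

```python
# Back-of-the-envelope check of the system-level MTTF figures quoted above.
# FIT = failures per 10^9 device-hours; FIT rates add across devices, so the
# system FIT is the per-stack FIT times the number of stacks in the system.
# Per-stack SDC/DUE FIT values are the observed-rate entries from Table 3.

NODES = 100_000
STACKS_PER_NODE = 4
HOURS_PER_YEAR = 24 * 365

def system_mttf_hours(fit_per_stack, nodes=NODES, stacks=STACKS_PER_NODE):
    """MTTF in hours for a system whose DRAM stacks fail independently."""
    system_fit = fit_per_stack * nodes * stacks   # additive FIT rates
    return 1e9 / system_fit                       # hours per failure

print(system_mttf_hours(234))                         # no RAS, SDC: ~10.7 hours
print(system_mttf_hours(41))                          # ECC-only, SDC: ~61 hours
print(system_mttf_hours(0.0008) / HOURS_PER_YEAR)     # ECC+CRC/DOW, SDC: ~357 years
print(system_mttf_hours(52))                          # ECC+CRC, DUE: ~48 hours (checkpoint interval)
```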

Discussion

As we discussed earlier, providing resilience to stacked DRAM will become an important problem because DRAM stacks are not directly serviceable, making the tolerance of soft and hard errors critical for large-scale systems. Our proposal will be valuable for two primary reasons.

First, our proposal provides the benefit for memory vendors of having to support only a single (non-ECC) DRAM chip design. A key to conventional ECC DIMMs is that the same silicon design can be deployed for both ECC and non-ECC DIMMs. Forcing memory vendors to support two distinct silicon designs (one ECC, one non-ECC) greatly increases their engineering efforts and costs and complicates their inventory management.

Second, at the implementation level, the DRAM cache is just a commodity memory component with no knowledge of how the data storage is being used. Our proposal enables system designers (OEMs) to take a single processor (with stacked DRAM) design but configure different levels of protection for different deployments of the design. In a commodity consumer system, a designer might choose to turn off ECC entirely and make use of the full DRAM cache capacity. In servers and certain embedded systems, basic SECDED ECC might be sufficient, which would be comparable to the level of protection used today (for example, SECDED ECC DIMMs). In mission-critical enterprise servers and HPC supercomputers, the full ECC+CRC and DOW protections could be used. The selection of which level of protection to use could be configured simply via a BIOS setting read during system boot.
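As a loose illustration of that boot-time configurability, the sketch below maps a hypothetical BIOS setting onto the protection mechanisms discussed above; the enum names and the mapping are ours, not the paper's implementation.

```python
# Hypothetical sketch of boot-time RAS configuration for the DRAM cache.
# The level names mirror the configurations discussed above; the mapping from
# level to enabled mechanisms is illustrative only.
from enum import Enum

class RasLevel(Enum):
    NO_RAS = 0        # consumer parts: full cache capacity, no protection
    ECC = 1           # SECDED ECC only (server-class protection)
    ECC_CRC = 2       # ECC for correction plus CRC for strong detection
    ECC_CRC_DOW = 3   # adds duplicate-on-write recovery for dirty blocks

def configure_dram_cache(bios_setting: int) -> dict:
    level = RasLevel(bios_setting)          # read once at system boot
    return {
        "ecc_enabled": level.value >= RasLevel.ECC.value,
        "crc_enabled": level.value >= RasLevel.ECC_CRC.value,
        "duplicate_dirty_blocks": level is RasLevel.ECC_CRC_DOW,
    }

print(configure_dram_cache(3))  # e.g., an HPC deployment using full ECC+CRC+DOW
```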

An important characteristic of our work is that it does not represent a single monolithic proposal with limited applicability but rather provides a study that can be applied across a span of designs. Our proposal is broadly applicable and configurable; various reliability levels are targeted, and all work within the constraints of commodity, non-ECC DRAM stacks. As such, we believe that the potential impact and applicability of this research can be strong in industry.

In our configurable RAS approach, protection levels conceivably could be adjusted dynamically. Critical memory resources (such as operating system data structures) might receive a high level of protection, whereas low-priority user applications might receive no or minimal protection. The enabled protection level can also be increased gradually as a product slowly suffers hardware failures from long-term wear.7 Investigating the impacts of these dynamic RAS mechanisms could be an interesting future research direction. MICRO

References

1. X. Jiang et al., "CHOP: Adaptive Filter-Based DRAM Caching for CMP Server Platforms," Proc. IEEE 16th Int'l Symp. High Performance Computer Architecture (HPCA 10), 2010, doi:10.1109/HPCA.2010.5416642.

2. J. Sim et al., "A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch," Proc. 45th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2012, pp. 247-257.

3. G.H. Loh and M.D. Hill, "Efficiently Enabling Conventional Block Sizes for Very Large Die-Stacked DRAM Caches," Proc. 44th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2011, pp. 454-464.


4. J. Sim et al., "Resilient Die-Stacked DRAM Caches," Proc. 40th Ann. Int'l Symp. Computer Architecture, 2013, pp. 416-427.

5. J.-S. Kim et al., "A 1.2V 12.8GB/s 2Gb Mobile Wide-I/O DRAM with 4x128 I/Os Using TSV-Based Stacking," Proc. IEEE Int'l Solid-State Circuits Conf., 2011, pp. 496-498.

6. V. Sridharan et al., "Feng Shui of Supercomputer Memory: Positional Effects in DRAM and SRAM Faults," Proc. Int'l Conf. High Performance Computing, Networking, Storage and Analysis (SC 13), 2013, article no. 22.

7. D.K. Schroder and J.A. Babcock, "Negative Bias Temperature Instability: Road to Cross in Deep Submicron Silicon Semiconductor Manufacturing," J. Applied Physics, vol. 94, no. 1, 2003, doi:10.1063/1.1567461.

8. M.K. Qureshi and G.H. Loh, "Fundamental Latency Trade-off in Architecting DRAM Caches," Proc. 45th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2012, pp. 235-246.

9. A. Meixner, M.E. Bauer, and D. Sorin, "Argus: Low-Cost, Comprehensive Error Detection in Simple Cores," Proc. 40th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2007, pp. 210-222.

Jaewoong Sim is a PhD candidate in the


Department of Electrical and Computer
Engineering at the Georgia Institute of Technology. His research focuses on the interactions between microarchitecture, compiler,
and operating systems to efficiently enable
emerging technologies. Sim has a BS in electrical engineering from Seoul National University. He is a student member of IEEE and
the ACM.

Gabriel H. Loh is a fellow design engineer


at Advanced Micro Devices. His research
interests include computer architecture, processor microarchitecture, emerging technologies, and 3D die stacking. Loh has a PhD in
computer science from Yale University. He is
a senior member of IEEE and the ACM.

Vilas Sridharan is an RAS architect at


Advanced Micro Devices. His research
interests include the modeling of hardware
faults and architectural and microarchitectural approaches to fault tolerance in highperformance microprocessors. Sridharan has
a PhD from Northeastern University. He is
a member of IEEE.

Mike O'Connor is a senior research scientist at NVIDIA. He performed the work for this article at Advanced Micro Devices. His research interests include GPUs, heterogeneous processors, and memory systems. O'Connor has an MS in electrical and computer engineering from the University of Texas at Austin. He is a senior member of IEEE and a member of the ACM.

Direct questions and comments about this article to Jaewoong Sim, 266 Ferst Drive (Klaus Advanced Computing Building), Atlanta, GA 30332; jaewoong.sim@gatech.edu.


..................................................................................................................................................................................................................

DECOUPLED COMPRESSED CACHE:


EXPLOITING SPATIAL LOCALITY FOR
ENERGY OPTIMIZATION

..................................................................................................................................................................................................................

THE AUTHORS PROPOSE DECOUPLED COMPRESSED CACHE (DCC) TO IMPROVE PERFORMANCE AND ENERGY EFFICIENCY OF CACHE COMPRESSION. DCC USES DECOUPLED SUPERBLOCKS AND NONCONTIGUOUS SUB-BLOCK ALLOCATION TO DECREASE TAG OVERHEAD AND INTERNAL FRAGMENTATION AND TO ELIMINATE THE NEED FOR ENERGY-EXPENSIVE RECOMPACTION CAUSED BY CHANGES IN COMPRESSED BLOCK SIZE. THE AUTHORS ALSO DEMONSTRATE A PRACTICAL DESIGN BASED ON A RECENT COMMERCIAL LAST-LEVEL CACHE DESIGN.

Caches, especially last-level caches (LLCs), have long been used to reduce effective memory latency and increase effective bandwidth. They also play an increasingly important role in reducing memory system energy. Increasing LLC size can improve performance for most workloads, but the improvement comes at significant area cost. Cache compression, however, seeks to increase effective cache size (by compressing and compacting cache blocks) while incurring small area overheads.1,2 Unfortunately, previous designs limit compression benefits because of internal fragmentation, limited tags, and energy-expensive recompaction that occurs when a block's size changes.

We propose decoupled compressed cache (DCC), a technique that uses decoupled superblocks (also known as sectors3) to increase the maximum effective capacity to four times the uncompressed capacity, while using area overhead comparable to previous cache-compression techniques. DCC uses superblocks (four aligned, contiguous cache blocks that share a single address tag) to reduce tag overhead. Each 64-byte block in a superblock is compressed separately and then compacted into zero to four 16-byte sub-blocks (or segments). DCC decouples the address tags by increasing the number of tags and allowing any sub-block in a set to map to any tag in that set to reduce fragmentation within a superblock.3

Decoupling allows sub-blocks of a block to be noncontiguous, thereby eliminating the recompaction overheads of previous variable-size compressed caches.2 An optimized co-compacted DCC (Co-DCC) design further reduces internal fragmentation (and increases effective capacity) by allocating the compressed blocks from a superblock into the same set of data sub-blocks.

Somayeh Sardashti
David A. Wood
University of
Wisconsin-Madison



Figure 1. Normalized effective capacity of different compressed cache designs. Although an ideal compressed cache (Ideal) has the potential to significantly increase effective cache size, previous designs (such as VSC-2X) reduce compression effectiveness because of limited tags and internal fragmentation.

This article makes the following contributions:

DCC uses decoupled superblocks to increase the effective number of tags with low overhead.

DCC stores compressed data in noncontiguous sub-blocks to eliminate recompaction overheads when a block's compressed size changes.

DCC provides more effective capacity, on average, than a conventional cache of twice the size, while slightly increasing cache area. Viewed another way, DCC allows a designer to get approximately the same cache performance with about half the area.

Co-DCC further reduces internal fragmentation by compacting the blocks of a superblock and allocating them into the same set of data sub-blocks.
In this article, we also present a concrete


design for Co-DCC and show how it can be
integrated into a recent commercial LLC
design with little additional complexity.

Potential and limits of compressed caching

Although some data (and most instructions) are difficult to compress, most workloads are highly compressible. In this article, we use C-PACK+Z, a dictionary-based algorithm4 with nine-cycle decompression latency. C-PACK+Z achieves an average compression ratio (that is, the original size over compressed size) of 3.9. Thus, compression has the potential to nearly quadruple cache size (shown as Ideal in Figure 1).

Previous compressed cache designs fail to achieve this potential for three main reasons. First, caches must compact compressed blocks into sets, which introduces an internal fragmentation problem. In Figure 1, BytePack represents an idealized compressed cache with infinite tags, which compacts compressed blocks on arbitrary byte boundaries. BytePack degrades normalized effective capacity to 3.1 on average. Second, practical compressed caches introduce another internal fragmentation problem by compacting compressed blocks into one or more sub-blocks, rather than storing compressed data on arbitrary byte boundaries.2 Variable-size compression (VSC) techniques relax the mapping constraint between tags and data and compact compressed blocks into a variable number of contiguous sub-blocks.2 The column labeled VSC-Inf in Figure 1 illustrates that compacting compressed blocks into zero to four 16-byte sub-blocks (but with infinite tags per set) degrades normalized effective capacity from 3.1 to 2.6 on average. Third, practical compressed caches have a fixed number of tags per set. The remaining columns in Figure 1 illustrate that reducing the number of tags, from infinite to a more practical two times the baseline, degrades the average normalized effective capacity from 2.6 to 1.7. Furthermore, VSC is not energy efficient. It must repack the sub-blocks in a set whenever a block's size changes to make contiguous free space. This action can increase LLC dynamic energy by a factor of nearly three, on average.
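To see how sub-block granularity and a limited tag count erode the ideal compression ratio, here is a small, purely illustrative calculation; the 64-byte block and 16-byte sub-block sizes follow the article, while the compressed sizes and the tag cap are made-up inputs.

```python
import math

BLOCK = 64   # uncompressed cache block size (bytes)
SUB = 16     # sub-block (segment) size (bytes)

def norm_effective_capacity(compressed_sizes, max_blocks_per_block_of_space=None):
    """Blocks that fit per 'uncompressed block worth' of data space when each
    compressed block is rounded up to whole sub-blocks; optionally capped to
    model a limited number of tags (e.g., 2x tags as in VSC-2X)."""
    quantized = sum(SUB * math.ceil(s / SUB) for s in compressed_sizes)
    capacity = len(compressed_sizes) * BLOCK / quantized
    if max_blocks_per_block_of_space is not None:
        capacity = min(capacity, max_blocks_per_block_of_space)
    return capacity

sizes = [8, 12, 16, 10, 60, 14]              # made-up compressed sizes (bytes)
print(BLOCK * len(sizes) / sum(sizes))       # ideal byte-granularity packing
print(norm_effective_capacity(sizes))        # sub-block quantization only
print(norm_effective_capacity(sizes, 2.0))   # plus a 2x tag limit
```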

Decoupled compressed cache overview


DCC uses decoupled superblock tags (see the "Exploiting Spatial Locality" sidebar for


Exploiting Spatial Locality

Superblocks (also known as sectors) have long exploited coarse-grained spatial locality to reduce tag overhead. Superblocks associate one address tag with multiple cache blocks, replicating only the per-block metadata, such as the coherence state. Figure A1 shows one set of a four-way associative sectored cache (SC) with four-block superblocks. Using four-block superblocks reduces the tag area by 70 percent compared with a conventional cache. However, Figure A1 illustrates that singletons, pairs, and trios (such as superblocks D, C, and A, respectively) result in internal fragmentation, which can lead to significantly higher miss rates.

Seznec showed that decoupling superblock tags from data blocks helps reduce internal fragmentation.1 Decoupled sectored (or superblock) caches (DSC) increase the number of superblock tags per set and use per-block back pointers to identify the corresponding tag. Figure A2 illustrates how decoupling can reduce fragmentation by letting two singletons (that is, blocks F1 and G3) share the same superblock. DSC uses more tag space than SC but less than a conventional cache because back pointers are small.

Figure A. Sectored cache (1) and decoupled sectored cache (2). DSC reduces internal fragmentation and can fit more blocks in the cache (Block E to Block H).

Reference

1. A. Seznec, "Decoupled Sectored Caches: Conciliating Low Tag Implementation Cost and Low Miss Ratio," Proc. 21st Ann. Int'l Symp. Computer Architecture, 1994, pp. 384-393.

more information) to improve cache compression in two ways. First, superblocks reduce tag overhead, permitting more tags per set for comparable overhead. Second, decoupling tags and data reduces internal fragmentation and, importantly, eliminates recompaction when the size of a compressed block changes.

Figure 2 shows how DCC exploits superblocks and manages the cache at three granularities: coarse-grained superblocks, singular cache blocks, and fine-grained sub-blocks. DCC tracks superblocks, which are groups of aligned, contiguous cache blocks (Figure 2d), while it compresses and stores each cache block as a variable number of sub-blocks. Figure 2a shows the key components of DCC for a small, two-way set-associative cache with four-block superblocks, 64-byte blocks, and 16-byte sub-blocks. DCC consists of a tag array, a sub-blocked back pointer array, and a sub-blocked data array. DCC is indexed using the superblock address bits (Set Index in Figure 2e).

DCC tracks superblocks to fit more compressed blocks into the cache while limiting tag area overhead. DCC explicitly tracks superblocks using a largely conventional superblock tag array. Each tag entry (Figure 2b) consists of one tag per superblock and per-block coherence (C-state) and compression (Comp) states. Because blocks of a superblock share an address tag, the tag array can map more blocks than a same-size conventional cache without incurring high area overhead. DCC holds as many superblock tags as the maximum number of uncompressed blocks that can be stored. For example, Figure 2a shows a two-way associative cache with four-block superblocks. Each set in the tag array can map eight blocks (that is, 2 superblocks × 4 blocks/superblock), while a maximum of two uncompressed blocks can fit in each set. In the worst-case scenario, when there is no spatial locality (that is, all singletons) or when cached data is uncompressible, DCC can still utilize all the cache data space, for example, by tracking two singletons per set.

Although DCC tracks blocks of a superblock using one tag entry, it allocates or evicts the blocks to and from the data array separately. The data array is a mostly conventional cache data array, organized into sub-blocks.


Figure 2. A decoupled compressed cache. DCC cache layout (a); one tag entry (b); one back pointer entry (BPE) (c); address space (d); address (e); and DCC lookup process (f). DCC exploits superblocks and manages the cache at multiple granularities: coarse-grained superblocks, singular cache blocks, and fine-grained sub-blocks.


DCC compacts compressed blocks into a variable number of noncontiguous sub-blocks in the sub-blocked data array. Figure 2a shows block A0 compressed into two sub-blocks (A0.1 and A0.0), which are stored in sub-blocks 5 and 1 in the data array. DCC decouples sub-blocks from the superblock tag using a back pointer array as a level of indirection. Each back pointer entry corresponds to one sub-block in the data array and identifies the owner block (Figure 2c). The back pointer array slightly increases LLC area (see the "Compressed Cache Overheads" sidebar); however, it enables low-overhead, variable-size compression. DCC's decoupled design allows a block's sub-blocks to be noncontiguous, thus eliminating the need for recompaction when a block's size changes.
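To make the tag/back-pointer split concrete, the sketch below models the metadata of one DCC set; the fields follow the description above and Figures 2b and 2c (four-block superblocks, per-block coherence and compression state, one back pointer entry per 16-byte data sub-block), while the class names and representation are purely illustrative, not the authors' design files.

```python
# Illustrative model of the DCC metadata for one cache set (not the authors' RTL).
from dataclasses import dataclass, field
from typing import List, Optional

BLOCKS_PER_SUPERBLOCK = 4   # aligned, contiguous 64-byte blocks sharing one tag

@dataclass
class SuperblockTag:        # Figure 2b: one address tag plus per-block state
    tag: Optional[int] = None
    c_state: List[str] = field(default_factory=lambda: ["I"] * BLOCKS_PER_SUPERBLOCK)
    compressed: List[bool] = field(default_factory=lambda: [False] * BLOCKS_PER_SUPERBLOCK)

@dataclass
class BackPointerEntry:     # Figure 2c: which (tag, block) owns this data sub-block
    valid: bool = False
    tag_id: int = 0         # index of the owning superblock tag within the set
    blk_id: int = 0         # block number within that superblock (0..3)

@dataclass
class DccSet:
    tags: List[SuperblockTag]          # more tags than uncompressed blocks can fit
    back_ptrs: List[BackPointerEntry]  # one entry per 16-byte data sub-block

    def sub_blocks_of(self, tag_id: int, blk_id: int) -> List[int]:
        """Data sub-blocks owned by one block; they need not be contiguous."""
        return [i for i, bp in enumerate(self.back_ptrs)
                if bp.valid and bp.tag_id == tag_id and bp.blk_id == blk_id]

s = DccSet(tags=[SuperblockTag(tag=0x1A)],
           back_ptrs=[BackPointerEntry(True, 0, 2), BackPointerEntry(), BackPointerEntry(True, 0, 2)])
print(s.sub_blocks_of(0, 2))   # -> [0, 2]: block 2's data sits in sub-blocks 0 and 2
```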
Co-DCC optimizes DCC to further reduce internal fragmentation. Co-DCC treats
blocks from the same superblock as one large
block and dynamically allocates them into
the same set of data sub-blocks, thereby reducing internal fragmentation within sub-block


Compressed Cache Overheads

Compressed caches can increase cache area because of their extra metadata. Table A shows the quantitative area overheads of decoupled compressed cache (DCC), co-compacted DCC (Co-DCC), and previous work (FixedC and VSC-2X) over a same-size conventional cache (16-way associative, 8-Mbyte last-level cache [LLC]). DCC tracks four-block superblocks and almost doubles the per-block metadata, largely due to the back pointers. However, because the data array is much larger than the tag array, Cacti calculates the overall LLC area overhead as about 6 percent. DCC's area overhead is similar to that of FixedC and VSC-2X, which track twice as many tags per set (for example, 32 tags per 16 blocks). Co-DCC further increases the metadata stored per block, resulting in a 16 percent area overhead compared to the baseline.

Table A also includes the area overhead of (de-)compression units. Because C-PACK+Z's decompressors produce 8 bytes per cycle, we match the cache bandwidth by considering two decompressors per cache bank. Compression is not on the critical path, so we consider one compressor per bank. Thus, for an LLC with eight banks, we need eight compressors and 16 decompressors, resulting in an extra 1.8 percent area overhead.

boundaries. Co-DCC increases overhead and


complexity in exchange for better cache
compression.
Figure 2f illustrates the DCC lookup procedure. On a cache lookup, the tag array and the back pointer array are accessed in parallel. In the common case of a cache hit, both the block and its corresponding superblock are found available (that is, the tag matches and the block is valid). In the event of a cache hit, the result of the tag array and back pointer array lookups determines which sub-blocks of the data array belong to the accessing block. On a read, the corresponding sub-blocks are then read out of the data array and decompressed. On a write, the new compressed size might be larger, resulting in a block (or a superblock) eviction if sufficient space is not available. On the other hand, in the case of a cache miss, DCC allocates the compressed block in the data array. If its superblock is not available, DCC first allocates it in the tag array.
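Continuing in the same illustrative vein, the sketch below walks through the hit path of Figure 2f and a size-changing write: the tag match plus back-pointer match pick out the (possibly noncontiguous) sub-blocks, and a rewrite simply frees and reclaims sub-blocks rather than recompacting the set. The dictionary layout is a stand-in, not the hardware organization.

```python
# Sketch of the DCC hit path and a recompaction-free write; structures are
# illustrative Python, not the authors' implementation.

def lookup(dcc_set, sb_tag, blk_id):
    """Return (tag_index, [sub-block indices]) on a hit, else (None, [])."""
    for tag_idx, tag in enumerate(dcc_set["tags"]):
        if tag["addr"] == sb_tag and tag["valid"][blk_id]:
            subs = [i for i, bp in enumerate(dcc_set["bptrs"])
                    if bp == (tag_idx, blk_id)]          # decoupled mapping
            return tag_idx, sorted(subs)
    return None, []

def write_block(dcc_set, tag_idx, blk_id, n_sub_blocks):
    """Store a (re)compressed block using any free sub-blocks in the set."""
    bptrs = dcc_set["bptrs"]
    for i, bp in enumerate(bptrs):                       # free the old sub-blocks
        if bp == (tag_idx, blk_id):
            bptrs[i] = None
    free = [i for i, bp in enumerate(bptrs) if bp is None]
    if len(free) < n_sub_blocks:
        raise RuntimeError("evict a block or superblock to make room")
    for i in free[:n_sub_blocks]:                        # noncontiguous is fine
        bptrs[i] = (tag_idx, blk_id)

# One two-superblock set with eight 16-byte data sub-blocks:
dcc_set = {"tags": [{"addr": 0x1A, "valid": [True, False, True, False]},
                    {"addr": 0x2B, "valid": [False, True, False, False]}],
           "bptrs": [(0, 0), None, (0, 2), (0, 0), None, (1, 1), None, None]}

print(lookup(dcc_set, 0x1A, 0))     # -> (0, [0, 3]): block held in sub-blocks 0 and 3
write_block(dcc_set, 0, 0, 3)       # the block grows to three sub-blocks after a write
print(lookup(dcc_set, 0x1A, 0))     # -> (0, [0, 1, 3]): no repacking of other blocks
```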

A practical design for DCC


DCC can be integrated into the LLC of a
recent commercial design with relatively little

Compressed caches can also increase LLC per-access dynamic


power and LLC static power because of their extra metadata. DCC,
similar to FixedC and VSC-2X, increases LLC per-access dynamic
power by 2 percent and LLC static power by 6 percent. Co-DCC also
incurs a 6-percent overhead on LLC per-access dynamic power and a
16-percent LLC static power overhead. We model these overheads as
well as the power overheads of (de-)compression in detail in this
work.
Table A. Last-level cache (LLC) area overheads of different compressed caches.

Components             DCC (%)    Co-DCC (%)    FixedC/VSC-2X (%)
Tag array              2.1        11.3          6.3
Back pointer array     4.4        5.4           0
Compressors            0.6        0.6           0.6
Decompressors          1.2        1.2           1.2
Total area overhead    8.3        18.5          8.1

additional complexity and, more importantly, no need for an alignment network. The AMD Bulldozer processor implements an 8-Mbyte LLC that is broken into four 2-Mbyte subcaches. Each subcache consists of four banks that can independently service cache accesses.5 Figure 3 illustrates the data array of one bank in the LLC and shows how it is divided into four sequential regions (SRs). Each sequential region runs one phase (that is, half a cycle) behind the previous region and contains a quarter of a cache block (that is, 16 bytes). Figure 3 shows how block A0's four 16-byte sub-blocks (A0.0 to A0.3) are distributed to the same row in each sequential region. Each subsequent sequential region receives the address half a cycle later and takes half a cycle longer to return the data. Thus, a 64-byte block is returned in a burst of four cycles on the same data bus. For example, A0.1 is returned one cycle after A0.0 in Figure 4a.

DCC requires only a small change to the data array to allow noncontiguous sub-blocks. In Figure 3, block B1 is compressed into two sub-blocks (B1.0 and B1.1), which are stored in sequential regions 1 and 2, but


Figure 3. DCC data array organization. The data array is divided into four sequential regions, each containing a sub-block of a cache block. In the example, A0 is uncompressed, while B1 is compressed to two sub-blocks.


not in the same row. To select the correct sub-block, DCC must send additional address lines (4 bits for a 16-way associative cache) to each sequential region (illustrated by the dotted lines in Figure 3). DCC must also enforce the constraint that a compressed block's sub-blocks are allocated to different sequential regions to prevent sequential-region conflicts.
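A toy version of that allocation constraint is sketched below: each sub-block of a compressed block must be placed in a distinct sequential region. The free-slot bookkeeping is invented for the example and is not how the hardware tracks free space.

```python
# Toy placement honoring the sequential-region (SR) constraint described above:
# the sub-blocks of one compressed block must land in distinct SRs so they can
# stream out without region conflicts. Free-slot lists are invented inputs.
N_SR = 4

def place_sub_blocks(free_slots_per_sr, n_sub_blocks):
    """Return [(sr, slot), ...] using a different SR per sub-block, or None."""
    placement = []
    for sr in range(N_SR):
        if len(placement) == n_sub_blocks:
            break
        if free_slots_per_sr[sr]:                  # any free 16-byte slot in this SR?
            placement.append((sr, free_slots_per_sr[sr][0]))
    return placement if len(placement) == n_sub_blocks else None

print(place_sub_blocks([[3], [], [7, 9], [2]], 2))   # -> [(0, 3), (2, 7)]
print(place_sub_blocks([[3], [], [], []], 2))        # -> None: must evict to make room
```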
Figure 4b illustrates DCC timing when reading block B1. The back pointer array is accessed in parallel with the tag array. The sub-block selection logic then finds the back pointer entries corresponding to this block using its block ID (derived from its address) and the matched tag ID, which is found by the tag match logic. The sub-block selection logic can only be partially overlapped with the tag match logic because it needs the matched tag ID. To calculate the latency overhead of the sub-block selection, we implemented the tag match and the sub-block selection logic in Verilog, synthesized in 45 nm, and scaled to 32 nm.6 The sub-block selection logic adds less than half a cycle to the critical path, which we conservatively assume increases the access latency by one cycle.

Figure 4b shows how the matching sub-blocks are returned and fed directly into the decompression logic, which accepts 16 bytes per cycle and has a small first-in, first-out (FIFO) buffer to rate match. Decompression starts as soon as the first sub-block arrives (for example, B1.0), depending on which sequential region it resides in. Because sub-block B1.0 resides in sequential region 1, there is one extra cycle (the worst case is three cycles). Note that because the decompression latency is deterministic (nine cycles), DCC can determine at the end of sub-block selection when the data will be ready and whether the decompression hardware can be bypassed. Thus, even though completion times vary, DCC has ample time to arbitrate for the response network.

Evaluation

We evaluate DCC using a full-system simulator based on GEMS.7 We model a multicore system with eight out-of-order cores; per-core private 32-Kbyte, 8-way L1 instruction and data caches; per-core private 256-Kbyte, 8-way L2 caches; and one shared 8-Mbyte, 16-way L3 cache.8 We use CACTI 6.5 to model power at 32 nm.9 We also use a detailed DRAM power model based on Micron Technology's power model.10 In this section, we report total system energy, which includes the energy consumption of processors (cores and caches), the on-chip network, and off-chip memory. For DCC and Co-DCC, we use four-block superblocks, 64-byte blocks, and 16-byte sub-blocks. With these parameters, DCC has similar area overhead to FixedC, which doubles the number of tags and compresses a block to half size, if possible, and VSC-2X, which doubles tags but compresses a block into zero to four 16-byte sub-blocks (see the "Compressed Cache Overheads" sidebar).

Our evaluations use representative multithreaded and multiprogrammed workloads from commercial workloads (apache, jbb, oltp, zeus),11 SPEC-OMP (ammp, applu, equake, mgrid, wupwise),12 Parsec (blackscholes, canneal, freqmine),13 and mixes of SPEC CPU2006 benchmarks denoted as m1 to m8 (bzip2, libquantum-bzip2, libquantum, gcc, astar-bwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip, omnetpp-lbm).

Improved cache efficiency


Compressed caches improve the cache's effective capacity by fitting more blocks into the same space. They can achieve the benefits of larger cache sizes with lower area and power overheads.


Figure 4. Timing of a conventional cache (a) and DCC (b). A 64-byte block is returned in a burst
of four cycles on the same data bus. With DCC, only the matching sub-blocks are read and
fed directly into the decompression logic.

Result 1: By exploiting spatial locality, DCC achieves on average 2.2× (and up to 4×) higher LLC effective capacity compared to the baseline, resulting in an 18 percent lower LLC miss rate on average (and up to 38 percent lower).

Result 2: Co-DCC further improves the effective cache capacity by reducing internal fragmentation within data sets. It achieves on average 2.6× (and up to 4×) higher effective capacity and a 24 percent (up to 42 percent) lower LLC miss rate.

Result 3: DCC and Co-DCC provide significantly higher effective cache capacity and lower miss rates than FixedC and VSC-2X. DCC and Co-DCC also perform better on average than a cache with twice the capacity (2× baseline) while incurring much lower area overhead.

Figure 5a shows the LLC effective capacity of different techniques normalized to the baseline. We calculate the effective cache capacity by counting valid LLC cache blocks periodically. DCC can significantly improve LLC effective capacity and LLC miss rate (misses per kilo executed instructions [MPKI]) for many applications by fitting more compressed blocks. DCC benefits differ per workload, depending on workload sensitivity to cache capacity, compression ratio, and spatial locality. It achieves the greatest benefit for cache-sensitive workloads with good compressibility and spatial locality (such as apache and omnetpp-lbm/m8). Workloads with low spatial locality (such as canneal) or low compression ratio (such as wupwise) observe lower improvements. Cache-insensitive workloads (such as blackscholes) also do not benefit from compression.

Overall performance and energy


By improving LLC utilization and reducing accesses to the main memory (that is, the
lower LLC miss rate), DCC and Co-DCC
significantly improve system performance
and energy.


Result 4: DCC and Co-DCC improve


LLC efficiency and boost system performance by 10 percent (up to 29
percent) and 14 percent (up to 38
percent) on average, respectively.
Result 5: DCC and Co-DCC save on
average 8 percent (up to 24 percent)
and 12 percent (up to 39 percent) of
system energy, respectively, because
of shorter runtime and fewer accesses
to the main memory.


Figure 5. Normalized LLC effective capacity (a); normalized runtime (b); normalized total system energy (c). DCC and Co-DCC improve LLC utilization, resulting in higher performance and energy improvements than previous work and the 2× baseline.


Result 6: DCC and Co-DCC achieve 2.5× and 3.5× higher performance improvements, respectively, and 2.2× and 3.3× higher system energy improvements compared with FixedC and VSC-2X.

Result 7: DCC and Co-DCC also improve LLC dynamic energy by about 50 percent on average by accessing fewer bytes. However, VSC-2X hurts LLC dynamic energy for the majority of our workloads because of its need for energy-expensive recompactions.

Figure 5b shows that DCC outperforms the baseline, FixedC, VSC-2X, and the 2× baseline by effectively more than doubling the cache capacity. DCC and Co-DCC also improve system energy owing to shorter runtime and fewer accesses to the main memory. Figure 5c shows the total system energy of the different techniques. DCC and Co-DCC significantly reduce the main memory dynamic energy by reducing the number of cache misses, which contributes to greater system energy improvements as well. Unlike VSC-2X, which hurts LLC dynamic energy because of recompaction, DCC and Co-DCC eliminate the need for recompaction and can even save LLC dynamic energy by accessing fewer bytes when reading or writing compressed data.

DCC demonstrates the potential for compression to increase the effective capacity of last-level caches, improving both performance and energy efficiency. Alternatively, DCC can reduce the chip area required for a given cache capacity, thereby reducing implementation cost. Future work should explore algorithms that better compress instructions and floating-point data, as well as extending compression to all levels of the cache and memory hierarchy.
Acknowledgments
This work is supported in part by the
National Science Foundation (CNS-0916725, CCF-1017650, CNS-1117280,
and CCF-1218323) and a University of Wisconsin Vilas award. The views expressed
herein are not necessarily those of the NSF.
Professor Wood has a significant financial
interest in AMD. We thank Hamid Reza
Ghasemi, Dan Gibson, members of the Multifacet research group, and the anonymous
reviewers for their comments on the article.

References

1. S. Sardashti and D. Wood, "Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching," Proc. 46th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2013, pp. 62-73.

2. A. Alameldeen and D. Wood, "Adaptive Cache Compression for High-Performance Processors," Proc. 31st Ann. Int'l Symp. Computer Architecture, 2004, pp. 212-223.

3. A. Seznec, "Decoupled Sectored Caches: Conciliating Low Tag Implementation Cost and Low Miss Ratio," Proc. 21st Ann. Int'l Symp. Computer Architecture, 1994, pp. 384-393.

4. X. Chen et al., "C-Pack: A High-Performance Microprocessor Cache Compression Algorithm," IEEE Trans. VLSI Systems, vol. 18, no. 8, 2010, pp. 1196-1208.

5. D. Weiss et al., "An 8MB Level-3 Cache in 32nm SOI with Column-Select Aliasing," Proc. Solid-State Circuits Conf., 2011, pp. 258-260.

6. International Technology Roadmap for Semiconductors, 2010 Update, ITRS, 2011; www.itrs.net.

7. M. Martin et al., "Multifacet's General Execution-Driven Multiprocessor Simulator (GEMS) Toolset," Computer Architecture News, vol. 33, no. 4, 2005, pp. 92-99.

8. "4th Generation Intel Core i7 Processors," Intel Corporation; www.intel.com/products/processor/corei7.

9. "CACTI: An Integrated Cache and Memory Access Time, Cycle Time, Area, Leakage, and Dynamic Power Model," HP Labs Research; www.hpl.hp.com/research/cacti.

10. "Calculating Memory System Power for DDR3," tech. note TN-41-01, Micron Technology, 2007.

11. A. Alameldeen et al., "Simulating a $2M Commercial Server on a $2K PC," IEEE Computer, vol. 36, no. 2, 2003, pp. 50-57.

12. V. Aslot et al., "SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance," Proc. Int'l Workshop OpenMP Applications and Tools: OpenMP Shared Memory Parallel Programming, 2001, pp. 1-10.

13. C. Bienia and K. Li, "PARSEC 2.0: A New Benchmark Suite for Chip-Multiprocessors," Proc. 5th Ann. Workshop Modeling, Benchmarking and Simulation, 2009, pp. 47-55.

Somayeh Sardashti is a PhD candidate in the Department of Computer Sciences at the University of Wisconsin-Madison. Her research interest is computer architecture, specifically energy-optimized memory hierarchies. Sardashti has an MS in computer science from the University of Wisconsin-Madison and an MS in computer engineering from the University of Tehran. She is a member of IEEE and the ACM.

David A. Wood is a professor in the Department of Computer Sciences and the Department of Electrical and Computer Engineering at the University of Wisconsin-Madison. His research interests include techniques for improving the performance and energy efficiency of multiprocessor and heterogeneous computing systems. Wood has a PhD in computer science from the University of California, Berkeley. He is a fellow of IEEE and the ACM, and a member of the IEEE Computer Society.

Direct questions and comments about this article to Somayeh Sardashti, Department of Computer Science, University of Wisconsin-Madison, 1210 West Dayton Street, Madison, WI 53706-1685; somayeh@cs.wisc.edu.


................................................................................................................................................................................................................

SONIC MILLIP3DE: AN ARCHITECTURE


FOR HANDHELD 3D ULTRASOUND

................................................................................................................................................................................................................

SONIC MILLIP3DE, A SYSTEM ARCHITECTURE AND ACCELERATOR FOR 3D ULTRASOUND


BEAMFORMING, HAS A THREE-LAYER DIE-STACKED DESIGN THAT COMBINES A
HARDWARE-FRIENDLY APPROACH TO THE ULTRASOUND IMAGING ALGORITHM WITH A
CUSTOM BEAMFORMING ACCELERATOR. THE SYSTEM ACHIEVES HIGH-QUALITY 3D
ULTRASOUND IMAGING WITHIN A FULL-SYSTEM POWER OF 15 W IN 45-NM
SEMICONDUCTOR TECHNOLOGY. SONIC MILLIP3DE IS PROJECTED TO ACHIEVE THE TARGET

5-W POWER BUDGET BY THE 16-NM TECHNOLOGY NODE.


Richard Sampson
University of Michigan
Ming Yang
Siyuan Wei
Chaitali Chakrabarti
Arizona State University
Thomas F. Wenisch
University of Michigan


Much as every medical professional listens beneath the skin with a stethoscope today, we foresee a time when handheld medical imaging will become just as ubiquitous, with clinicians peering under the skin using a handheld imaging device. Mobile medical imaging is advancing rapidly to reduce the footprint of bulky, often room-sized machines to compact handheld devices. In the last five years, research has demonstrated that by combining the increasing capabilities of mobile processors with intelligent system design, portable and even handheld imaging devices are not only possible, but commercially viable. In particular, ultrasound imaging has proven to be an especially successful candidate for high portability due to its safety and low transmit power, with commercial handheld 2D ultrasound devices marketed and being used in hospitals today. Newly developed portable imaging devices have not only led to demonstrated improvements in patient health,1 they have also enabled new applications for handheld ultrasound, such as disaster relief care2 and battlefield triage.3 However, despite the increasing capabilities of handheld ultrasound devices, these systems remain unable to produce the high-quality, real-time 3D images that are possible with their non-handheld counterparts.
In recent years, many hospitals have been transitioning to 3D ultrasound imaging when mobility is not required because it provides numerous benefits over 2D, including increased technician productivity, greater volumetric measurement accuracy, and more readily interpreted images. 3D imaging can also enable advanced diagnostic capabilities, such as tissue sonoelastography through high-velocity 3D motion tracking and accurate blood-flow measurements via 3D Doppler. Creating a handheld 3D system could enable hospital-quality ultrasound imaging in nearly any setting, greatly expanding the way ultrasound is used today. However, 3D ultrasound comes with many challenges that are compounded when implementing a system in a handheld form factor. The sheer amount of data that must be sensed, transferred, and computed is nearly 5,000 times more than in a 2D system. At the same time, the massive data rate (as high as 6 terabits/second) of the


received echo signals is so high that the data cannot easily be transferred off chip for image formation; current 3D systems typically transfer data for only a fraction of receive channels, sacrificing image quality or aperture size. In addition to the extreme computational requirements, power is of the utmost importance, not only to ensure adequate battery life, but more importantly because the device is in direct contact with the patient's skin, placing tight constraints on safe operating temperature.

For safe operation, a handheld ultrasound system must operate within roughly a 5-W power budget. Implementing a handheld 3D system with commercially available digital signal processor (DSP) or graphics-accelerator chips using conventional beamforming algorithms designed for software is simply infeasible. Our analysis indicates that it would take 700 ultrasound DSP chips with a total power budget of 7.1 kW to meet typical 3D imaging computational demands at just 1 frame per second (fps). To enable such demanding computation on such a low power budget, a complete rethink of both the algorithm and the architecture is required.

In this article, we introduce Sonic Millip3De,4,5 a hardware accelerator that combines a new approach to the ultrasound imaging algorithm, better suited to hardware, with modern computer architecture techniques to achieve high-quality 3D ultrasound imaging within a full-system power of 15 W in 45-nm semiconductor technology. Under anticipated scaling trends, we project that Sonic Millip3De will achieve our target 5-W power budget by the 16-nm technology node.

We present this work both to make progress on realizing the promise of handheld medical imaging and as a case study for application-specific accelerator design. Our work also illustrates the unique benefits of 3D die stacking in heterogeneous systems and motivates moving beyond the limitations of the conventional von Neumann architecture in certain applications.

Synthetic aperture ultrasound

Synthetic aperture ultrasound imaging is performed by sending high-frequency pulses (typically 1 to 15 MHz) into a medium and constructing an image from the reflected pulse signals. The process comprises three stages: transmit, receive, and beamsum. Transmission and reception are both done using an array of transducers that are electrically stimulated to produce the outgoing signal and that generate current when they vibrate from the returning echo. After all echo data is received, the beamsum process (the computation-intensive stage) combines the data into a partial image. The partial image corresponds to echoes from a single transmission. Several transmissions from different locations on the transducer array are needed to produce high-quality images, so several iterations of transmit, receive, and beamsum are typically necessary to construct a single complete frame.

Each transmission is a pulsed signal conceptually originating from a single location in the array, shown in Figure 1a. To improve signal strength, multiple transducers can fire together in a pattern to emulate a single virtual source located behind the transducer array.6 The pulse expands into the medium radially, and as it encounters interfaces between materials of differing density, the signal is partially transmitted and partially reflected, as shown in Figure 1b. The returning echoes cause the transducers to vibrate, generating a current signal that is digitized and stored in a memory array associated with each transducer. Each position within these memory arrays corresponds to a different round-trip time from the emitting transducer to the receiving transducer. Because transducers cannot distinguish the direction of an incoming echo, each array element contains the superimposed echoes from all locations in the imaging volume with equal round-trip times (that is, an arc in the imaging volume). Because of the geometry of the problem, the round-trip arcs are different for each transducer, resulting in different superpositions at each receiver. The beamsum operation sums the echo intensity observed by all transducers for the arcs intersecting a particular focal point (that is, a location in the imaging volume), yielding a strong signal when the focal point lies on an echoic boundary. Combining transmissions from multiple source locations allows further focusing.
A typical beamsum pipeline first transforms the raw signal received from each


Figure 1. Synthetic aperture ultrasound. Pulse leaving the transmit transducer (a). Echo pulses reflecting from points B and C; all transducers in the array (or subaperture) receive the echo data, but at different times due to different round-trip distances (b). All of the reconstructed data for point B from each of the transducers added together; by adding thousands of views together, crisp points become focused (c). Variables used in calculating the round-trip delay, d_P, for the ith transducer and point P in Equation 1 (d).


transducer to enhance signal quality. The signal is upsampled using an interpolation filter to generate additional data points between the received samples. This process enhances resolution without the power and storage overheads of increasing the data sampling rate of the analog front end. Then, so-called apodization scaling factors are applied to the interpolated data to place greater weight on receivers near the origin of the transmission, because these signals are more accurate owing to their lower angle of incidence.

Once the data has been preprocessed (transformed), the beamsum operation can begin. In essence, this entails calculating the round-trip delay between the emitting transducer and all receiving transducers through each focal point, converting these delays into indices in each transducer's received signal array, retrieving the corresponding data, and summing these values. Figure 1c illustrates this process. These partial images are then summed over multiple transmissions. Finally, a demodulation operation is applied to remove the ultrasound carrier signal.

The delay calculation (identifying the right index within each receive array) is the most computationally intensive aspect of beamsum, because it must be completed for every {focal point, transmit transducer, receive transducer} trio. Because the transmit signal propagates radially, the image space is described by a grid of scanlines that radiate at a constant angular increment from the center of the transducer array into the image volume. Focal points are located at even spacing along each scanline, in effect creating a spherical coordinate system. However, the transducers are laid out in a grid-based Cartesian coordinate system, requiring a fairly complex law-of-cosines calculation to compute round-trip delays via Equation 1:

d_P = \frac{1}{c}\left(R_P + \sqrt{R_P^2 + x_i^2 - 2 x_i R_P \sin\theta}\right) \quad (1)

In this equation, d_P is the round-trip delay from the center transducer to point P to transducer i, c is the speed of sound in tissue (1,540 m/s), R_P is the radial distance of point P from the center of the transducer array, θ is the angular distance of point P from the line normal to the center transducer, and x_i is the distance of transducer i from the center. Figure 1d shows the variables as they correspond to the system geometry. This formula requires extensive evaluation of both trigonometric functions and square roots; hence, many 2D ultrasound systems precalculate all delays and store them in a lookup table (LUT).7 However, a typical 3D system requires roughly 250 billion unique delay values, making a LUT implementation impractical. Instead, delays are calculated as needed.8
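To ground the description, the sketch below evaluates Equation 1 and the basic delay-and-sum loop it feeds for a single focal point; the sampling rate and data layout are illustrative placeholders rather than the system's actual front-end parameters.

```python
import math

C_TISSUE = 1540.0          # speed of sound in tissue (m/s)
FS = 40e6                  # illustrative A/D sampling rate (Hz), not the system's spec

def round_trip_delay(r_p, theta, x_i, c=C_TISSUE):
    """Equation 1: delay from the center transducer to focal point P and back
    to transducer i, with P at radius r_p and angle theta (law of cosines)."""
    return (r_p + math.sqrt(r_p**2 + x_i**2 - 2.0 * x_i * r_p * math.sin(theta))) / c

def beamsum_point(rx_arrays, x_positions, r_p, theta, fs=FS):
    """Delay-and-sum for one focal point: index each receiver's echo array at
    its own round-trip delay and accumulate. rx_arrays[i] holds the (already
    interpolated and apodized) samples for the transducer at x_positions[i]."""
    total = 0.0
    for samples, x_i in zip(rx_arrays, x_positions):
        idx = int(round(round_trip_delay(r_p, theta, x_i) * fs))
        if idx < len(samples):
            total += samples[idx]
    return total
```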

Redesigning the ultrasound algorithm for hardware acceleration

A key innovation of Sonic Millip3De is to codesign hardware with a new beamforming algorithm better suited to hardware acceleration. Our main algorithmic insight is to replace the expensive exact delay calculation of Equation 1 with an iterative, piecewise quadratic approximation, which can be computed efficiently using only add operations.

Figure 2. Sonic Millip3De hardware overview. Layer 1 comprises 120 × 88 transducers grouped into banks with one transducer per bank in each subaperture. Analog transducer outputs from each bank are multiplexed and routed over through-silicon vias (TSVs) to Layer 2, comprising 1,024 analog-to-digital converter (ADC) units operating at 40 MHz and static RAM (SRAM) arrays to store incoming samples. The stored data is passed via face-to-face links to Layer 3 for processing in the three stages of the 1,024-unit beamsum accelerator. The transform stage upsamples the signal to 160 MHz. The 10 units in the select stage map signal data from the receive time domain to the image space domain in parallel for 10 scanlines. The reduce stage combines previously stored data from memory with the incoming signal from all 1,024 beamsum nodes over a unidirectional pipelined interconnect, and the resulting updated image is written back to memory.

The algorithm's iterative nature lends itself to an efficient data streaming model, allowing the proposed hardware to exploit locality and eliminate inefficient address calculation and memory-access operations that are a bottleneck in conventional implementations. Our early analysis shows that the delta function between adjacent focal-point delays on a scanline forms a smooth curve, and indices can be approximated accurately (with error similar to that introduced by interpolation) over short intervals with quadratic approximations. We replace these exact delta curves with a per-transducer precomputed piecewise quadratic approximation constrained to allow an index error of at most 3 (corresponding to at most a 30-μm error between the estimated and exact focal point), thus resulting in negligible blur.

Using offline image quality analysis, we have determined that, for a target imaging depth of 8 cm, we can meet the error constraints with only three piecewise sections. Each section requires precalculating three coefficients and a section cut-off, achieving a 250-times storage reduction relative to an exhaustive lookup table. Through careful pipelining of the beamforming process, the constants can be efficiently streamed from off-chip memory, limiting storage requirements within the beamforming accelerator.

Sonic Millip3De

The Sonic Millip3De system hardware (shown in Figure 2) is divided into three distinct silicon die layers: transducers and analog components, analog-to-digital converters (ADCs) and storage, and beamforming computation. The layers are connected vertically using through-silicon vias (TSVs). The 3D-stacked chip connects to separate LPDDR2 (low-power double data rate 2) memory. All


of these components are integrated directly into the ultrasound scanhead, the wand-like device held against the patient's skin to obtain ultrasound images, allowing for a complete handheld device. These components comprise an ultrasound system's front end, capable of generating raw, volumetric images. A separate back end for viewing and postprocessing might be implemented in a tablet or PC.

Using a 3D die-stacked design provides several architectural benefits. First, it is possible to stack dies manufactured in different technologies. Hence, the transducer layer can be manufactured in a cost-effective process for the analog circuitry, higher voltages, and large geometry of ultrasonic transducers, while the beamforming accelerator can exploit the latest digital logic process technology. Second, stacking allows far more TSV links between dies than conventional chip pins, resolving the bandwidth bottleneck that plagues existing 3D systems where the probe and computation units are connected via cable. Third, TSV connections replace long wires that would otherwise be required in such a massively parallel system, reducing interconnect power requirements. Finally, stacking provides the potential for design modularity, where the same beamforming accelerator die could be stacked with alternative transducer arrays designed for different imaging applications.

The top die layer comprises a 120 × 88 grid of capacitive micromachined ultrasonic transducers (CMUTs) with λ/2 spacing. The area between the transducers is used for additional analog components and routing to the TSV interface. Our system uses a sliding subaperture technique where, for each transmit, only a 32 × 32 subgrid of transducers receives. The full 120 × 88 aperture is sampled over multiple transmissions, reducing hardware (and power) requirements at the cost of more transmissions per frame. The transducers are grouped into banks such that only one transducer per bank receives data in any subaperture. With this banking design, only a single beamforming channel is necessary for each of the 1,024 banks rather than each of the 10,560 transducers.

The second layer comprises 1,024 12-bit ADCs and static RAM (SRAM) arrays, which each correspond to the 1,024 transducer banks of the analog transducer layer. The ADCs are sampled at 40 MHz, storing the digital output into corresponding 6-Kbyte SRAM arrays. The SRAMs are clocked at 1 GHz and connect vertically to a corresponding computational unit on the beamforming accelerator layer, requiring a total of 24,000 face-to-face TSVs for data and address signals.

The final layer is the most complex of the three, comprising the beamforming accelerator processing units, a unidirectional pipelined interconnect, and a control processor (an M-class ARM core) that interfaces to the LPDDR2 off-chip memory.

Beamforming accelerator design

The beamforming accelerator comprises 1,024 independent channels, each divided into three conceptual stages: transform, select, and reduce. Each of these stages performs a separate operation to convert the digitized receive samples into beamformed focal-point intensities. Although the transform-select-reduce conceptual framework is particularly well suited for ultrasound beamforming, this design paradigm could also be applicable to other problems with similar dataflow.

Transform
The transform unit operates on all of the receive data, performing a 4-times linear interpolation on the raw receive signals. After upsampling, a constant apodization is applied, providing a weight based on transducer position, as previously described.
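In software terms, the transform step reduces to the following sketch (the apodization weight here is a placeholder scalar, not the system's actual per-channel apodization profile):

```python
def transform(samples, weight):
    """4x linear interpolation of one channel's receive samples,
    followed by a constant apodization weight."""
    upsampled = []
    for a, b in zip(samples, samples[1:]):
        for k in range(4):                       # four output points per input interval
            upsampled.append(a + (b - a) * k / 4.0)
    upsampled.append(samples[-1])                # keep the final received sample
    return [weight * s for s in upsampled]
```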

Select
The select unit remaps data from the receive time domain to the image space domain using the algorithm described previously. The select unit is split into 10 subunits that concurrently operate on neighboring scanlines. These subunits each iterate over the same incoming datastream from a corresponding second-layer SRAM array in a synchronized fashion, reducing the number of times data must be read from the SRAM by a factor of 10. Figure 3 shows a block diagram of a single subunit.


Figure 3. Select unit microarchitecture. Select units map upsampled echo data from the receive time domain to image focal points. Sample data arrives from the transform unit at the input buffer, and each sample is either discarded or copied to the output buffer as determined by our piecewise quadratic approximation algorithm. The constant storage holds the precomputed constants and boundary for each approximation section. The adder chain calculates the next delta index value to determine how far ahead the hardware needs to iterate to find the next focal point, with the final adder accumulating fractional bits from previous additions. The select decrementor is initialized with the integer component of the adder chain. In each cycle, the head of the input buffer is copied to the output if the decrementor is zero, or discarded if it is nonzero. The section decrementor tracks when to advance to the next piecewise section.

Data is streamed simultaneously into the input first-in, first-out (FIFO) buffer of each select subunit. As the subunits each drain their input buffers, new data is streamed in from SRAM. On each clock cycle, a sample is popped from the head of the input buffer, and the select logic determines whether the data corresponds to the next focal point on the scanline. If the sample is selected, it is copied to the output buffer. Otherwise, the sample is discarded. Whenever the output buffer fills, its contents are sent to the reduce stage.

The select logic implements the piecewise quadratic delay calculation described previously. The logic calculates the delta (that is, the number of samples to discard) between two consecutive focal points. The unit comprises constant storage that is preloaded with the piecewise quadratic constants required to process a particular set of scanlines, a decrementor that determines when to change sections, a series of adders to generate the delta value, and finally a decrementor that counts down in step with the input buffer and determines which data should be selected. After initialization, the subunit generates the first delta value (n = 0) to determine how much to advance the input. This delta value is then loaded into the select decrementor. Once the select decrementor reaches 0, the current data is selected and written to the output buffer. A new delta value (n = 1) is calculated and the process continues until the entire scanline has been generated. Because of the iterative nature of the calculation, deltas can be calculated efficiently using the adder chain shown in Figure 3. Using this design approach, we can change what is typically an address-calculation and load-intensive software loop to a streaming design, greatly improving efficiency.
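Behaviorally, each select subunit acts like the following streaming loop (a sketch, not RTL), consuming the upsampled sample stream and keeping only the samples that line up with focal points; it pairs naturally with the delta generator sketched earlier:

```python
def select(sample_stream, delta_indices):
    """Keep one sample per focal point; skip 'delta' samples between picks."""
    output = []
    it = iter(sample_stream)
    for delta in delta_indices:
        for _ in range(delta):       # select decrementor counting down
            next(it, None)           # discard samples between focal points
        picked = next(it, None)
        if picked is None:           # input stream exhausted
            break
        output.append(picked)        # copied to the output buffer
    return output
```

Because the loop only skips and copies, there are no address calculations or loads in the inner path, which is the property the hardware exploits.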

Reduce
The final stage is the reduce unit, which ties the 1,024 channels together via a pipelined network. Each reduce unit corresponds to a single node on the network and adds the data received from the preceding node to the data from the local select unit before sending the summed result to the next node on the network.

Table 1. 3D ultrasound system parameters.

Parameter                                Value
Total transmits per frame                96
Total transducers                        10,560
Receive transducers per subframe         1,024
SRAM size per receive transducer         4,096 × 12 bits
Focal points per scanline                4,096
Image depth                              10 cm
Image total angular width                π/6
Sampling frequency                       40 MHz
Interpolation factor                     4
Interpolated sampling frequency          160 MHz
Speed of sound (tissue)                  1,540 m/s
Target frame rate                        1 frame/s

Table 2. CNR values for ideal system and Sonic Millip3De (SM3D). Values correspond to cysts shown in Figure 4.

               Left column of cysts        Right column of cysts
               Ideal      SM3D             Ideal      SM3D
               3.59       3.58             1.93       1.85
               3.18       3.21             1.51       1.41
               2.68       2.67             1.94       1.85
               1.61       1.62             2.10       2.01
               1.10       1.18             2.39       2.30
               0.33       0.39             2.43       2.34


Methodology and results


We evaluate our system in terms of both image quality and system power. Because Sonic Millip3De is intended for diagnostic medical imaging, it is critical that it generates high-quality images, comparable to existing devices. Hence, the goal of our image quality evaluation is to confirm that the approximation techniques used to reduce power do not unacceptably degrade image quality. We contrast images generated according to our method against an ideal system without power constraints or approximations.

In our image quality analysis, we simulate cysts in tissue using Field II,9,10 varying cyst size with depth and covering a range of 8 cm (2- to 10-cm depth).

Figure 4. Image quality comparison. X to Z (horizontal) slice through a series of cysts from a 3D simulation using Field II,9,10 generated with double-precision floating-point arithmetic and exact delay index calculation (a). The same slice generated via our delay algorithm, fixed-point precision, and dynamic focus (b). Table 2 gives the contrast-to-noise ratios (CNRs) for both.

Table 1 shows the relevant ultrasound system parameters. We generate 3D images using both our system (iterative delay calculation and fixed-point adders) and an ideal system (full delay calculation and double-precision floating-point arithmetic). Figure 4 shows a 2D slice for both images. We quantitatively compare image quality using contrast-to-noise ratios (CNRs) for each cyst, shown in Table 2. Overall, Sonic Millip3De's image quality is nearly indistinguishable from the ideal case, providing high image quality and validating our algorithm design.
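The article does not spell out the CNR formula; a commonly used definition, assumed here only for illustration, contrasts the mean intensity inside the cyst with the surrounding background relative to the background variability:

```python
import numpy as np

def cnr(cyst_pixels, background_pixels):
    """Contrast-to-noise ratio, using one common definition (assumed):
    |mean(background) - mean(cyst)| / std(background)."""
    cyst = np.asarray(cyst_pixels, dtype=float)
    bg = np.asarray(background_pixels, dtype=float)
    return abs(bg.mean() - cyst.mean()) / bg.std()
```

Under any such ratio, the closeness of the paired columns in Table 2 is what supports the claim that the approximation does not degrade image quality.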
We analyze the full system power of Sonic Millip3De using a register-transfer-level design targeting a 45-nm standard cell library and Spice models of the global interconnect. Using results from synthesis (SRAM, beamformer, interconnect) and published values (transducers, analog-to-digital converters, memory interface, DRAM), we determine that the design requires 14.6 W in current 45-nm technology, falling a bit short of the ambitious 5-W power target. However, under current


scaling trends, we project that the design will meet the target 5-W budget by the 16-nm technology node.

Computer architecture is continuing to move toward heterogeneous designs, with accelerated processing units for multimedia and graphics already commonplace today. As the heterogeneous space grows, so will the need for more advanced accelerators. With Sonic Millip3De, we have targeted a specific application (handheld 3D ultrasound) that has incredible potential impact and whose unique form of computation is not well suited to existing designs. We believe that focusing on such problems will help future accelerator design move forward.

Furthermore, a key take-away from our process is the importance of codesigning algorithms and hardware when high-efficiency gains are required. Sonic Millip3De achieves orders-of-magnitude greater energy efficiency over stock ultrasound designs, which would not have been possible with just a simple hardware solution. The fundamental reworking of the algorithm itself enabled the streaming dataflow that lies at the heart of our efficiency gains, and was a critical component in our hardware design. However, because our algorithmic modifications introduce approximations, they necessitated a domain-specific evaluation to ensure that result quality was not compromised. Additionally, emerging architectural techniques such as 3D die stacking helped us create a design that simply could not have existed previously, and they show great potential for new and unique heterogeneous systems. MICRO
Acknowledgments
This work was partially supported by
NSF CCF-0815457, CSR-0910699, and
grants from ARM Inc. The authors thank
J. Brian Fowlkes, Oliver Kripfgans, and Paul
Carson for feedback and assistance with
image quality analysis and Ron Dreslinski
for assistance with Spice.

References

1. D. Weinreb and J. Stahl, "The Introduction of a Portable Head/Neck CT Scanner May Be Associated with an 86% Increase in the Predicted Percent of Acute Stroke Patients Treatable with Thrombolytic Therapy," Radiological Soc. of North America, 2008.
2. M. Shorter and D.J. Macias, "Portable Handheld Ultrasound in Austere Environments: Use in the Haiti Disaster," Prehospital and Disaster Medicine, vol. 27, no. 2, 2012, pp. 172-177.
3. J.A. Nations and R.F. Browning, "Battlefield Applications for Handheld Ultrasound," Ultrasound Quarterly, vol. 27, no. 3, 2011, pp. 171-176.
4. R. Sampson et al., "Sonic Millip3De: A Massively Parallel 3D Stacked Accelerator for 3D Ultrasound," Proc. 19th IEEE Int'l Symp. High-Performance Computer Architecture, 2013, pp. 318-329.
5. R. Sampson et al., "Sonic Millip3De with Dynamic Receive Focusing and Apodization Optimization," Proc. IEEE Ultrasonics, Ferroelectrics, and Frequency Control Soc. Symp., 2013, pp. 557-560.
6. C. Passmann and H. Ermert, "A 100-MHz Ultrasound Imaging System for Dermatologic and Ophthalmologic Diagnostics," IEEE Trans. Ultrasonics, Ferroelectrics and Frequency Control, vol. 43, no. 4, 1996, pp. 545-552.
7. K. Karadayi, C. Lee, and Y. Kim, "Software-Based Ultrasound Beamforming on Multicore DSPs," tech. report, Univ. of Washington, March 2011.
8. F. Zhang et al., "Parallelization and Performance of 3D Ultrasound Imaging Beamforming Algorithms on Modern Clusters," Proc. 16th Int'l Conf. Supercomputing, 2002, pp. 294-304.
9. J. Jensen, "Field: A Program for Simulating Ultrasound Systems," Proc. 10th Nordic-Baltic Conf. Biomedical Imaging, 1996, pp. 351-353.
10. J.A. Jensen and N.B. Svendsen, "Calculation of Pressure Fields from Arbitrarily Shaped, Apodized, and Excited Ultrasound Transducers," IEEE Trans. Ultrasonics, Ferroelectrics and Frequency Control, vol. 39, no. 2, 1992, pp. 262-267.

Richard Sampson is a PhD candidate in


the Department of Electrical Engineering
and Computer Science at the University of


Michigan. His research interests include hardware-system and accelerator design for imaging and computer vision applications. Sampson has an MS in computer science and engineering from the University of Michigan.

Ming Yang is a PhD student in the School of Electrical, Computer and Energy Engineering at Arizona State University. His research focuses on the development of algorithms and low-power hardware for a handheld 3D ultrasound imaging device. Yang has an MS in electrical engineering from Beijing University of Posts and Telecommunications.

Siyuan Wei is a PhD student in the School of Electrical, Computer and Energy Engineering at Arizona State University. His research focuses on algorithm-architecture codesign of ultrasound imaging systems, especially those based on Doppler imaging. Wei has a BE in electrical engineering from Huazhong University of Science and Technology.

Chaitali Chakrabarti is a professor in the School of Electrical, Computer and Energy Engineering at Arizona State University. Her research interests include low-power system design, VLSI algorithms, and architectures for signal processing systems. Chakrabarti has a PhD in electrical engineering from the University of Maryland. She is a fellow of IEEE.

Thomas F. Wenisch is an associate professor in the Department of Electrical Engineering and Computer Science at the University of Michigan. His research focuses on computer architecture, particularly multiprocessor and multicore systems, multicore programmability, smartphone architecture, data center architecture, and performance evaluation methodology. Wenisch has a PhD in electrical and computer engineering from Carnegie Mellon University. He is a member of IEEE and the ACM.

Direct questions and comments about this article to Thomas F. Wenisch, Computer Science & Engineering Department, 2260 Hayward St., Ann Arbor, MI 48109; twenisch@umich.edu.

HARDWARE PARTITIONING FOR BIG DATA ANALYTICS

TARGETED DEPLOYMENT OF HARDWARE ACCELERATORS CAN IMPROVE THROUGHPUT AND ENERGY EFFICIENCY IN LARGE-SCALE DATA PROCESSING. DATA PARTITIONING IS CRITICAL FOR MANIPULATING LARGE DATASETS AND IS OFTEN THE LIMITING FACTOR IN DATABASE PERFORMANCE. A HARDWARE ACCELERATOR FOR RANGE PARTITIONING AND A HARDWARE-SOFTWARE STREAMING FRAMEWORK PROVIDE AN ORDER OF MAGNITUDE IMPROVEMENT IN PARTITIONING ENERGY AND PERFORMANCE.

In the era of big data, diverse fields such as natural language processing, medical science, national security, and business management depend on analyzing massive, multidimensional datasets. These communities rely on computer systems to process data quickly and efficiently. In this article, we discuss specialized hardware to more effectively address this task.

Databases manage large quantities of data, letting users query and update the information that they contain. The database community has been developing algorithms to support fast or even real-time queries over relational databases, and, as data sizes grow, researchers increasingly opt to partition the data for faster subsequent processing. Partitioning enables the resulting partitions to be processed independently and more efficiently (that is, in parallel and with better cache locality). Partitioning is used in virtually all modern database systems, including Oracle Database 11g, IBM DB2, and Microsoft SQL Server 2012, to improve performance, manageability, and availability in the face of big data, and the partitioning step itself has become a key determinant of query-processing performance.


In this article, we demonstrate that software implementations of data partitioning have fundamental performance limitations that make it computation-bound, even after parallelization. We describe and evaluate a system that both accelerates data partitioning itself and frees processors for other computations. The system consists of two parts: an area- and power-efficient specialized processing element for range partitioning, called the Hardware-Accelerated Range Partitioner (HARP); and a high-bandwidth, hardware-software streaming framework that transfers data to and from HARP and integrates seamlessly with existing hardware and software.

Our approach

As the price of memory drops, modern databases aren't typically disk-I/O-bound,1,2 with many databases now either fitting into main memory or having a memory-resident working set. At Facebook, 800 servers supply over 28 Tbytes of in-memory data to users.3 Despite the relative scarcity of memory pins, there is ample evidence that these and other large data workloads don't saturate the available bandwidth and are largely computation-bound.


Lisa Wu
Raymond J. Barker
Martha A. Kim
Kenneth A. Ross
Columbia University


Figure 1. An example table of sales records range partitioned by date into smaller tables. Processing big data one partition at a time makes working sets cache-resident, improving the overall analysis speed.

Servers running Bing, Hotmail, and Cosmos (Microsoft's search, email, and parallel data analysis engines, respectively) show 67 to 97 percent processor use but only 2 to 6 percent memory bandwidth use under stress testing.4 Google's BigTable and Content Analyzer (large data storage and semantic analysis, respectively) show fewer than 10,000 last-level cache misses per millisecond, which represents just a couple percent of the total available memory bandwidth.5

Noting the same imbalances between computing and memory bandwidth, others have opted to save power and scale down memory throughput to better match computing throughput6,7 or to adjust the resource allocation in server microarchitectures.8 We propose to resolve the imbalance by deploying specialized hardware to alleviate computing bottlenecks and more fully exploit the available pin bandwidth. This work embodies a unique approach that maximizes memory bandwidth use rather than rebalancing memory and computing together. This is particularly significant in light of the large volumes of data used in modern analyses.

Background and motivation


To begin, we provide some background
on partitioning: its role and prevalence in
databases, and its software characteristics.

Partitioning background


Partitioning a table splits it into multiple smaller tables called partitions. Each row in the input table is assigned to exactly one partition on the basis of the value of the key field. Figure 1 shows an example table of sales transactions partitioned using the transaction date as the key. This work focuses on a particular partitioning method called range partitioning, which splits the space of keys into contiguous ranges, as illustrated in Figure 1, where sales transactions are partitioned by quarter. The boundary values of these ranges are called splitters.

Partitioning a table allows fine-grained synchronization and data distribution. Moreover, when tables become so large that they or their associated processing metadata can't fit in the cache, partitioning improves the performance of many critical database operations, such as joins, aggregations, and sorts.9-11 Partitioning is also used in databases for index building, load balancing, and complex query processing.12 More generally, a partitioner can improve locality for any application that needs to process large datasets in a divide-and-conquer fashion, such as histogramming, image alignment and recognition, MapReduce-style computations, and cryptanalysis.
To demonstrate the benefits of partitioning, let's examine joins. A join takes a common key from two tables and creates a new table containing the combined information from both tables. For example, to analyze how weather affects sales, we would join the sales records in SALES with the weather records in WEATHER, where SALES.date = WEATHER.date. If the WEATHER table is too large to fit in the cache, this process will have poor cache locality, as the left side of Figure 2 depicts. On the other hand, if both tables are partitioned by date, each partition can be joined in a pairwise fashion, as the right side of Figure 2 illustrates. When each partition of the WEATHER table fits in the cache, the per-partition joins can proceed more rapidly. When the data is large, the time spent partitioning is more than offset by the time saved with the resulting cache-friendly, partition-wise joins.

Join performance is critical because most queries begin with one or more joins to cross-reference tables, and, as the most data-intensive and costly operations, their influence on overall performance is large. We measured the fraction of Transaction Processing Performance Council Benchmark H (TPC-H; see http://www.tpc.org/tpch/default.asp) query-execution time attributable to joins using MonetDB (http://www.monetdb.org), an open source database that provides high performance on queries over large datasets. Figure 3 plots the percent of TPC-H runtime spent joining tables. The values shown are the median across the 10 runs of each query. Ranging from 5 to 97 percent, TPC-H spends on average 47 percent of its execution time in a join operation. Current join implementations spend up to half their time in partitioning,11 thus placing partitioning at approximately 25 percent of TPC-H query-execution time.

In addition to performance, a good partitioner will have several other properties. Ordered partitions, whereby there is an order among output partitions, are useful when a query requires a global data sort. Record order preservation, whereby all records in a partition appear in the same order in which they were found in the input table, is important for some algorithms (for example, radix sorting). Finally, skew tolerance maintains partitioning throughput even when input data is unevenly distributed across partitions. HARP provides all three of these properties as well as high performance and low energy use.
IEEE MICRO

Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page

M
q
M
q

M
q

M
q
MQmags
q

THE WORLDS NEWSSTAND

micro
IEEE

Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page

M
q
M
q

M
q

M
q
MQmags
q

THE WORLDS NEWSSTAND

Figure 2. Joining two large tables exceeds cache capacity. Thus, join implementations partition tables first and then compute partition-wise joins, each of which exhibits substantially improved cache locality.10,11 Joins are extremely expensive on large datasets, and partitioning represents up to half of the observed join time.11

Figure 3. Several key database operations such as join, sort, and aggregation use partitioning to improve performance. Here we see joins consuming 47 percent of the Transaction Processing Performance Council Benchmark H (TPC-H) execution time on MonetDB. With current join algorithms spending roughly half of the join time on partitioning,11 we estimate that partitioning for joins alone accounts for roughly one quarter of query-execution time.

Software partitioning evaluation

We now characterize the performance and limitations of software partitioning on general-purpose CPUs. Because partitioning scales with additional cores, we analyze both single- and multithreaded performance.10,11,13

For these characterizations, we use a microbenchmark that partitions 100 million random records. Although actual partitioning implementations would allocate output space on demand during partitioning, we

conservatively preallocate space for the output tables beforehand to streamline the inner loop. The partitioning inner loop runs over an input table, reading one record at a time, computing its partition using a partition function, and then writing the record to the destination partition. We implement the partition function using an equality range-partitioning implementation,14 which performs a binary search of the splitters.
three of these properties as well as high performance and low energy use.


Figure 4. Sixteen threads improve partitioning throughput by 8.5 times, peaking at 2.9 and 2.6 GBps for 128- and 256-way, respectively. However, partitioning remains computation-bound, underusing available memory bandwidth.

Figure 5. Block diagram of a typical two-core system with Hardware-Accelerated Range Partitioner (HARP) integration. New components (HARP and stream buffers) are shaded.




We benchmarked software partitioning throughput on an eight-core Xeon server, and observed the following from the data in Figure 4: partitioning throughput depends on the number of partitions; partitioning parallelizes reasonably well from one to 16 threads; and while the memory system supports a peak bandwidth of 25.6 Gbytes per second, our optimistic software microbenchmark was able to use only 3 GBps with 16 threads. Even after deploying all computing resources in the server, partitioning remains computing-bound, severely underusing available memory bandwidth. In contrast, we will demonstrate that a single HARP-accelerated thread achieves the throughput of close to 16 software threads using a fraction of the power.


Hardware-Accelerated Range Partitioner

As we saw in the previous discussion, a partitioner's input is a large table, and its output is a set of smaller tables that are easier to process by virtue of their smaller size. Here, we describe the architecture and microarchitecture of a system that incorporates HARP.

Overview
Figure 5 shows a block diagram of the major components in a system with range-partitioning acceleration. Two stream buffers, one running from memory to HARP (SBin) and the other from HARP to memory (SBout), decouple HARP from the rest of the system. The range-partitioning computation is accelerated in hardware (indicated by the double arrow in Figure 5), while inbound and outbound data stream management is left to software (single arrows in Figure 5), maximizing flexibility and simplifying the interface to the accelerator. One set of instructions provides configuration and control for HARP, which freely pulls data from and pushes data to the stream buffers, while a second set of streaming instructions moves data between memory and the stream buffers. Because data moves in a pipeline (streamed in from memory via the streaming framework, partitioned with HARP, and then streamed back out), the lowest-throughput component determines overall system throughput.


Figure 6. HARP draws records in bursts, serializing them into a single stream that is fed into a pipeline of comparators. At each stage of the pipeline, the record key is compared with a splitter value, and the record is either filed in a partition buffer (downward) or advanced (to the right) according to the comparison outcome. As records destined for the same partition collect in the buffers, the merge stage identifies and drains the fullest buffer, emitting a burst of records all destined for the same partition. (WE: write enable.)

HARP accelerator
The HARP acceleration is managed via three instructions. The set splitter instruction is invoked once per splitter to delineate a boundary between partitions; partition start signals HARP to start pulling data from SBin; and partition stop signals HARP to stop pulling data from SBin and drain all in-flight data to SBout. To program a 15-way partitioner, for example, HARP uses seven set splitter instructions to set each splitter value, followed by a partition start to start partitioning. Because HARP's microarchitectural state is not visible to other parts of the machine, the splitter values are not lost upon interruption.
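As a behavioral illustration of that programming model (the wrapper class and method names are hypothetical stand-ins for the three instructions, and the 2k + 1 partition count assumes the equality range-partitioning scheme, which is consistent with seven splitters programming a 15-way partitioner):

```python
class MockHarp:
    """Stand-in for the accelerator, recording the control sequence (hypothetical API)."""
    def __init__(self):
        self.splitters, self.running = [], False
    def set_splitter(self, boundary):
        self.splitters.append(boundary)       # set splitter: one call per boundary
    def partition_start(self):
        self.running = True                   # begin pulling records from SB_in
    def partition_stop(self):
        self.running = False                  # stop pulling and drain in-flight data to SB_out

def program_harp(harp, splitters):
    # Seven splitters -> 15 partitions under the assumed equality scheme (2k + 1).
    for boundary in splitters:
        harp.set_splitter(boundary)
    harp.partition_start()

harp = MockHarp()
program_harp(harp, [10, 20, 30, 40, 50, 60, 70])
harp.partition_stop()
```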
HARP pulls and pushes records in 64-byte bursts (tuned to match the system vector width and DRAM burst size). The HARP microarchitecture consists of three modules, as Figure 6 depicts, and is tailored to range partition data highly efficiently:


• The serializer pulls bursts of records from SBin and uses a simple finite state machine to pull each record from the burst and feed them, one after another, into the subsequent pipeline. As soon as one burst has been fed into the pipe, the serializer is ready to pull the subsequent burst.
• The conveyor compares record keys against splitters. The conveyor accepts a stream of records from the serializer into a deep pipeline with one stage per splitter. At each stage, the key is compared to the corresponding splitter and routed either to the appropriate partition or to the next pipeline stage. Partition buffers, one per partition, buffer records until a burst of them is ready.
• The merge module monitors the partition buffers as records accumulate. It looks for full bursts of records that it can send to a single partition. When such a burst is ready, merge drains the partitioning buffer, one record per cycle, and sends the burst to SBout.
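The list above maps onto a compact behavioral model (a Python sketch, not the Bluespec implementation): records move down a comparator pipeline, drop into per-partition buffers, and are emitted a burst at a time by the merge stage. The burst size and the (key, payload) record format are placeholders, and the simple less-than comparison stands in for the full comparator logic.

```python
BURST = 4  # records per 64-byte burst, assuming 16-byte records

def conveyor_merge(records, splitters):
    """Behavioral sketch of the conveyor + merge stages (order-preserving)."""
    buffers = [[] for _ in range(len(splitters) + 1)]   # one partition buffer per range
    out_bursts = []

    def drain_fullest():
        idx = max(range(len(buffers)), key=lambda i: len(buffers[i]))
        burst, buffers[idx][:] = buffers[idx][:BURST], buffers[idx][BURST:]
        out_bursts.append((idx, burst))                  # burst headed to one partition

    for key, payload in records:
        # Conveyor: compare against each splitter in pipeline order, file the
        # record into the first matching partition buffer, or fall through.
        for stage, splitter in enumerate(splitters):
            if key < splitter:
                buffers[stage].append((key, payload))
                break
        else:
            buffers[len(splitters)].append((key, payload))
        # Merge: once some buffer holds a full burst, drain it toward SB_out.
        if any(len(b) >= BURST for b in buffers):
            drain_fullest()

    while any(buffers):                                  # final drain (partition stop)
        drain_fullest()
    return out_bursts
```

Because each record has exactly one path from the input to its partition buffer and buffers are drained in order, records within a partition come out in input order, which is the record-order-preservation property noted earlier.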

HARP uses deep pipelining to hide the latency of multiple splitter comparisons. We experimented with a tree topology for the conveyor, analogous to the binary search tree


in the software implementation, but found that the linear conveyor architecture was preferable. When the pipeline operates bubble-free, as it does in both cases, it processes one record per cycle, regardless of topology. The only difference in total cycle count between the linear and tree conveyors was the overhead of filling and draining the pipeline at the start and finish. With large record counts, the difference in time required to fill and drain a k-stage pipeline versus a log(k)-stage tree pipeline is negligible. Although cycle counts were more or less the same between the two approaches, the linear design had a slightly shorter clock period because of the more complex layout and routing requirements of the tree, resulting in slightly better overall throughput.

The integer comparators in HARP can support all SQL data types as partitioning keys, because their representations typically lend themselves to integer comparisons. For example, MySQL represents dates and times as integers: dates as 3 bytes, time stamps as 4 bytes, and datetimes as 8 bytes.15 HARP can also partition ASCII strings alphabetically on the first N characters with an N-byte integer comparator.
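For instance (an illustrative encoding, not MySQL's on-disk format), packing the leading bytes of an ASCII string into a big-endian integer preserves alphabetical order under integer comparison, which is what lets a fixed-width comparator partition string prefixes:

```python
def string_prefix_key(s, n=4):
    """Pack the first n ASCII characters into a big-endian integer.
    Integer order of the packed keys matches alphabetical order of the prefixes."""
    prefix = s.encode("ascii")[:n].ljust(n, b"\x00")   # pad short strings with zero bytes
    return int.from_bytes(prefix, "big")

assert string_prefix_key("apple") < string_prefix_key("apricot") < string_prefix_key("banana")
```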

Delivering data to and from HARP


To ensure that HARP can process data at its full throughput, the framework surrounding HARP must stream data to and from memory at or exceeding the rate that HARP can partition. This framework provides software-controlled streams and allows the machine to continue seamless execution after an interrupt, exception, or context switch. We describe a hardware-software streaming framework based on the concept outlined in Jouppi's prefetch stream buffer work.16

Software moves data between memory and the stream buffers via four instructions. The sbload instruction loads data from memory to SBin, taking as arguments a source address in memory and a destination stream buffer ID. The sbstore instruction does the reverse, taking data from the head of the designated outgoing stream buffer and writing it to the specified address. Each sbload and sbstore moves one vector's worth of data (that is, 128 or 256 bytes) between memory and the stream buffers. A full/empty bit on the stream buffers blocks sbloads until there is space in SBin and blocks sbstores until data is available in SBout. Because the CPU software knows the table size, it knows how many sbload/sbstore instructions must be executed to partition the entire table.

To ensure seamless execution after an interrupt, exception, or context switch, we make a clean separation of architectural and microarchitectural states. Specifically, only the stream buffers themselves are architecturally visible, with no accelerator state exposed architecturally. This separates the HARP microarchitecture from the context and will help facilitate future extension to other streaming accelerators. Before the machine suspends accelerator execution to service an interrupt or a context switch, the OS will execute an sbsave instruction to save the stream buffer contents. Prior to an sbsave, HARP must be stopped and allowed to drain its in-flight data to an outgoing stream buffer by executing a partition stop instruction. As a consequence, the stream buffers should be sized to accommodate the maximum amount of in-flight data supported by HARP. After the interrupt has been serviced, before resuming HARP execution, the OS will execute an sbrestore to ensure that the streaming state is identical before and after the interrupt or context switch.

These stream buffer instructions, together with the HARP instructions described in the previous section, allow full software control of all aspects of the partitioning operation, except for the work of partitioning itself, which is handled by HARP.
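Put together, the software side of a partitioning pass reduces to a simple copy loop around the accelerator. The wrapper names below are hypothetical stand-ins for the sbload/sbstore instructions and the sketch simplifies their blocking behavior; it is shown only to make the control flow concrete.

```python
VECTOR_BYTES = 256   # one sbload/sbstore moves one vector's worth (128 or 256 bytes)

def partition_pass(table_size, sb_load, sb_store):
    """Interleave loads and stores so neither stream buffer stays full for long.

    sb_load(offset) and sb_store(offset) are hypothetical wrappers around the
    streaming instructions; each blocks on the full/empty bit when needed.
    Because partitioning neither adds nor drops records, the pass needs the
    same number of loads and stores, known in advance from the table size.
    """
    n = (table_size + VECTOR_BYTES - 1) // VECTOR_BYTES
    for i in range(n):
        sb_load(i * VECTOR_BYTES)    # memory -> SB_in (HARP pulls from here)
        sb_store(i * VECTOR_BYTES)   # SB_out -> memory (HARP pushes here)

# Example with no-op stubs standing in for the real instructions:
partition_pass(10_000, sb_load=lambda off: None, sb_store=lambda off: None)
```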
To implement the streaming instructions, we propose minimal modifications to conventional processor microarchitecture. Figure 7 summarizes the new additions. The sbload instructions borrow the existing microarchitectural vector-load request path (for example, Intel's Streaming SIMD Extensions or PowerPC's AltiVec), diverging from vector-load behavior when data fills return to the stream buffer instead of the data cache hierarchy. To support this, we add a 1-bit attribute to the existing last-level cache request buffer to differentiate sbload requests from conventional vector load requests.


Figure 7. Implementation of streaming instructions into the existing datapath of a generic last-level cache request/fill microarchitecture. Required minimal modifications are shaded.

This attribute acts as the multiplexer select for the return datapath, as Figure 7 illustrates. Finally, a dedicated bidirectional data bus is added to connect that mux to the stream buffer.

Stream buffers can be made fully coherent with the core caches. The sbload instructions already reuse the load request path, so positioning SBin on the fill path such that hits in the cache can be returned to SBin will ensure that sbload requests always produce the most up-to-date values. Figure 7 depicts the scenario in which a request misses all levels of the cache hierarchy and the fill is not cached, as sbload requests are noncacheable. On the store side, sbstore instructions can copy data from SBout into the existing store buffer, sharing the store datapath and structures such as the write-combining and snoop buffers.

Stream loads are most effective when data is prefetched ahead of use, and our experiments indicate that the existing hardware prefetchers are quite effective in bringing streaming data into the processor. Prefetches triggered by stream loads can be handled in one of two ways: fill the prefetched data into the cache hierarchy, as occurs in current processors, or fill the prefetched data into the stream buffer. We choose the former because it reduces the additional hardware support needed and incurs minimal cache pollution by marking prefetched data as nontemporal. Because sbload requests check the cache and request buffer for outstanding requests before sending the request out to the memory controller, this design allows for coalescing loads and stores and for shorter data return latency when the requests hit in the prefetched data in the cache.
latency when the requests hit in the prefetched data in the cache.

Evaluation

To evaluate the throughput, power, and area efficiency of our design, we implemented HARP in Bluespec System Verilog (www.bluespec.com). The partitioner evaluated here supports 16-byte records with 4-byte keys. Assuming 64-byte DRAM bursts, this works out to four records per burst. We evaluate the overhead of the streaming framework using CACTI.17 For further details about the methodology, including synthesis settings, please refer to the methodology section of our other work.18

We evaluate the proposed HARP system in the following categories:

• throughput comparison with the optimistic software range partitioning from the Software partitioning evaluation section,
• area and power comparison with the processor core on which the software experiments were performed, and
• nonperformance partitioner desiderata.

We use the baseline configuration of HARP outlined in the previous paragraph, unless otherwise noted.

HARP throughput
Figure 8 plots the throughput of three range partitioner implementations: single-threaded software, multithreaded software, and single-threaded software plus HARP.

Figure 8. A single HARP unit outperforms single-threaded software by between 7.8 times (with 63 or 255 partitions) and 8.8 times (with 31 partitions), approaching the throughput of 16 threads.

Figure 9. The streaming framework shares much of its implementation with the existing memory system, and as such its throughput will be comparable to the copy throughput of existing systems. (SSE: streaming SIMD extensions.)


We see that HARP's throughput exceeds a single software thread by 6.5 to 8.8 times, with the difference primarily attributable to the elimination of instruction fetch and control overhead from the splitter comparison and the deep pipeline. In particular, the structure of the partitioning operation doesn't introduce hazards or bubbles into the pipeline, allowing it to operate in a near-perfect fashion; that is, the pipeline stays full, accepting and emitting one record per clock cycle.

Figure 10. This figure shows the energy


consumption in joules per Gbyte of data
partitioned using HARP as the number of
partitions increases. HARP-augmented
cores partition data using 6.3 to 8.7 times
less energy than parallel or serial software.

We confirm this empirically: our measurements indicate average cycles per record ranging from 1.008 (for 15-way partitioning) to 1.041 (for 511-way partitioning). As Figure 8 indicates, 16 threads are required for the software implementation to match the throughput of the hardware implementation. At 3.13 GBps per core with HARP, augmenting all or even half of the eight cores with HARP would provide sufficient computing bandwidth to fully use all DRAM pins.

In terms of absolute numbers, the baseline HARP configuration achieved a 5.06-nanosecond critical path, yielding a design that runs at 198 MHz and delivers a partitioning throughput of 3.13 GBps. This is 7.8 times faster than the optimistic single-threaded software range partitioner described in the Software partitioning evaluation section.
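These absolute numbers are mutually consistent, as a quick back-of-the-envelope check shows (record size from the Evaluation section; decimal gigabytes assumed):

```python
clock_hz = 198e6           # synthesized clock from the 5.06-ns critical path
record_bytes = 16          # 16-byte records with 4-byte keys
cycles_per_record = 1.008  # measured best case (15-way partitioning)

throughput_gbps = clock_hz / cycles_per_record * record_bytes / 1e9
print(round(throughput_gbps, 2))   # ~3.14, in line with the reported 3.13 GBps
```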

Streaming throughput
Our results in Figure 9 show that C's standard library memcpy provides similar throughput to hand-optimized vector code, whereas scalar code's throughput is slightly lower. For comparison, we have also included the results of a similar experiment published by IBM Research.19 Based on these measurements, we conservatively estimate that the streaming framework can bring in data at 4.6 GBps and write results to memory at 4.6 GBps with a single thread. These data show that the streaming framework provides more throughput than HARP can take in, but not too much more, resulting in a balanced system.


Table 1. Area and power overheads of HARP units and stream buffers for various partitioning factors.

No. of        HARP unit                             Stream buffers
partitions    Area, mm2 (% Xeon)  Power, W (% Xeon)  Area, mm2 (% Xeon)  Power, W (% Xeon)
15            0.16 (0.4)          0.01 (0.3)          0.07 (0.2)          0.063 (1.3)
31            0.31 (0.7)          0.02 (0.4)          0.07 (0.2)          0.079 (1.6)
63            0.63 (1.5)          0.04 (0.7)          1.30 (0.2)          0.078 (1.6)
127           1.34 (3.1)          0.06 (1.3)          0.11 (0.3)          0.085 (1.7)
255           2.83 (6.6)          0.11 (2.3)          0.13 (0.3)          0.100 (2.0)
511           5.82 (13.6)         0.21 (4.2)          0.18 (0.4)          0.233 (4.7)

Area and power efficiency
The addition of the stream buffer and accelerator hardware does increase the area and power of the core. Table 1 quantifies the area and power overheads of the accelerator and stream buffers relative to a single Xeon core. Comparatively, the additional structures are small, with the baseline design point adding just 6.9 percent area and 4.3 percent power for both the HARP and the stream buffers. HARP itself consumes just 2.83 mm2 and 0.11 W.

Because the stream buffers are sized according to the accelerators they serve, we quantify their area and power overheads for each HARP partitioning factor we consider in Table 1. The proposed streaming framework adds 0.3 mm2 of area and consumes 10 mW of power for a baseline HARP configuration.

Energy efficiency
From an energy perspective, this slight increase in power is overwhelmed by the improvement in throughput. Figure 10 compares the partitioning energy per gigabyte of data of software (both serial and parallel) against HARP-based alternatives. The data show a 6.3 to 8.7 times improvement in single-threaded partitioning energy with HARP.

By design, HARP preserves the record order.



Figure 11. This figure shows the impact of uneven partition distribution on partitioning throughput. As input imbalance increases, throughput drops by at most 19 percent owing to increased occurrence of back-to-back bursts to the same partition.

All records in a partition appear in the same order that they were found in the input record stream. This is a useful property for other parts of the database system and is a natural consequence of the HARP structure, where only one route exists from the input port to each partition and records can't pass one another in flight.

We evaluate HARP's skew tolerance by measuring the throughput (that is, cycles per record) on synthetically unbalanced record sets. In this experiment, we varied the record distribution from optimal, where records were uniformly distributed across all partitions, to pessimal, where all records were sent to a single partition. Figure 11 shows the


gentle degradation in throughput as one partition receives an increasingly large share of records.

This mild degradation is due to the design of the merge module. Recall that this stage identifies which partition has the most records ready and drains them from that partition's buffer to send as a single burst back to memory. Back-to-back drains of the same partition require an additional merge cycle, which rarely happens when records are distributed across partitions. Note that this tolerance is independent of many factors, including splitter number, key size, and partitioned table size.

The baseline HARP design supports four records per burst, resulting in a 25 percent degradation in throughput between best- and worst-case skew. This is close to the degradation seen experimentally in Figure 11, where throughput sinks from 3.13 GBps with no skew to 2.53 GBps in the worst case.
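For reference, the 19 percent figure quoted in the Figure 11 caption follows directly from the two measured endpoints reported above (a reader-supplied calculation, added only to connect the numbers):

% Relative throughput loss between the no-skew and worst-case measurements:
\[
\frac{3.13\ \text{GBps} - 2.53\ \text{GBps}}{3.13\ \text{GBps}} \approx 0.19 \approx 19\%.
\]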

This research offers a novel solution to the problem of big data computing efficiency. The database community has spearheaded software innovations to improve the performance efficiency of database management systems running on commodity server hardware, and several researchers have proposed running big data analyses on field-programmable gate arrays. However, this is the first work to accelerate partitioning, a generic kernel that is a critical piece of many large data analyses. We have described a specialized database processing element and a streaming framework that provide seamless execution in modern computer systems and exceptional throughput and power-efficiency advantages over software. These benefits are necessary to address the ever-increasing demands of big data processing.

Processing data with accelerators such as HARP can alleviate serial performance bottlenecks in the application and free up resources on the server to do other useful work. Because databases and other data-processing systems represent a common, high-value server workload, the impact of improvements in partitioning performance would be widespread.

The design shows how accelerators can be seamlessly integrated into a CPU core. The streaming framework decouples the microarchitecture of the accelerator from the specifics of data layout and management. This allows seamless integration of the accelerator into existing software, as well as a clean mechanism for handling context switches and interrupts by saving and restoring just the contents of the stream buffers.

The research demonstrates the potential of data-oriented specialization. Moving data through the memory subsystem and CPU cache hierarchy consumes more than double the energy of the computation itself.20 With an application-specific integrated circuit designed specifically to process tables in a streaming fashion, the HARP system delivers an order-of-magnitude improvement in energy efficiency. The overall system design also makes it easy to introduce other streaming accelerators, such as specialized aggregators, joiners, sorters, filters, or compressors, to expand both the use and benefits of this approach. MICRO

References

1. A. Ailamaki et al., "DBMSs on a Modern Processor: Where Does Time Go?" Proc. 25th Int'l Conf. Very Large Data Bases, 1999, pp. 266-277.
2. G. Graefe and P.-A. Larson, "B-Tree Indexes and CPU Caches," Proc. 17th Int'l Conf. Data Engineering, 2001, pp. 349-358.
3. P. Saab, "Scaling Memcached at Facebook," 12 Dec. 2008; https://www.facebook.com/note.php?note_id=39391378919.
4. C. Kozyrakis et al., "Server Engineering Insights for Large-Scale Online Services," IEEE Micro, vol. 30, no. 4, 2010, pp. 8-19.
5. L. Tang et al., "The Impact of Memory Subsystem Resource Sharing on Datacenter Applications," Proc. 38th Ann. Int'l Symp. Computer Architecture, 2011, pp. 283-294.
6. K.T. Malladi et al., "Towards Energy-Proportional Datacenter Memory with Mobile DRAM," Proc. 39th Ann. Int'l Symp. Computer Architecture, 2012, pp. 37-48.
7. Q. Deng et al., "MemScale: Active Low-Power Modes for Main Memory," Proc. 16th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2011, pp. 225-238.
8. N. Hardavellas et al., "Toward Dark Silicon in Servers," IEEE Micro, vol. 31, no. 4, 2011, pp. 6-15.
9. Y. Ye, K.A. Ross, and N. Vesdapunt, "Scalable Aggregation on Multicore Processors," Proc. 7th Int'l Workshop Data Management on New Hardware, 2011, pp. 1-9.
10. S. Blanas, Y. Li, and J.M. Patel, "Design and Evaluation of Main Memory Hash Join Algorithms for Multi-Core CPUs," Proc. ACM Sigmod Int'l Conf. Management of Data, 2011, pp. 37-48.
11. C. Kim et al., "Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs," Proc. Very Large Data Bases, vol. 2, no. 2, 2009, pp. 1378-1389.
12. D. Chatziantoniou and K.A. Ross, "Partitioned Optimization of Complex Queries," Information Systems, vol. 32, no. 2, 2007, pp. 248-282.
13. J. Cieslewicz and K.A. Ross, "Data Partitioning on Chip Multiprocessors," Proc. 4th Int'l Workshop Data Management on New Hardware, 2008, pp. 25-34.
14. K.A. Ross and J. Cieslewicz, "Optimal Splitters for Database Partitioning with Size Bounds," Proc. 12th Int'l Conf. Database Theory, 2009, pp. 98-110.
15. MySQL, "Date and Time Data Type Representation," 1997, 2014; http://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html.
16. N.P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," Proc. 17th Ann. Int'l Symp. Computer Architecture, 1990, pp. 364-373.
17. HP Labs, "CACTI," 2008; http://www.hpl.hp.com/research/cacti.
18. L. Wu et al., "Navigating Big Data with High-Throughput, Energy-Efficient Data Partitioning," Proc. 40th Ann. Int'l Symp. Computer Architecture, 2013, pp. 249-260.
19. H. Subramoni et al., "Intra-Socket and Inter-Socket Communication in Multi-Core Systems," IEEE Computer Architecture Letters, vol. 9, no. 1, 2010, pp. 13-16.
20. W.J. Dally et al., "Efficient Embedded Computing," Computer, vol. 41, no. 7, 2008, pp. 27-32.

Lisa Wu is a research staff member at Intel Labs. Her research interests include computer architecture, accelerators, and energy-efficient computing on high-performance computing and big data. Wu has a PhD in computer science from Columbia University, where she performed the work for this article. She is a member of ACM Sigarch.

Raymond J. Barker is a software engineer at Google and an MS student in computer engineering at Columbia University. His research interests include application-specific computer architecture and parallel algorithms. Barker has a BS in computer engineering from Columbia University. He is a member of IEEE.

Martha A. Kim is an assistant professor in the Computer Science Department at Columbia University. Her research interests include computer architecture, parallel hardware and software systems, and energy-efficient computation on big data. Kim has a PhD in computer science and engineering from the University of Washington. She is a member of IEEE and the ACM.

Kenneth A. Ross is a professor of computer science at Columbia University. His research interests include database management systems, particularly their performance on modern multicore machines, GPUs, and other accelerator platforms. Ross has a PhD in computer science from Stanford University. He is a member of the ACM.

Direct questions and comments about this article to Lisa Wu, Intel, Building SC12, 3600 Juliette Ln., M/S 303, Santa Clara, CA 95052; lisa@cs.columbia.edu.

EFFICIENT SPATIAL PROCESSING ELEMENT CONTROL VIA TRIGGERED INSTRUCTIONS

Angshuman Parashar
Michael Pellauer
Michael Adler
Bushra Ahsan
Neal Crago
Intel
Daniel Lustig
Princeton University

Vladimir Pavlov
Intel
Antonia Zhai
University of Minnesota
Mohit Gambhir
Aamer Jaleel
Randy Allmon
Rachid Rayess
Stephen Maresh
Joel Emer
Intel


THE AUTHORS PRESENT TRIGGERED INSTRUCTIONS, A NOVEL CONTROL PARADIGM FOR ARRAYS OF PROCESSING ELEMENTS (PES) AIMED AT EXPLOITING SPATIAL PARALLELISM. TRIGGERED INSTRUCTIONS ELIMINATE THE PROGRAM COUNTER AND ALLOW PROGRAMS TO TRANSITION CONCISELY BETWEEN STATES WITHOUT EXPLICIT BRANCH INSTRUCTIONS. THE APPROACH ALSO ALLOWS EFFICIENT REACTIVITY TO INTER-PE COMMUNICATION TRAFFIC AND PROVIDES A UNIFIED MECHANISM TO AVOID OVERSERIALIZED EXECUTION.


Recently, single-instruction, multiple-data (SIMD) and single-instruction, multiple-thread (SIMT) accelerators such as GPGPUs have been shown to be effective as offload engines when paired with general-purpose CPUs. This results in a complementary approach where the CPU is responsible for running the operating system and irregular programs, and the accelerator executes inner loops of uniform data-parallel code. Unfortunately, not every workload exhibits sufficiently uniform data parallelism to exploit the efficiencies of this pairing. There remain many important workloads whose best-known implementation involves asynchronous actors performing different tasks while frequently communicating with neighboring actors. The computation and communication characteristics of these workloads cause them to map efficiently onto spatially programmed architectures such as field-programmable gate arrays (FPGAs). Furthermore, many important workload domains exhibit such kernels, including signal processing, media codecs, cryptography, compression, pattern matching, and sorting.

As such, one way to boost these workloads' performance efficiency is to add a new spatially programmed accelerator to the system, complementing the existing SIMD/SIMT accelerators. Although FPGAs are very general in their ability to map a workload's computation, control, and communication structure, their datapaths based on lookup tables (LUTs) are deficient in computational density compared to a traditional microprocessor, much less a SIMD engine. Furthermore, FPGAs suffer from a low-level programming model
inherited from logic prototyping that includes unacceptably long compilation times, no support for dynamic context switching, and often inscrutable debugging features.

Tiled arrays of coarse-grained arithmetic logic unit (ALU)-style datapaths can achieve higher computational density than FPGAs.1-3 Several prior works have proposed spatial architectures with a network of ALU-based processing elements (PEs) onto which operations are scheduled in systolic or dataflow order, with limited or no autonomous control at the PE level.4-6 Other approaches incorporate autonomous control at each PE using a program counter (PC).7-9 Unfortunately, as we will show, PC sequencing of ALU operations introduces several inefficiencies when attempting to capture the intra- and inter-ALU control patterns of a frequently communicating spatially programmed fabric. (For more information, see the "Related Work in Instruction-Grained Spatial Architectures" sidebar.)

In this article, we present triggered instructions, a novel control paradigm for ALU-style datapaths for use in arrays of PEs aimed at exploiting spatial parallelism. Triggered instructions remove the program counter completely, letting the PE transition between states of one or more finite-state machines (FSMs) without executing instructions in the datapath to determine the next state. This also lets the PE react quickly to incoming messages on communication channels. In addition, triggered instructions provide a unified mechanism to avoid overserialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading, which require distinct hardware mechanisms in a traditional sequential architecture.

We evaluate the triggered-instruction approach by simulating a spatially programmed accelerator on several workloads spanning a range of algorithm classes not known to exhibit extensive uniform data parallelism. Our analysis shows that such an accelerator can achieve 8 times greater area-normalized performance than a traditional general-purpose processor on this set of workloads. We provide further analysis of the critical paths of workload programs to illustrate how a triggered-instruction architecture contributes to this performance gain.
gain.

Background and motivation

To understand the benefits that triggered instructions can provide to a spatially programmed architecture, we must first understand how spatially programmed architectures in general can play a role in the computational landscape, and why traditional program-counter-based approaches are limited in this context.

Spatial programming architectures

Spatial programming is a paradigm whereby an algorithm's dataflow graph is broken into regions connected by producer-consumer relationships. Input data is then streamed through this pipelined graph. Ideally, the number of operations in each stage is kept small, because performance is usually determined by the rate-limiting step.

Just as vectorizable algorithms see large efficiency boosts when run on a vector engine, workloads that are naturally amenable to spatial programming can see significant boosts when run on an enabling architecture. A traditional processor would execute such programs serially over time, but this does not result in any noticeable efficiency gain, and could even be slower than other expressions of the algorithm. A shared-memory multicore can improve this by mapping different stages onto different cores, but the small number of cores available relative to the large number of stages in the dataflow graph means that each core must multiplex between several stages, so the rate-limiting step generally remains large.

In contrast, a typical spatial-programming architecture is a fabric of hundreds of small PEs connected directly via an on-chip network. Given enough PEs, an algorithm can be taken to the extreme of mapping a single operation in the kernel's dataflow graph to each PE, resulting in a very fine-grained pipeline. In practice, it is desirable to have a small number of local operations in each PE, allowing for a better balance between local control decisions and pipeline parallelism.

To illustrate this, let's explore how a well-known workload can benefit from spatial programming.

Related Work in Instruction-Grained Spatial Architectures

We classify prior work on architectures for programmable accelerators according to the taxonomy shown in Figure A (although some have been proposed as stand-alone processors instead of accelerators complementing a general-purpose CPU). Temporal architectures (class 0 in the taxonomy) are best suited for data-parallel workloads and are outside of this article's scope. Within the spatial domain (classes 1x), the trade-offs between logic-grained architectures (class 10), such as field-programmable gate arrays (FPGAs), and instruction-grained architectures (classes 11x) are well understood.1-3 In this sidebar, we focus on prior work on instruction-grained spatial architectures with centralized and distributed control paradigms.

Centralized processing element control schemes

In the centralized approach (class 110), a fabric of spatial processing elements (PEs) is paired with a centralized control unit. This unit maintains the overall program execution order, managing PE configuration. The results of PE execution could influence the overall flow of control, but in general, the PEs are not making autonomous decisions.

In the Transport-Triggered Architectures scheme, the system's functional units are exposed to the compiler, which then uses MOV operations to explicitly route data through the transport network.4 A global program counter maintains overall control flow. Operation execution is triggered by the arrival of data from the network, but no other localized control exists.

Trips (Tera-op, Reliable, Intelligently adaptive Processing System) is an explicit dataflow graph execution (EDGE) processor that uses many small PEs to execute general-purpose applications.5 Trips dynamically fetches and schedules very long instruction word (VLIW) blocks across the small PEs using centralized program-counter-based control tiles. Although large reservation stations within each PE enable when-ready execution of instructions, only single-bit predication is used within PEs to manage small amounts of control flow.

WaveScalar is a dataflow processor for general-purpose applications that doesn't use a program counter.6 A PE consists of an arithmetic logic unit (ALU), I/O network connections, and a small window of eight instructions. Blocks of instructions called waves are mapped onto the PEs, and additional WaveAdvance instructions are allocated at the edges to help manage coarse-grained or loop-level control. Conditionals are handled by converting control-flow instructions to dataflow, resulting in filtering instructions that conditionally pass values to the next part of the dataflow graph. In WaveScalar, there is no local PE register state; when an instruction issues, the result must be communicated to another PE across the network.

DySER (Dynamically Specialized Execution Resource) integrates a circuit-switched network of ALUs inside the datapath of a contemporary processor pipeline.7 DySER maps a single instruction to each ALU and doesn't allow memory or complex control-flow operations within the ALUs. TIA enables efficient control flow and spatial program mapping across PEs, enabling high utilization of the ALUs within PEs without the need for an explicit control core.

Figure A. A taxonomy of programmable accelerators. Each leaf node represents a distinguishable class of previously proposed architectures. Reading the tree from the root: programmable accelerators are either temporally programmed (class 0: SIMT, SIMD, MIMD) or spatially programmed; spatially programmed designs are either logic grained (class 10: FPGAs) or instruction grained; instruction-grained designs use either centralized control (class 110: dataflow, WaveScalar, DySER) or distributed control; and distributed-control designs are either PC-controlled (class 1110: RAW, PicoChip, PC+RegQ) or non-PC-controlled (class 1111: triggered instructions).

Other recent work, such as Garp,2 Chimaera,8 and ADRES3 (Architecture for Dynamically Reconfigurable Embedded System), similarly integrates lookup-table-based or coarse-grained reconfigurable logic controlled by a host processor, either as a coprocessor or within the processor's datapath.

Matrix is an array of 8-bit function units with a configurable network.1 With different configurations, Matrix can support VLIW, SIMD, or Multiple-SIMD computations. The key feature of the Matrix architecture is its ability to deploy resources for control based on application regularity, throughput requirements, and the space available.

PipeRench is a coarse-grained reconfigurable logic system designed for virtualization of hardware to support high-performance custom computations through self-managed dynamic reconfiguration.9 It is constructed from 8-bit PEs. The functional unit in each PE contains eight three-input lookup tables (LUTs) that are identically configured.

In the dataflow computing paradigm, instructions are dispatched for execution when tokens associated with input sources are ready. Each instruction's execution results in the broadcast of new tokens to dependent instructions. Classical dataflow architectures used this as a centralized control mechanism for spatial fabrics.10,11 However, other projects use token triggering to issue operations in the PEs,5,6 whereas the centralized control unit uses a more serialized approach.

In a dataflow-triggered PE, the microarchitecture manages the token-ready bits associated with input sources. The triggered-instruction approach, in contrast, replaces these bits with a vector of architecturally visible predicate registers. By specifying triggers that span multiple predicates, the programmer can use these bits to indicate data readiness or for other purposes, such as control-flow decisions. In a classic dataflow architecture, multiple pipeline stages are devoted to marshaling tokens,

122

micro
IEEE

IEEE MICRO

Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page

M
q
M
q

M
q

M
q
MQmags
q

THE WORLDS NEWSSTAND

micro
IEEE

Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page

M
q
M
q

M
q

M
q
MQmags
q

THE WORLDS NEWSSTAND

distributing tokens, and scoreboarding which instructions are ready. A Wait-Match pipeline stage must dynamically pair incoming tokens of dual-input instructions. In contrast, the set of predicates to be updated by an instruction in the triggered-instruction approach is encoded in the instruction itself. This reduces scheduler implementation cost and removes the token-related pipeline stages.

Smith et al. extend the classic static dataflow model by allowing
each instruction to be gated on the arrival of a predicate of a desired
polarity.12 This approach adds some control-flow efficiency to the
dataflow model, providing for implicit disjunction of predicates by
allowing multiple predicate-generating instructions to target a single
destination instruction, and implicit conjunction by daisy-chaining
predicate operations. Although this makes conjunctions efficient, it
can lead to an overserialization of the possible execution orders inherent in the original nonpredicated dataflow graph. In contrast, compound conjunctions are explicitly supported in triggered instructions,
allowing for efficient mapping of state transitions that would require
multiple instructions in dataflow predication.

Distributed PE control schemes

In the distributed approach (classes 111x), a fabric of spatial PEs is used without a central control unit. Instead, each PE makes localized control decisions, and overall program-level coordination is established using distributed software synchronization. Within this domain, the PC-based control model (long established for controlling distributed temporal architectures, class 0) is a tempting choice, as demonstrated by a rich body of prior work. By removing the program counter, the triggered-instruction approach (class 1111) offers many opportunities to improve efficiency.

The Raw project is a coarse-grained computation fabric comprising 16 large cores with instruction and data caches that are directly connected through a register-mapped and circuit-switched network.13 Although applications written for RAW are spatially mapped, program-counter management and serial execution of instructions reduce efficiency and make the cores on RAW sensitive to variable latencies, which TIA overcomes using instruction triggers.

The Asynchronous Array of Simple Processors (AsAP) is a 36-PE processor for DSP applications, with each PE executing independently using instructions in a small instruction buffer and communicating using register-mapped network ports.14 Although early research on AsAP avoided the need to poll for ready data, later work extended the original architecture to support 167 PEs and zero-overhead looping to reduce control instructions.15 Triggered instructions not only reduce the amount of control instructions but also enable data-driven instruction issue, overcoming the serialization of AsAP's program-counter-based PE.

Picochip is a commercially available 308-PE accelerator for DSP applications.16 Each PE has a small instruction and data buffer, and communication is performed with explicit put and get commands. One strength of Picochip is its computational density, but the architecture is limited to serial three-way LIW instruction issue using a program counter. Triggered instructions enable control flow at low cost and dynamic instruction issue that is dependent on data arrival, resulting in less instruction overhead.

References

1. E. Mirsky and A. DeHon, "MATRIX: A Reconfigurable Computing Architecture with Configurable Instruction Distribution and Deployable Resources," Proc. IEEE Symp. FPGAs for Custom Computing Machines, 1996, pp. 157-166.
2. J. Hauser and J. Wawrzynek, "Garp: A MIPS Processor with a Reconfigurable Coprocessor," Proc. IEEE Symp. FPGAs for Custom Computing Machines, 1997, pp. 12-21.
3. B. Mei et al., "ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix," Proc. 13th Int'l Conf. Field-Programmable Logic and Applications, 2003, pp. 61-70.
4. J. Hoogerbrugge and H. Corporaal, "Transport-Triggering vs. Operation-Triggering," Compiler Construction, LNCS 786, Springer-Verlag, 1994, pp. 435-449.
5. D. Burger et al., "Scaling to the End of Silicon with EDGE Architectures," Computer, vol. 37, no. 7, 2004, pp. 44-55.
6. S. Swanson et al., "The WaveScalar Architecture," ACM Trans. Computer Systems, vol. 25, no. 2, 2007, pp. 4:1-4:54.
7. V. Govindaraju, C.-H. Ho, and K. Sankaralingam, "Dynamically Specialized Datapaths for Energy Efficient Computing," Proc. 17th Int'l Conf. High Performance Computer Architecture (HPCA), 2011, pp. 503-514.
8. Z.-A. Ye et al., "CHIMAERA: A High-Performance Architecture with a Tightly-Coupled Reconfigurable Functional Unit," Proc. 27th Int'l Symp. Computer Architecture, 2000, pp. 225-235.
9. H. Schmit et al., "PipeRench: A Virtualized Programmable Datapath in 0.18 Micron Technology," Proc. IEEE Custom Integrated Circuits Conf., 2002, pp. 63-66.
10. J.B. Dennis and D.P. Misunas, "A Preliminary Architecture for a Basic Data-Flow Processor," Proc. 2nd Ann. Symp. Computer Architecture, 1975, pp. 126-132.
11. K. Arvind and R.S. Nikhil, "Executing a Program on the MIT Tagged-Token Dataflow Architecture," IEEE Trans. Computers, vol. 39, no. 3, 1990, pp. 300-318.
12. A. Smith et al., "Dataflow Predication," Proc. 39th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2006, pp. 89-102.
13. M. Taylor et al., "The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs," IEEE Micro, vol. 22, no. 2, 2002, pp. 25-35.
14. Z. Yu et al., "An Asynchronous Array of Simple Processors for DSP Applications," Proc. Solid-State Circuits Conf., 2006, pp. 1696-1705.
15. D. Truong et al., "A 167-Processor Computational Platform in 65 nm CMOS," IEEE J. Solid-State Circuits, vol. 44, no. 4, 2009, pp. 1130-1144.
16. G. Panesar et al., "Deterministic Parallel Processing," Int'l J. Parallel Programming, vol. 34, no. 4, 2006, pp. 323-341.

if (incoming > cur)
   send(cur); cur := incoming;
else
   send(incoming);

Figure 1. Example of a spatially programmed sort: an unsorted stream (for example, 5, 83, 32, ...) flows through a chain of PEs, each running the code above and holding in cur the largest value it has seen so far. Although a pedagogical example, this workload demonstrates several interesting properties.

for x = 1..NPASSES   // control loop
   for y = 1..k

if (listA > listB ||
    (listA.finished && !listB.finished))
   send(listB);
else if (!listA.finished)
   send(listA);

Figure 2. Expanding the example to a more realistic spatial merge sort capable of sorting lists of any size: one PE runs the control loop above, while each worker PE merges two sorted sublists using the comparison shown. The large merge radix results in fewer total loads and stores to sort the list, replacing them with more efficient direct PE-to-PE communication.

Consider the simple spatially mapped sorting program shown in Figure 1. In this approach, the worker PEs communicate in a straight pipeline. The unsorted array is streamed in by the first PE. Each PE simply compares the incoming element to the largest element seen so far. The larger of the two values is kept, and the smaller is sent on. Thus, after processing k elements, worker 0 will be holding the largest element, and worker k - 1 the smallest. The sorted result can then be streamed out to memory through the same straight-line communication network.

This example represents a limited toy workload in many ways: it requires k PEs to sort an array of size k, and worker 0 will do k - 1 comparisons while worker k - 1 will do only one (an insertion sort, with a total of k² comparisons). However, despite its naiveté, this workload demonstrates some remarkable properties. First, the system's peak utilization is good: in the final step, all k datapaths can simultaneously execute a comparison. Second, the communication between PEs is local and parallel: on a typical mesh fabric, it's easy to map this workload so that no network contention will ever occur. Finally, and most interestingly, this approach sorts an array of size k with exactly k loads and k stores. The loads and stores that a traditional CPU must use to overcome its relatively small register file are replaced by direct PE-to-PE communication. This reduction in memory operations is critical in understanding the benefits of spatial programming. We characterize the benefits as follows:

• Direct communication uses roughly 20 times lower power than communication through a level-1 (L1) cache, as the overheads of tag matching, load-store queue search, and large data array read are removed.
• Cache coherence overheads, including network traffic and latency, are likewise removed.
• Reduced memory traffic lowers cache pressure, which in turn increases effective memory bandwidth for the remaining traffic.

Finally, it is straightforward to expand our toy example into a realistic merge sort engine that can sort a list of any size (see Figure 2). First, we begin by programming a PE into a small control FSM that handles breaking the array into subarrays of size k and looping over the subarrays. Second, we slightly change the worker PEs' programming so that they are doing a merge of two distinct sorted sublists. With these changes, our toy workload is now a radix-k merge sort capable of sorting a list of size n in n · log_k n loads. Because k can be in the hundreds for a reconfigurable fabric, the benefits can be quite large. In our experiments, we observed 17 times fewer memory operations compared to a general-purpose CPU, and an area-normalized performance improvement of 8.8 times, which is better than the current best-known GPGPU performance.10
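To make the data movement concrete, the following self-contained C sketch simulates the Figure 1 pipeline in software (an illustration only, not the authors' hardware or code): each simulated PE keeps the largest value it has seen and forwards the smaller one, so after all k elements have been injected, PE 0 holds the maximum and PE k - 1 the minimum, and the input array is read exactly k times.

#include <limits.h>
#include <stdio.h>

#define K 8   /* number of simulated PEs, equal to the number of elements sorted */

/* One processing element from Figure 1: keep the running maximum, emit the smaller value. */
static int pe_step(int *cur, int incoming)
{
    if (incoming > *cur) {
        int smaller = *cur;   /* send the old maximum downstream */
        *cur = incoming;      /* keep the new maximum locally    */
        return smaller;
    }
    return incoming;          /* pass the smaller value through  */
}

int main(void)
{
    int input[K] = {5, 83, 32, 12, 14, 7, 61, 2};
    int cur[K];                       /* per-PE state ("cur" in Figure 1) */
    for (int p = 0; p < K; p++)
        cur[p] = INT_MIN;             /* sentinel for "cur undefined"     */

    /* Stream each element through the straight pipeline of PEs. */
    for (int i = 0; i < K; i++) {
        int v = input[i];             /* exactly one load per input element */
        for (int p = 0; p < K; p++)
            v = pe_step(&cur[p], v);  /* the value ripples from PE to PE    */
    }

    /* cur[] now holds the elements in descending order: PE 0 has the maximum. */
    for (int p = 0; p < K; p++)
        printf("%d ", cur[p]);
    printf("\n");
    return 0;
}

The nested loop above serializes in software what the fabric performs as a pipeline, with different elements in flight at different PEs on the same cycle.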


Limitations of PC-based control

The PC-based control model has historically been the best choice for stand-alone CPUs that run arbitrary and irregular programs. Unfortunately, this model introduces unacceptable inefficiencies in the context of spatial programming. To understand these inefficiencies, let us code the merge sort PE shown in Figure 2. We must first address the representation of the queues that pass the sorted sublists between workers. In a multicore system, the typical approach is to use shared memory for the queue buffering, along with sophisticated polling mechanisms such as memory monitors. In a spatially programmed fabric, having hundreds of PEs communicating using shared memory would create unacceptable bandwidth bottlenecks, in addition to increased overheads of pointer chasing, address offset arithmetic, and head and tail occupancy comparisons. Thus, we don't consider shared-memory communication queues in this article.

Instead, let us assume that the instruction-set architecture (ISA) directly exposes data registers and status bits corresponding to direct communication channels between PEs. The ISA must contain a mechanism to query if the input channels are not empty and output channels are not full, to read the first element, and to enqueue and dequeue. Furthermore, we add an architecturally visible tag to the channel that merge sort uses to indicate that the end of a sorted sublist has been reached (EOL). Figure 3 shows an example of the merge sort in this theoretical assembly language. Several inefficiencies are immediately noticeable. First, the worker uses active polling to test the queue status, an obvious power waste. Second, it falls victim to overserialization. For example, if new data on listA arrives before that on listB, there is no opportunity to begin processing the listA-specific part of the code. Finally, the code is branch heavy when compared to that typically found on a traditional core, and some of these branches are hard to predict.

To be fair to this PC-based ISA, we must try to improve the architecture somehow. Table 1 summarizes the techniques that we explore.

One idea to improve queue accesses is to allow destructive reads of input channels.

check_a: beqz   %in0.notEmpty, check_a   // listA
check_b: beqz   %in1.notEmpty, check_b   // listB
check_o: beqz   %out0.notFull, check_o   // outList
         beq    %in0.tag, EOL, a_done
         beq    %in1.tag, EOL, send_a
         cmp.lt %r0, %in0.first, %in1.first
         bnez   %r0, send_a
send_b:  enq    %out0, %in1.first
         deq    %in1
         jump   check_a
send_a:  enq    %out0, %in0.first
         deq    %in0
         jump   check_a
a_done:  beq    %in1.first, EOL, done
         jump   send_b
done:    deq    %in0
         deq    %in1
         return;

Static instructions: 18
Average instructions per iteration: 10
Average branches per iteration: 7

Figure 3. PC+RegQueue instruction-set architecture (ISA) merge sort worker representation using register-mapped queues. First, queue status is tested, then the end-of-list (EOL) condition is evaluated. Finally, the actual data comparison results in either a swap or pass. This results in a poor ratio of control decisions to useful work.

In such an ISA, the instruction's source fields are supplemented with a bit indicating whether a dequeue is desired. This is an important improvement because it reduces both static and dynamic instruction count. Merge sort's implementation on this architecture can remove three instructions compared to Figure 3.

The next idea is to replace the active polling with a select: an indirect jump based on queue status bits. This is a marginal improvement in instruction count but does not help power efficiency. A better idea is to add implicit stalling to the ISA. In this case, queue registers such as %in0 would be treated specially: any instruction that attempts to read or write them would require the issue logic to test the empty or full bits and delay issue until the status becomes correct. Merge sort's implementation on this architecture is the same as in Figure 3, but removes the first three instructions entirely.

Table 1. Adding features to a PC-based ISA to improve efficiency for spatial programming.

Feature       | Description                                                          | Notes
PC (baseline) | PEs use program counters and communicate using shared-memory queues | High latency, bottlenecks
RegQueue      | Expose register-mapped queues to ISA and test via active polling    | Poor power efficiency
FusedDeq      | Destructive read of queue registers without separate instructions   | Good improvement
RegQSelect    | Allow indirect jump based on register-queue status bits             | Minimal improvement
RegQStall     | Issue stalls on queue I/O registers without special instructions    | Bubbles, overserialization
QMultiThread  | Stalling on empty or full queue yields thread                       | Significant additional hardware
Predication   | Predicate registers that can be set using queue status bits         | Boolean expressions don't compose
Augmented     | ISA augmented with all of the above features except QMultiThread    | Used in our evaluations

start:        beq    %in0.tag, EOL, a_done
              beq    %in1.tag, EOL, send_a
              cmp.ge p2, %in0.first, %in1.first
send_b: (p2)  enq    %out0, %in1.first (deq %in1)
send_a: (!p2) enq    %out0, %in0.first (deq %in0)
              jump   start
a_done:       cmp.ne p2, %in1.first, EOL
        (p2)  jump   send_b
              nop    (deq %in0, deq %in1)
              return;

Static instructions: 9
Average instructions per iteration (issued): 6
Average instructions per iteration (committed): 5
Average branches per iteration: 3
Speedup versus PC+RegQueue (see Figure 3): 1.4 times

Figure 4. PC+Augmented ISA merge sort worker takes advantage of the following features: implicit stalls on queue enqueue and dequeue, destructive queue reads, and classical predication. Together, these features reduce the overhead of the program counter, but the ratio of branches to useful work remains high.

Of course, the downside of this is that the ALU will not be used when the PE is stalled. Therefore, the next logical extension is to consider a limited form of multithreading. In this ISA, any read or write of a queue would make the thread eligible to be switched out and replaced with a ready one. This is a promising approach, but we believe that the overheads associated with it (duplication of state resources, additional multiplexing logic, and scheduling fairness) run counter to the fundamental spatial-architecture principle of replicating simple PEs. In other words, the cost-to-benefit ratio of multithreading is unattractive. We reject out-of-order issue for similar reasons.

The final ISA extension we consider is predication. We define a variant of our ISA that can test and set a dedicated set of Boolean predicate registers. Figure 4 shows a reimplementation of the merge sort worker in a language with predication, implicit stalling, and destructive reads, which we name PC+Augmented. Note how little predication improves the control flow of the example. This is because of several limitations:

• Instructions can't read multiple predicate registers at once (inefficient conjunction).
• Composing multiple predicates into more complex Boolean expressions (such as disjunctions) must be done using the ALU itself.
• Jumping between regions requires that the predicate expectations be set correctly. (Note that the branch from a finished list is forced to use p2 with a positive polarity.)
• Predicated false instructions introduce bubbles into the pipeline.

Taken together, these inefficiencies mean that conditional branching remains the most efficient way to express the majority of the code in Figure 4. Although we could continue to try to add features to PC-based schemes in order to improve efficiency, in the rest of this article we demonstrate that taking a different approach altogether can efficiently address these issues while simultaneously removing overserialization and providing the benefits of multithreading.

rule sendA
when listA.first() != EOL && listB.first() != EOL && listA.data < listB.data do
outList.send(listA.first()); listA.deq();
end rule
rule sendB
when listA.first() != EOL && listB.first() != EOL && listA.data >= listB.data do
outList.send(listB.first()); listB.deq();
end rule
rule drainA
when listA.first() != EOL && listB.first() == EOL do
outList.send(listA.first()); listA.deq();
end rule
rule drainB
when listA.first() == EOL && listB.first() != EOL do
outList.send(listB.first()); listB.deq();
end rule
rule bothDone
when listA.first() == EOL && listB.first() == EOL do
listA.deq(); listB.deq();
end rule

Figure 5. Traditional guarded-action merge sort worker algorithm. This paradigm naturally
separates the representation of data transformation (via actions) from the representation of
control flow (via guards). This results in a higher level of code readability, because the control
decisions related to each action are naturally grouped and isolated.


Triggered instructions

A large degree of the inefficiency we have discussed here stems from the issue of efficiently composing Boolean control-flow decisions. To overcome this, we draw inspiration from the historical computing paradigm of guarded actions, a field that has a rich technical heritage including Dijkstra's language of guarded commands,11 Chandy and Misra's Unity,12 and the Bluespec hardware description language.13

Computation in a traditional guarded-action system is described using rules composed of actions (state transitions) and guards (Boolean expressions that describe when a certain action is legal to apply). A scheduler is responsible for evaluating the guards of the actions in the system and posting ready actions for execution, taking into account both inter-action parallelism and available execution resources. Figure 5 illustrates our merge sort worker in traditional guarded-action form. This paradigm naturally separates the representation of data transformation (via actions) from the representation of control flow (via guards). Additionally, the inherent side-effect-free nature of the guards means that they are a good candidate for parallel evaluation by a hardware scheduler.
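As a software analogy (a minimal sketch under the usual rule-based reading of guarded actions, not the Bluespec or hardware formulation), the model can be expressed as a set of guard/action function pairs plus a scheduler loop that repeatedly fires any rule whose guard holds; the rules below paraphrase Figure 5 over two small in-memory queues, with an EOL marker terminating each sublist.

#include <stdbool.h>
#include <stdio.h>

#define EOL  -1                      /* end-of-list marker, as in Figure 5 */
#define QCAP 16

typedef struct { int buf[QCAP]; int head, tail; } queue_t;

static bool empty(const queue_t *q)      { return q->head == q->tail; }
static int  first(const queue_t *q)      { return q->buf[q->head]; }
static void deq(queue_t *q)              { q->head++; }
static void send(queue_t *q, int v)      { q->buf[q->tail++] = v; }

static queue_t listA, listB, outList;    /* zero-initialized: all queues start empty */

/* A rule is a guard (when is the action legal?) plus an action (a state transition). */
typedef struct { bool (*guard)(void); void (*action)(void); } rule_t;

static bool both(void)    { return !empty(&listA) && !empty(&listB); }
static bool g_sendA(void) { return both() && first(&listA) != EOL && first(&listB) != EOL
                                          && first(&listA) <  first(&listB); }
static void a_sendA(void) { send(&outList, first(&listA)); deq(&listA); }
static bool g_sendB(void) { return both() && first(&listA) != EOL && first(&listB) != EOL
                                          && first(&listA) >= first(&listB); }
static void a_sendB(void) { send(&outList, first(&listB)); deq(&listB); }
static bool g_drainA(void){ return both() && first(&listA) != EOL && first(&listB) == EOL; }
static void a_drainA(void){ send(&outList, first(&listA)); deq(&listA); }
static bool g_drainB(void){ return both() && first(&listA) == EOL && first(&listB) != EOL; }
static void a_drainB(void){ send(&outList, first(&listB)); deq(&listB); }
static bool g_done(void)  { return both() && first(&listA) == EOL && first(&listB) == EOL; }
static void a_done(void)  { deq(&listA); deq(&listB); }

int main(void)
{
    const rule_t rules[] = {
        {g_sendA, a_sendA}, {g_sendB, a_sendB},
        {g_drainA, a_drainA}, {g_drainB, a_drainB}, {g_done, a_done},
    };
    int a[] = {3, 9, 14, EOL}, b[] = {1, 10, 12, EOL};
    for (int i = 0; i < 4; i++) { send(&listA, a[i]); send(&listB, b[i]); }

    /* Scheduler: evaluate every guard, fire one ready rule, repeat until nothing fires. */
    for (bool fired = true; fired; ) {
        fired = false;
        for (unsigned r = 0; r < sizeof rules / sizeof rules[0]; r++)
            if (rules[r].guard()) { rules[r].action(); fired = true; break; }
    }

    while (!empty(&outList)) { printf("%d ", first(&outList)); deq(&outList); }
    printf("\n");                    /* prints the merged list: 1 3 9 10 12 14 */
    return 0;
}

The key property the article relies on is that the guards are side-effect free, so a hardware scheduler can evaluate all of them at once rather than scanning them sequentially as this loop does.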

Triggered-instruction architecture

A triggered-instruction architecture (TIA) applies this concept directly to controlling the scheduling of operations on a PE's datapath at an instruction-level granularity. In the historical guarded-action programming paradigm, arbitrary Boolean expressions are allowed in the guard, and arbitrary data transformations can be described in the action. To adapt this concept into an implementable ISA, both must be bounded in complexity.
.............................................................

MAY/JUNE 2014

micro
IEEE

127

Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page

M
q
M
q

M
q

M
q
MQmags
q

THE WORLDS NEWSSTAND

micro
IEEE

Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page

M
q
M
q

M
q

M
q
MQmags
q

THE WORLDS NEWSSTAND

..............................................................................................................................................................................................
TOP PICKS

Furthermore, the scheduler must have the potential for efficient implementation in hardware. To this end, we define a limited set of operations and state updates that can be performed by the datapath (instructions) and a limited language of Boolean expressions (triggers) built from several possible queries on a PE's architectural state.

The architectural state of our proposed TIA PE is composed of four elements:

• A set of data registers (read/write)
• A set of predicate registers (read/write)
• A set of input-channel head elements (read only)
• A set of output-channel tail elements (write only)

Each channel has three components: data, a tag, and a status predicate that reflects whether an input channel is empty or an output channel is full. Tags do not have any special semantic meaning; the programmer can use them in many ways.

A trigger is a programmer-specified Boolean expression formed from the logical conjunction of a set of queries on the PE's architectural state. (Although the architecture natively allows only conjunctions in trigger expressions, disjunctions can be emulated by creating a separate triggered instruction for each disjunctive term.) A hardware scheduler evaluates triggers. The set of allowable trigger query functions is carefully chosen to maintain scheduler efficiency while allowing for much generality in the useful expressions. The query functions include the following:


• Predicate register values (optionally negated): A trigger can specify a requirement for one or more predicate registers to be either true or false (for example, p0 && !p1 && p7).
• I/O channel status (implicit): The scheduler implicitly adds the not-empty status bits for each operand input channel to the trigger for an instruction. Similarly, a not-full check is implicitly added to each output channel that an instruction attempts to write. The programmer doesn't have to worry about these conditions, but must understand while writing code that the hardware will check them. This facilitates convenient, fine-grained producer/consumer interaction.
• Tag comparisons against input channels: A trigger might specify a value that an input channel's tag must match (such as in0.tag == EOL).
An instruction represents a set of data and predicate computations on operands drawn from the architectural state. Instructions selected by the scheduler are executed on the PE's datapath. An instruction has the following read, computation, and write capabilities (a possible encoding is sketched after this list):

• An instruction can read a number of operands, each of which can be data at the head of an input channel, a data register, or the vector of predicate registers.
• An instruction can perform a data computation using one of the standard functions provided by the datapath's ALU. It can also generate one or more predicate values that are either constants (true/false) or derived from the ALU result via a limited set of datapath-supported functions, such as reduction AND, OR, and XOR operations, bit extractions, and ALU flags such as overflow.
• An instruction can write the data result and the derived predicate result into a set of destinations within the PE's architectural state. Data results can be written into the tail of an output channel, a data register, or the vector of predicate registers. Predicate results can be written into one or more predicate registers.
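One way to picture how such an instruction might be packed (an illustrative data layout with assumed field names and widths, not the actual TIA encoding) is as a pair of predicate match vectors for the trigger plus a conventional operation field, sized for the 8 predicate registers and 2 sources per instruction used in the evaluation later in this article.

#include <stdint.h>
#include <stdio.h>

/* Illustrative triggered-instruction encoding; every field name and width here is
   an assumption made for exposition, not the published TIA format. */
typedef struct {
    /* Trigger: the instruction may fire only when
       (predicates & pred_care) == pred_match, every read input channel is
       non-empty, every written output channel is non-full, and any requested
       tag comparisons on the input channels succeed. */
    uint8_t pred_care;      /* which of the 8 predicate registers are queried      */
    uint8_t pred_match;     /* required true/false value for each queried bit      */
    uint8_t tag_check_mask; /* input channels whose tag should be compared         */
    uint8_t tag_value;      /* the tag to compare against (for example, EOL)       */

    /* Instruction: the datapath operation and its state updates. */
    uint8_t opcode;         /* ALU function: add, compare, enqueue, and so on      */
    uint8_t src[2];         /* operand selectors: channel head, data register,
                               or the predicate vector                             */
    uint8_t dst;            /* destination: channel tail, data register, or preds  */
    uint8_t deq_mask;       /* input channels to dequeue when the instruction fires */
    uint8_t pred_write;     /* predicate registers updated with constants or with
                               a value derived from the ALU result                 */
} triggered_insn_t;

int main(void)
{
    /* The whole trigger-plus-instruction pair stays small enough to store a
       pool of them per PE and evaluate their triggers in parallel.             */
    printf("illustrative encoding: %zu bytes per triggered instruction\n",
           sizeof(triggered_insn_t));
    return 0;
}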
Figure 6 shows our merge sort expressed using triggered instructions. Note the density of the trigger control decisions: each trigger reads at least two explicit Boolean predicates. Additionally, conditions for the queues being notEmpty or notFull are recognized implicitly. Only the comparison between the actual multibit queue data values is done using the ALU datapath, as represented by the doCheck instruction. Predicate p0 indicates that the check has been performed, whereas p1 holds the result of the comparison. Note also the lack of overserialization.

Only the explicitly programmer-managed sequencing using p0 is present.

Figure 7 shows an example TIA PE. The PE is preconfigured with a static set of instructions. The triggers for these instructions are then continuously evaluated by a dedicated hardware scheduler that dispatches legal instructions to the datapath for execution. At any given scheduling step, the trigger for zero, one, or more instructions can evaluate to true. The guarded-action model, and by extension our triggered-instruction model, allows all such instructions to fire in parallel, subject to datapath resource constraints and conflicts.

Figure 8 shows the TIA hardware scheduler's high-level microarchitecture. The scheduler uses standard combinatorial logic to evaluate the programmer-specified query functions for each trigger on the basis of values in the architectural state elements. This yields a set of instructions that are eligible for execution, among which the scheduler selects one or more depending on the datapath resources available. The example in this figure illustrates a scalar datapath that can only fire one instruction per cycle; therefore, the scheduler selects one out of the available set of ready-to-fire instructions using a priority encoder.
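The following C fragment mimics those two steps in software (a behavioral sketch only; the real scheduler is combinational logic evaluated every cycle, not a loop): trigger resolution reduces each trigger to one ready bit, and a priority encoder picks the lowest-numbered ready instruction. The trigger representation reuses the assumed predicate-mask idea from the earlier sketch.

#include <stdint.h>
#include <stdio.h>

#define NUM_INSNS 16   /* maximum triggered instructions per PE, as in the evaluation */

/* Assumed trigger representation: predicate query plus required channel-status bits. */
typedef struct {
    uint8_t  pred_care, pred_match;   /* fire when (preds & care) == match            */
    uint16_t chan_required;           /* input not-empty / output not-full bits needed */
} trigger_t;

/* Trigger resolution: one ready bit per instruction (a tree of AND gates in hardware). */
static uint16_t resolve(const trigger_t trig[], uint8_t preds, uint16_t chan_status)
{
    uint16_t ready = 0;
    for (int i = 0; i < NUM_INSNS; i++) {
        int pred_ok = (preds & trig[i].pred_care) == trig[i].pred_match;
        int chan_ok = (chan_status & trig[i].chan_required) == trig[i].chan_required;
        if (pred_ok && chan_ok)
            ready |= (uint16_t)(1u << i);
    }
    return ready;
}

/* Priority encoder: lowest-numbered ready instruction, or -1 if none is ready. */
static int pick(uint16_t ready)
{
    for (int i = 0; i < NUM_INSNS; i++)
        if (ready & (1u << i))
            return i;
    return -1;
}

int main(void)
{
    trigger_t trig[NUM_INSNS];
    for (int i = 0; i < NUM_INSNS; i++)   /* unused slots get requirements not met here */
        trig[i] = (trigger_t){ .pred_care = 0xFF, .pred_match = 0xAA, .chan_required = 0xFFFF };

    /* Slot 0: fire when p0 == 0 and input channel 0 has data (status bit 0).          */
    trig[0] = (trigger_t){ .pred_care = 0x01, .pred_match = 0x00, .chan_required = 0x0001 };
    /* Slot 1: fire when p0 == 1 and p1 == 1 and the output channel has room (bit 8).  */
    trig[1] = (trigger_t){ .pred_care = 0x03, .pred_match = 0x03, .chan_required = 0x0100 };

    uint8_t  preds = 0x03;        /* p0 = 1, p1 = 1                   */
    uint16_t chans = 0x0100;      /* only the output channel is ready */
    printf("scheduler fires instruction %d\n", pick(resolve(trig, preds, chans)));  /* 1 */
    return 0;
}

Because the ready vector can change only when a predicate or a channel-status bit changes, the combinational version of this logic switches, and burns dynamic power, only on such local state changes, which is the observation the article makes below about scheduler power.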

Observations about the triggered model

Having defined the basic structure of a TIA, we can now make some key observations.

A TIA PE doesn't have a program counter or any notion of a static sequence of instructions. Instead, there is a limited pool of triggered instructions that are constantly bidding for execution on the datapath. This fits very naturally into a spatial programming model where each PE is statically configured with a small pool of instructions instead of streaming in a sequence of instructions from an instruction cache.

There are no branch or jump instructions in the triggered ISA: every instruction in the pool is eligible for execution if its trigger conditions are met. Thus, every triggered instruction can be viewed as a multiway branch into a few possible states in an FSM.

doCheck:
when (!p0 && %in0.tag != EOL
&& %in1.tag != EOL) do
cmp.ge p1, %in0.data, %in1.data (p0 := 1)
sendA:
when (p0 && p1) do
enq %out0, %in0.data (deq %in0, p0 := 0)
sendB:
when (p0 && !p1) do
enq %out0, %in1.data (deq %in1, p0 := 0)
drainA:
when (%in0.tag != EOL && %in1.tag == EOL) do
enq %out0, %in0.data (deq %in0)
drainB:
when (%in0.tag == EOL && %in1.tag != EOL) do
enq %out0, %in1.data (deq %in1)
bothDone:
when (%in0.tag == EOL && %in1.tag == EOL) do
nop (deq %in0, deq %in1)
Static instructions: 6
Average instructions per iteration: 2
Speedup versus PC+RegQueue (see Figure 3): 5 times
Speedup versus PC+Augmented (see Figure 4): 3 times

Figure 6. The triggered instruction merge sort worker retains the clean
separation of control and data transformation of the generalized guarded
action version shown in Figure 5. The restriction is that the control decisions
must be stored in single-bit predicate registers, and the action is limited to
the granularity of one instruction. As a result, the sendA and sendB rules
are refactored such that the comparison takes place in the earlier doCheck
rule, which sets up predicate register p1 with the result of the comparison.

With clever use of predicate registers, a TIA can be made to emulate the behavior of other control paradigms. For example, a sequential architecture can be emulated by setting up a vector of predicate registers to represent the current state in a sequence, essentially a program counter. Predicate registers can also be used to emulate classic predication modes, branch delay slots, and speculative execution. Triggered instructions are a superset of many traditional control paradigms. The costs of this generality are scheduler area and timing complexity, which impose a restriction on the number of triggers (and thus, the number of instructions) that the hardware can monitor at all times.
.............................................................

MAY/JUNE 2014

micro
IEEE

M
q
M
q

M
q

M
q
MQmags
q

129

Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page

M
q
M
q

M
q

M
q
MQmags
q

THE WORLDS NEWSSTAND

micro
IEEE

Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page

M
q
M
q

M
q

M
q
MQmags
q

THE WORLDS NEWSSTAND

..............................................................................................................................................................................................
TOP PICKS

Figure 7. A PE based on our triggered-instruction architecture (TIA). The PE is preconfigured with a static set of instructions. (The block diagram shows input links feeding an input switch and a set of tagged input channels; data registers Reg 0-Reg 3 and predicate registers P0-P3; a scheduler that evaluates the instruction triggers against channel empty/full status, channel tags, and the predicates; operand-select logic feeding the ALU; predicate-update and data-update paths; and tagged output channels driving an output switch and output links.)

Although this restriction would be crippling for a temporally programmed architecture, it is reasonable in a spatially programmed framework because of the low number of instructions typically mapped to a pipeline stage in a spatial workload.

The hardware scheduler is built from combinatorial logic: it is simply a tree of AND gates. Thus, only the state equations that require reevaluation will cause the corresponding wires in the scheduler logic to swing and consume dynamic power. In the absence of channel activity or internal state changes, the scheduler doesn't consume any dynamic power whatsoever. The same control equations would have been evaluated using a chain of branches in a PC-based architecture.

Evaluation of workloads

In this section, we quantitatively demonstrate both the applicability of the spatial-programming approach to a set of workloads, and the efficiency that a triggered-instruction architecture provides within the spatial domain.

Approach

Our evaluation fabric is a scalable spatial architecture built from an array of TIA PEs organized into blocks, which form the granularity of replication of the fabric. Each block contains a grid of interconnected PEs, a set of scratchpad slices distributed across the block, a private L1 cache, and a slice of a shared L2 cache that scales with the number of blocks on the fabric. Figure 9 provides an illustration of a block and the parameters used in our evaluation. Each PE has the following architectural parameters:

• Datapath: 32 bits
• Sources per instruction: 2
• Registers: 8
• Predicates: 8
• Maximum triggered instructions: 16

Figure 8. Microarchitecture of a TIA scheduler. The Trigger Resolution stage is implemented as combinational logic. This is a low-power approach because only local state updates and I/O channel activity consume dynamic power. (The diagram shows the stored trigger/instruction pairs feeding a trigger-resolution stage driven by the predicate registers, channel status, and tags; the resulting instruction-ready signals pass through a priority encoder that selects one triggered instruction to execute on the datapath, which in turn produces predicate updates.)

We obtained area estimates of each PE via the implementation feasibility analysis discussed in detail in our paper for the 2013 International Symposium on Computer Architecture.14 Area estimates for the caches, register files, multipliers, and on-chip network were added using existing industry results. As a reference, 12 blocks (each including PEs, caches, and so on) are about the same size as a single core of an Intel Core i7-2600 processor (including L1 and L2 caches), normalized to the same technology node.
We developed a detailed cycle-accurate
performance model of our spatial accelerator
using Asim, an established performance
modeling infrastructure.15 We model the
detailed microarchitecture of each TIA PE in
the array, the mesh interconnection network,
the L1 and L2 caches, and the DRAM.
We evaluate our spatial fabric on application kernels from several domains. We do this under the assumption that the workloads' computationally intensive portions will be


[Figure 9. Data used in our evaluation: block illustration (a) and parameters (b). Each block contains a grid of interconnected PEs, a set of scratchpad slices distributed across the block, a private L1 cache, and a slice of a shared L2 cache that scales with the number of blocks on the fabric. Parameters: PEs: 32; network: mesh (one-cycle link latency); scratchpad: 8 Kbytes (distributed); L1 cache: 4 Kbytes (four banks, 1 Kbyte per bank); L2 cache: 24-Kbyte shared slice; DRAM: 200-cycle latency; estimated clock rate: 2 GHz.]

offloaded from the main processor, which will handle peripheral tasks like setting up the memory and handling rare, but slow, cases.

Our quantitative evaluation has two objectives:



• to demonstrate the effectiveness of a TIA-based spatial architecture compared to a traditional high-performance sequential architecture, and
• to demonstrate the benefits of using TIA-based PEs in a spatial architecture compared to PC-based PEs using the PC+RegQueue and PC+Augmented architectures.

For the first objective, we present performance numbers area-normalized against a typical host processor, namely a single 3.4-GHz out-of-order superscalar Intel Core i7-2600 core. As a baseline, we used sequential software implementations running on the host processor. When possible, we chose existing optimized workload implementations. In other cases, we auto-vectorized the workload using the Intel C/C++ compiler (icc) version 13.0, enabling processor-specific ISA extensions.

For the second objective, we analyze how much of the overall speedup benefit is attributable to triggered instructions (as opposed to spatial programming in general) using the same framework described earlier. We demonstrate this by examining the critical loops that form the rate-limiting steps in the spatial pipeline of our workloads. We implemented the loops on spatial accelerators using the traditional program-counter-based approaches. This analysis demonstrates how frequently the advantage of the triggered-instruction control idiom translates to practical improvements.

For our analysis, we chose workloads spanning data parallelism, pipeline parallelism, and graph parallelism. Table 2 presents an overview of the chosen kernels. We implemented the triggered-instruction versions of these kernels directly in our PEs' assembly language and hand-mapped them spatially across our fabric. (In the future, we expect this to be done by automated tools from higher-level source code.)

Performance results
Figure 10 demonstrates the magnitude of performance improvement that can be achieved using a spatially programmed accelerator. Across our workloads, we observe area-normalized speedup ratios ranging from 3 times (fast Fourier transform) to about 22 times (SHA-256) compared to the traditional core's performance, with a geometric mean of 8 times.

Now let's analyze how much of this benefit is attributable to the use of triggered instructions by comparing the rate-limiting inner loops of our workloads to implementations on spatial architectures using the PC+RegQueue and PC+Augmented control schemes.

Table 3 shows the average frequency of branches in the dynamic instruction stream for the PC-based spatial architectures. The branch frequency ranges from 8 to 70 percent, with an average of 50 percent. These inner loops are all branchy and dynamic, far more so than traditional sequential code.

This dynamism manifests itself as additional control cycles for both PC-based architectures. Figure 11 shows the dynamic execution cycles for all architectures, broken down into cycles spent on operations in relevant categories. The cycle counts are all normalized to the number of data computation operations (D.ops) executed by PC+RegQueue.

Table 2. Target workloads for evaluation.

Workload | Berkeley Dwarf16 | Domain | Comparison software implementations
Advanced Encryption Standard with cipher-block chaining (AES-CBC) | Combinational logic | Cryptography | Intel reference using AES-ISA extensions
Knuth-Morris-Pratt (KMP) string search | Finite state machines | Various | Nonpublic optimized implementation
Dense matrix multiply (DMM) | Dense linear algebra | Scientific computing | Intel Math Kernel Library (MKL) implementation17
Fast Fourier transform (FFT) | Spectral methods | Signal processing | FFT-W with auto-vectorization
Graph500-BFS | Graph traversal | Supercomputing | Nonpublic optimized implementation
k-means clustering | Dense linear algebra | Data mining | MineBench implementation with auto-vectorization
Merge sort | Map/reduce | Databases | Nonpublic optimized implementation
Flow classifier | Finite state machines | Networking | Nonpublic optimized implementation
SHA-256 | Combinational logic | Cryptography | Intel reference (x86 assembly)

[Figure 10. Area-normalized performance ratio of a TIA-based spatial accelerator compared to a high-performance out-of-order core, for each workload and the geometric mean. Area-normalized speedup ratios range from 3 times to about 22 times compared to the traditional core's performance.]



We augment this data with Figures 12 and 13, which respectively show the static and dynamic (average) instruction counts in the inner loops of the rate-limiting steps for each workload. The data in these figures demonstrates that the triggered-instruction approach has measurable benefits over program counters in real-world kernels.

First, TIA demonstrates a significant reduction in dynamic instructions executed compared to both PC+RegQueue (64 percent) and PC+Augmented (28 percent) on average, and an average performance improvement of 2.0 times versus PC+RegQueue and 1.3 times versus PC+Augmented in the critical loops. A large part of the performance gained by PC+Augmented over PC+RegQueue comes from the reduction in Queue Management operations. TIA benefits from this too, but gets a further performance boost over PC+Augmented from a reduction in Control operations and Predicated-False operations.

Second, an additional benefit of TIA over PC+Augmented comes from a reduction in wait cycles. This is most evident in the k-means (50 percent), Graph500 (100 percent), and SHA-256 (40 percent) workloads. This is because of the ability of triggered instructions to avoid unnecessary serialization. Because these are critical rate-limiting loops in the spatial pipeline, there are fewer opportunities for multiplexing unrelated work onto shared PEs. Despite this, the workloads show benefits from avoiding overserialization.


Table 3. Percentage of dynamic instructions that are branches in the rate-limiting step inner loop.

Control scheme | AES | DMM | FFT | Flow classifier | Graph-500 | k-means | KMP search | Merge sort | SHA-256
PC+RegQ | 58 | 50 | 36 | 50 | 50 | 69 | 70 | 63 | 50
PC+Aug | 33 | 11 | 50 | 40 | 29 | 14 | 50 | 22 | 28

[Figure 11. Breakdown of dynamic execution cycles in rate-limiting inner loops, normalized to data computation operations (D.ops) executed by PC+RegQueue, with cycles categorized as Q.ops, F.ops, C.ops, Wait, and D.ops for PC+RegQ, PC+Aug, and TIA on each workload and the mean. This demonstrates the ability of triggered instructions to reduce queuing, control, and predicated-false operations, and wait cycles arising from over-serialization.]

[Figure 12. Static instruction counts for rate-limiting inner loops, for PC+RegQ, PC+Augmented, and TIA on each workload and the mean. See our previous work14 for an analysis of why triggered instructions can never result in an increase in instruction count compared to PC-based approaches.]


Third, the workload that sees the largest benefit from triggered instructions is Merge Sort. Merge Sort has the highest dynamic branch rate (70 percent) of all workloads on the PC+RegQueue architecture. It also spends several cycles polling queues. PC+Augmented eliminates all the queue-polling cycles, resulting in a 1.6-times performance improvement in the rate-limiting step. TIA further cuts down a large number of control cycles, leading to a further 2.3-times performance improvement versus PC+Augmented and a cumulative 3.7-times performance benefit over PC+RegQueue.

Fourth, on average, PC+Augmented does not see a significant benefit from predicated execution for these spatially programmed workloads.

Finally, triggered instructions use a substantially smaller static instruction footprint. The reduction in footprint compared to PC+RegQueue is particularly significant: 62 percent on average. PC+Augmented's enhancements help reduce footprint, but TIA still has 30 percent fewer static instructions on average.


[Figure 13. Average dynamic instruction counts per iteration for rate-limiting inner loops, split into control and non-control instructions for PC+RegQ, PC+Aug, and TIA on each workload and the mean. In this context, removal of instructions can directly translate into workload speedup.]

The static code footprint of these rate-limiting inner loops is, in general, fairly small across all architectures. This observation, along with the real-world performance benefits we observed versus traditional high-performance architectures, provides strong evidence of the viability and effectiveness of the spatial-programming model with small, tight loops arranged in a pipelined graph.

Our results provide a solid foundation of evidence for the merit of a triggered-instruction-based spatial architecture. The ultimate success of this paradigm will be premised on overcoming several challenges, including providing a tractable memory model, dealing with the finite size of the spatial array, and providing a high-level programming and debugging environment. Our ongoing work makes us optimistic that these challenges are surmountable. MICRO

....................................................................
References

1. E. Mirsky and A. DeHon, "MATRIX: A Reconfigurable Computing Architecture with Configurable Instruction Distribution and Deployable Resources," Proc. IEEE Symp. FPGAs for Custom Computing Machines, 1996, pp. 157-166.
2. J. Hauser and J. Wawrzynek, "Garp: A MIPS Processor with a Reconfigurable Coprocessor," Proc. IEEE Symp. FPGAs for Custom Computing Machines, 1997, pp. 12-21.
3. B. Mei et al., "ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix," Proc. 13th Int'l Conf. Field-Programmable Logic and Applications, 2003, pp. 61-70.
4. D. Burger et al., "Scaling to the End of Silicon with EDGE Architectures," Computer, vol. 37, no. 7, 2004, pp. 44-55.
5. V. Govindaraju, C.-H. Ho, and K. Sankaralingam, "Dynamically Specialized Datapaths for Energy Efficient Computing," Proc. 17th Int'l Conf. High Performance Computer Architecture (HPCA), 2011, pp. 503-514.
6. S. Swanson et al., "The WaveScalar Architecture," ACM Trans. Computer Systems, vol. 25, no. 2, 2007, pp. 4:1-4:54.
7. M. Taylor et al., "The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs," IEEE Micro, vol. 22, no. 2, 2002, pp. 25-35.
8. Z. Yu et al., "An Asynchronous Array of Simple Processors for DSP Applications," Proc. Solid-State Circuits Conf., 2006, pp. 1696-1705.
9. G. Panesar et al., "Deterministic Parallel Processing," Int'l J. Parallel Programming, vol. 34, no. 4, 2006, pp. 323-341.


10. D.G. Merrill and A.S. Grimshaw, "Revisiting Sorting for GPGPU Stream Architectures," Proc. 19th Int'l Conf. Parallel Architectures and Compilation Techniques, 2010, pp. 545-546.
11. E.W. Dijkstra, "Guarded Commands, Nondeterminacy and Formal Derivation of Programs," Comm. ACM, vol. 18, no. 8, 1975, pp. 453-457.
12. K.M. Chandy and J. Misra, Parallel Program Design: A Foundation, Addison-Wesley, 1988.
13. Bluespec, Bluespec SystemVerilog Reference Guide, 2007.
14. A. Parashar et al., "Triggered Instructions: A Control Paradigm for Spatially-Programmed Architectures," Proc. Int'l Symp. Computer Architecture, 2013, pp. 142-153.


15. J. Emer et al., "Asim: A Performance Model Framework," Computer, vol. 35, no. 2, 2002, pp. 68-76.
16. K. Asanovic et al., The Landscape of Parallel Computing Research: A View from Berkeley, tech. report UCB/EECS-2006-183, Electrical Eng. and Computer Science Dept., Univ. of California, Berkeley, Dec. 2006.
17. R.A. van de Geijn and J. Watts, SUMMA: Scalable Universal Matrix Multiplication Algorithm, tech. report TR-95-13, Dept. of Computer Sciences, Univ. of Texas at Austin, 1995.

Angshuman Parashar is an architecture


research engineer in the VSSAD group at
Intel. His research interests include spatial
architectures, hardware/software interfaces,
and quantitative evaluation of computer systems. Parashar has a PhD in computer science and engineering from the Pennsylvania
State University.
Michael Pellauer is an architecture research
engineer in the VSSAD group at Intel. His
research interests include spatial architectures and high-level hardware description
languages. Pellauer has a PhD in computer
science from the Massachusetts Institute of
Technology.


Michael Adler is a principal engineer in the VSSAD group at Intel. His research interests include building flexible microarchitecture timing models of large systems using FPGAs and building OS-like services to simplify FPGA programming. Adler has a BA in philosophy from the University of Pennsylvania.

Bushra Ahsan is a component design engineer at Intel. Her research focuses on memory systems architecture design and workloads for spatial architectures. Ahsan has a PhD in electrical and computer engineering from the City University of New York.

Neal Crago is an architecture research engineer in the VSSAD group at Intel. His research interests include spatial and energy-efficient architectures. Crago has a PhD in computer engineering from the University of Illinois at Urbana-Champaign.
Daniel Lustig is a PhD candidate in the


Department of Electrical Engineering at
Princeton University. His research focuses
on the design and verification of memory
systems for heterogeneous computing platforms. Lustig has an MA in electrical engineering from Princeton University.
Vladimir Pavlov is a senior software engineer at Intel. His research focuses on programming and exploration tools for novel
programmable accelerators, such as application-specific instruction set processors and
spatial architectures. Pavlov has an MS from
the State University of Aerospace Instrumentation, Saint Petersburg, Russia.
Antonia Zhai is an associate professor in
the Department of Computer Science and
Engineering at the University of Minnesota.
Her research focuses on developing novel
compiler optimizations and architecture features to improve both performance and
nonperformance features, such as programmability, security, testability, and reliability. Zhai has a PhD in computer science
from Carnegie Mellon University.
Mohit Gambhir is an architecture modeling engineer in the VSSAD group at Intel.
His research interests include modeling and
simulation, performance analysis, and SoC
architectures. Gambhir has an MS in


computer science from North Carolina State University.
Aamer Jaleel is a principal engineer in the
VSSAD group at Intel. His research interests
include memory system optimizations, application scheduling, and performance modeling.
Jaleel has a PhD in electrical engineering from
the University of Maryland, College Park.
Randy Allmon is a senior principal engineer in the VSSAD group at Intel. His research interests include low-power, high-performance circuit and layout design and soft-error mitigation research. Allmon has a BS in electrical engineering from the University of Cincinnati.
Rachid Rayess is a silicon architecture engineer in the MMDC group at Intel. His
research focuses on memory architecture
and memory design automation. Rayess has
an MS in electrical engineering from North
Carolina State University.

Stephen Maresh is a performance modeling


engineer at Intel. His research focuses on
fabrics, spatial architectures, cache circuit
design, and microprocessor design integration. Maresh has an MS in electrical and
computer engineering from Northeastern
University.
Joel Emer is an Intel Fellow and director of
microarchitecture research at Intel, where he
leads the VSSAD group. He is also a professor of the practice at the Massachusetts
Institute of Technology. His research interests include spatial architectures, performance modeling, and memory hierarchies.
Emer has a PhD in electrical engineering
from the University of Illinois. He is a Fellow of IEEE.
Direct questions and comments about this
article to Michael Pellauer, Intel, 77 Reed
Road, MS HD2-330, Hudson, MA 01749;
michael.i.pellauer@intel.com.

DENOVOND: EFFICIENT HARDWARE FOR DISCIPLINED NONDETERMINISM


THE DENOVOND SYSTEM PROVIDES EFFICIENT HARDWARE SUPPORT FOR DISCIPLINED


NONDETERMINISTIC CODES WITH LOCKS WHILE RETAINING THE SIMPLICITY,
PERFORMANCE, AND ENERGY BENEFITS PREVIOUSLY ACHIEVED ONLY FOR DETERMINISTIC
CODES. THE AUTHORS DESIGNED AND IMPLEMENTED SIMPLE MEMORY CONSISTENCY
SEMANTICS FOR SAFE NONDETERMINISM USING DISTRIBUTED QUEUE-BASED LOCKS AND
ACCESS SIGNATURES. THE RESULTING PROTOCOL AVOIDS TRANSIENT STATES,
INVALIDATION TRAFFIC, DIRECTORY SHARER-LISTS, AND FALSE SHARING.


Hyojin Sung
Rakesh Komuravelli
Sarita V. Adve
University of Illinois at
Urbana-Champaign


Shared memory remains the most widely used model for both multicore hardware and software. Cache coherence and simple consistency models are key to supporting shared memory; however, providing energy-efficient, scalable hardware coherence is becoming a significant challenge. A recent surge in research has addressed this challenge, encompassing at least three points of view. The first is based on the conviction that purely hardware-based approaches that build on current protocols will be adequate to meet the complexity, performance, and energy-efficiency challenges of future systems.1-3 While this approach would be least disruptive to the existing software base, it is unclear whether it will succeed.4-6 The second point of view places the coherence burden entirely on software; Intel's SCC platform is an example.4 The third, embodied in our research, takes a more unconventional middle ground: a combined hardware-software approach.

Our system, DeNovoND, makes critical progress toward a codesign vision originally proposed by B. Choi et al. in the DeNovo system.7 DeNovo clearly established the potential for a combined hardware-software approach, but it is restricted to deterministic programs. Although determinism is considered desirable for many application classes, many common codes contain nondeterministic parts, most commonly through lock synchronization. For example, 21 out of the 25 PARSEC and SPLASH-2 benchmarks contain locks. DeNovo cannot run such codes. For industry to exploit the benefits of DeNovo, we must develop techniques to support nondeterministic codes with at least as much performance as conventional systems, without losing the benefits of DeNovo. DeNovoND takes a significant step toward achieving the DeNovo vision by providing support for programs with disciplined nondeterminism.

DeNovoND's key insight is to separate concerns. It deals with synchronization (racy) accesses and data (nonracy) accesses separately; and among data accesses, it distinguishes between deterministic and nondeterministic accesses. The distinction


between the different accesses comes from knowledge provided by the disciplined software, and each access category is implemented using an appropriately judicious combination of hardware and software. Using this insight, DeNovoND significantly broadens the application space that the DeNovo approach can support, and additionally charts a path for our future work to support more unstructured (and eventually legacy) programs with DeNovo.

Background
DeNovo is based on the observation that although the global address space offered by shared memory is attractive to programmers, current wild shared-memory programming environments that allow data races, ubiquitous nondeterminism, unstructured parallelism, and complex consistency models make programming, testing, and maintaining software difficult.8 This has led to much recent software research on more disciplined shared-memory programming models.9 The data-race-free model adopted for C, C++, and Java is one successful initial example of discipline, but more work is required to address all of the concerns just mentioned.8

The DeNovo project asks this question: If software becomes more disciplined, can we rethink the multicore memory hierarchy to provide more complexity-, performance-, and energy-efficient hardware than the current state of the art? The prior DeNovo work7 addressed this question for deterministic programs that contain annotations motivated by the Deterministic Parallel Java (DPJ) language.10 DeNovo proposed a coherence protocol for such programs that has no transient states, no invalidation message traffic, no sharer lists in directories, and no false sharing. Overall, compared to state-of-the-art MESI (modified, exclusive, shared, invalid) cache coherence protocols, DeNovo is much simpler and easier to verify and extend, performs comparably or better, and is more energy efficient (since it reduces cache misses and network traffic) for a range of deterministic codes.

Figure 1 shows how DeNovo and DeNovoND can simplify coherence activities and reduce network traffic for deterministic and nondeterministic accesses, compared to the conventional hardware MESI coherence protocol, as further elaborated in the rest of this article.

Software assumptions
There is much recent research on disciplined shared-memory programming models
with explicit and structured parallelism, synchronization, and communication. Although
the details vary, they all share the goal of
making parallel programming easier and
safer, and they expose metadata about program structures and memory access patterns
(either automatically extracted or provided
by programmers) to prove desirable properties of programs.
Determinism. Many disciplined programming models (and DeNovo) target the safety
property of determinism. We assume the following properties about disciplined software
to guarantee determinism:


Parallelism is expressed through


structured constructs, such as nested
fork-joins (communicated to hardware through barriers). We assume
programs are divided into a series of
parallel phases whose boundaries are
demarcated by barriers.
We assume metadata that efficiently
summarizes which memory locations
may be written or read in a given parallel phase. This information may be conservative; for example, for phases where
such metadata is not available, we can
assume as a default that all memory
will be read and written in that phase.
There are no conflicting accesses
among concurrent tasks in a parallel
phase; that is, the program is datarace-free. This is already required by
common programming languages
such as C, C, and Java for reasonable semantics.

As an example of a disciplined language


with these properties, we use DPJ10,11 to
drive our software-hardware codesign approach. DPJ is a Java extension that


[Figure 1. Conceptual comparison of the complexity and key coherence activities of the MESI, DeNovo, and DeNovoND protocols, shown against a high-level program flow with global synchronization (barriers) and lock synchronization (critical sections). The explosion symbols represent invalidation of data in each core's private cache, and the thin black arrows between explosion symbols represent network messages. MESI sends invalidation messages on every write miss, assuming other cores may have concurrently read the accessed data. DeNovo and DeNovoND constrain invalidations to well-defined synchronization points such as barriers and lock acquires/releases with the data-race-freedom guarantee. At such synchronization points, the reading core performs a local self-invalidation in its cache for potentially stale data without incurring additional network traffic. The dot-filled explosion symbols represent self-invalidations by DeNovo (writeable regions in the previous phase). The black-filled explosion symbols represent additional self-invalidations by DeNovoND to deal with nondeterministic critical sections (atomic regions in critical sections). To identify which data is potentially stale, DeNovo relies entirely on programmer-provided region information, whereas DeNovoND uses a combination of software information and simple hardware support in the form of access signatures transferred on a lock hand-off.]


enforces deterministic-by-default semantics via compile-time type checking. DPJ provides the structured parallel constructs foreach and cobegin. It requires programmers to assign every object field or array element to a named region. Furthermore, the programmer must annotate each method with read and write effects that summarize the regions read and written by that method (a region can be noncontiguous in memory). The DPJ compiler uses this information to perform a modular type-check that ensures that no two concurrent tasks have interfering effects. DeNovo uses DPJ's region and effect annotations to generate metadata that efficiently summarizes the memory regions that are read or written in each parallel phase.

Nondeterminism. To enable the programmer to express nondeterminism, DPJ provides explicitly nondeterministic parallel constructs (foreach_nd and cobegin_nd).11 These constructs allow conflicting accesses (effects) between concurrent tasks, but require that such accesses be enclosed within atomic sections, that their read and write effect declarations also include the atomic keyword, and that their region types be declared as atomic. The compiler checks that all of these constraints are satisfied by any type-checked program, again using a simple, modular type-checking algorithm.


With these constraints, DPJ can provide the following strong guarantees:

• Data-race freedom.
• Strong isolation of accesses in all atomic and deterministic parallel constructs; that is, these constructs appear to execute atomically.
• Determinism by default; that is, any parallel construct that does not contain an explicit nondeterministic construct provides deterministic output for a given input.
• Sequential composition for deterministic constructs; that is, tasks of a deterministic construct appear to occur in the sequential order implied by the program.

These guarantees not only ensure sequential consistency but also allow programmers to reason with very high-level, strongly isolated, and composable components such as complete foreach constructs and all atomic sections.

DeNovoND assumes that, similar to DPJ programs, atomic sections and accesses to atomic regions are identified. It also assumes that the atomic sections are converted to locks, where the same lock is used to protect a given atomic region in a given parallel phase.

DeNovo for deterministic codes


Conventional directory coherence protocols ensure reads return up-to-date values by tracking all current sharers of the cache line and invalidating them on a write, incurring significant storage (sharer-list) and network traffic (invalidations and acknowledgments) overhead. These protocols also have many transient states, making them hard to verify and extend. DeNovo uses software information to achieve lower overheads and a simpler design.

DeNovo divides the coherence problem into two parts:

• No stale data: A read should never see stale data in its private cache.
• Locatable up-to-date data: On a read miss in the private cache, it should know where to get an up-to-date copy of the data.

Here, stale and up-to-date are defined by the memory consistency model semantics; for DeNovo, this is sequential semantics because of the determinism properties described earlier.

Regarding no stale data, DeNovo exploits the software knowledge of which regions are written in a parallel phase of the program. Before a new phase, each core issues a self-invalidation to invalidate any data in its private cache that could have been written in the previous phase by another core. Data-race freedom implies that if a core reads some data in the subsequent phase, no other core could have concurrently written that data, ensuring that a read never returns a stale value. This eliminates the need for tracking sharer lists in directories and the ensuing writer-initiated invalidation and acknowledgment messages, significantly simplifying the protocol.

Regarding locatable up-to-date data, writers register themselves (at word granularity) in a structure similar to the directory, which we call the registry. However, unlike directories, the registry does not need additional storage overhead in the presence of shared last-level caches. The data arrays of the shared last-level cache can also serve as the registry: at a given time, they either store an up-to-date copy of the data or the location (core ID) of an up-to-date copy of the data.
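As a rough, assumption-laden illustration of these two mechanisms, the C++ sketch below models the per-word actions of a DeNovo-style private cache and registry. The three states match the protocol (Invalid, Valid, Registered), but the data structures, the region-membership test, and the registry interface are simplifications for exposition, not the actual implementation.

#include <cstdint>
#include <unordered_map>
#include <unordered_set>

enum class WordState { Invalid, Valid, Registered };

struct CachedWord { WordState state = WordState::Invalid; uint32_t data = 0; };

// Simplified registry role of the shared last-level cache: per word, it holds
// either the up-to-date data or the ID of the core that registered the word.
struct Registry {
    struct Entry { bool registered = false; int ownerCore = -1; uint32_t data = 0; };
    std::unordered_map<uint64_t, Entry> words;

    uint32_t readFor(uint64_t addr) { return words[addr].data; }  // sketch: always serve from here
    void registerWriter(int core, uint64_t addr) {
        auto& e = words[addr];
        e.registered = true;
        e.ownerCore = core;   // the L2 data bank now names the owner instead of holding data
    }
};

struct PrivateCache {
    int coreId;
    std::unordered_map<uint64_t, CachedWord> words;

    // At a phase boundary: self-invalidate non-Registered words belonging to
    // regions that software says were writeable in the previous phase.
    void selfInvalidate(const std::unordered_set<uint64_t>& writeableWords) {
        for (auto& [addr, w] : words)
            if (w.state == WordState::Valid && writeableWords.count(addr))
                w.state = WordState::Invalid;
    }

    uint32_t read(uint64_t addr, Registry& reg) {
        auto& w = words[addr];
        if (w.state == WordState::Invalid) {   // miss: the registry locates the up-to-date copy
            w.data = reg.readFor(addr);
            w.state = WordState::Valid;
        }
        return w.data;                         // Valid or Registered data is never stale
    }

    void write(uint64_t addr, uint32_t value, Registry& reg) {
        auto& w = words[addr];
        if (w.state != WordState::Registered)  // first write: register ownership at word granularity
            reg.registerWriter(coreId, addr);
        w.state = WordState::Registered;
        w.data = value;
    }
};

Note that the sketch contains no transient states and no sharer lists; the only protocol actions are local self-invalidation, registration, and a read request to the registry.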
The DeNovo protocol has several advantages:
• Simplicity: Largely because there are no software data races, the DeNovo protocol does not have any transient states; it has exactly three stable states (Invalid, Valid, and Registered). This makes the protocol much easier to verify compared to a conventional hardware MESI protocol (demonstrated by experiments with a model-checking verification tool7).
• Extensibility: DeNovo's simplicity makes it easy to extend. For example, DeNovo incorporated two significant optimizations without adding any additional protocol states: direct cache-to-cache transfer and flexible communication granularity based on regions.
• Storage overhead: DeNovo incurs no directory storage overhead, a potential

source of unscalability in current systems. DeNovo does incur additional state overhead to track region information and keep per-word coherence state; however, this is compensated for after a few tens of cores. Furthermore, DeNovo's overhead is scalable (constant per cache line).
• Performance and energy: DeNovo performs about the same as or better than MESI for a range of applications, with up to 77 percent improved memory stall time and up to 71 percent reduced network traffic, which can translate into significant energy savings.7

Overall, DeNovo showed that our software-hardware codesign approach can lead to simpler and more efficient hardware than the state of the art, but only for deterministic programs.

The DeNovoND system


For deterministic programs, DeNovo achieves its benefits primarily by replacing writer-induced invalidations (and eliminating related overheads) with compiler-inserted self-invalidations for all writeable data in a given parallel phase. DPJ's data-race-freedom guarantee ensures that only the writing core will read that data in a phase, ensuring up-to-date values for all subsequent data reads in that phase.

For nondeterministic programs, however, we cannot assume that concurrent tasks will have no conflicting data accesses. DeNovoND assumes that such conflicting accesses will be protected by the same lock (this lock may change in a different parallel phase),9 and that such accesses are explicitly identified as atomic accesses (as guaranteed by DPJ,11 for example). DeNovoND provides a consistency model and coherence mechanism for these atomic data accesses enabled by the software information, while separately implementing an efficient hardware mechanism for locks and continuing to use DeNovo's mechanism for the nonatomic (deterministic) accesses.

Consistency model


Software that obeys our assumptions of disciplined nondeterminism is guaranteed a very strong consistency model. Specifically, we guarantee sequential consistency and hence a total order over all memory operations (that is consistent with program order). A read must return the value of the last write to its location as defined by this total order. The software assumptions also enforce additional rules that further constrain this last write for data operations, simplifying reasoning for software and implementation for hardware as follows.

Nonatomic accesses. Our software assumptions ensure that for a nonatomic access, there cannot be a conflicting access by another concurrent task in the same phase. Thus, for a nonatomic read, the last conflicting write is either from its own task or from a task in a previous phase (like DeNovo). Our implementation is identical to DeNovo for these accesses.

Atomic accesses. We allow conflicting accesses among concurrent tasks, but all such accesses to a given location are known to be in critical sections protected with the same lock. These critical sections must execute atomically, imposing a total order on all such critical sections and the conflicting atomic accesses they contain. A read therefore must return the value from the (unique) last conflicting write from a critical section in the current phase; if such a write does not exist, then the read must return the (unique) last conflicting write from the previous phase.

Data coherence
The coherence mechanism must simply ensure that a read returns the value from the last write as defined by the consistency model just described. As with DeNovo, we divide the coherence mechanism into two components: no stale data, and locatable up-to-date data. DeNovoND implements hardware mechanisms to meet these requirements as DeNovo does, dealing with the two issues separately.
No stale data. For nonatomic accesses, we
take the same approach as DeNovo. Thus, at
the start of a parallel phase, the compiler
inserts self-invalidations for data regions that
could have been written by other cores in the


previous phase. For atomic reads (which may conflict with writes in concurrent tasks), we also use self-invalidations to ensure the read does not return stale data. We use the consistency model described earlier to determine when and what to self-invalidate, as follows.

To determine when to self-invalidate, we note that a concurrent conflicting read must be in a critical section itself and must return the value of the last write, also in a critical section protected by the same lock in the same phase, as described earlier. Thus, it is sufficient to self-invalidate any time between the start of a critical section and an atomic read in that section.

We have several options regarding what to self-invalidate, including the entire cache (if the software information is not provided) or only the atomic regions. Considering that atomic regions generally have more dynamic and unpredictable access patterns, we require each core to update an access signature that records all writes to atomic regions. When a core releases a lock that is acquired by another core, the releaser transfers its access signature along with the lock to the acquirer. On a first atomic read to a location in a critical section, the acquiring core must check the signature and self-invalidate the location if it is present in the signature. The acquiring core must forward the union of its signature and the signatures it has received to the next acquirer. The signature is reset at the end of the parallel phase (barrier). (Combining information on modified data with lock transfer resembles software-distributed shared-memory consistency models.12,13 However, DeNovoND implements a much simplified yet effective scheme: it tracks only modified atomic data in this way, and employs entirely local software-driven self-invalidation for nonatomic data.)

DeNovoND implements the access signature as a small Bloom filter in hardware, which is a popular solution due to its storage efficiency, simplicity, and low access latency. The keys to our Bloom filters are addresses accessed in atomic regions, which are mapped to filter entries through hash functions. To keep the false-positive rate low, the size of each Bloom filter should be determined on the basis of the average size of the key domain. We found that a very small size (for example, 256 bits) suffices because our filter tracks only atomic accesses. We conservatively keep one filter per core to track all modifications across different critical sections on the same core.
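A minimal sketch of how such an access signature might behave is shown below in C++. The 256-bit size and the four hash functions mirror the configuration in Table 1, but the specific hash mixing and the interface names are illustrative assumptions rather than the hardware design.

#include <bitset>
#include <cstdint>

// Per-core access signature: a small Bloom filter over addresses written in
// atomic regions. False positives are possible; false negatives are not.
class AccessSignature {
    static constexpr int kBits = 256;
    static constexpr int kHashes = 4;   // the hardware uses H3 hashing; plain mixing here
    std::bitset<kBits> bits;

    static uint32_t mix(uint64_t addr, uint32_t seed) {
        uint64_t x = addr ^ (0x9E3779B97F4A7C15ULL * (seed + 1));
        x ^= x >> 33;
        x *= 0xFF51AFD7ED558CCDULL;
        x ^= x >> 29;
        return static_cast<uint32_t>(x % kBits);
    }

public:
    void insert(uint64_t addr) {              // on every store to an atomic region
        for (uint32_t h = 0; h < kHashes; ++h) bits.set(mix(addr, h));
    }
    bool mayContain(uint64_t addr) const {    // checked on the first atomic read in a critical section
        for (uint32_t h = 0; h < kHashes; ++h)
            if (!bits.test(mix(addr, h))) return false;
        return true;
    }
    void unionWith(const AccessSignature& other) { bits |= other.bits; }  // on a lock hand-off
    void reset() { bits.reset(); }            // at the phase-ending barrier
};

In this scheme, the acquirer self-invalidates a Valid word only when mayContain() returns true, so a false positive costs at most an unnecessary read miss, never a correctness violation.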
Locatable up-to-date data. For the second requirement of finding the value of the last write on a miss, we use ideas similar to DeNovo. On a write to valid or invalid data in the private cache, a registration request is sent to the last-level cache, which stores the core's ID in its data bank (assuming a shared last-level cache). A core's registrations to atomic data are required to complete before the next lock release so that conflicting writes from critical sections are serialized in the right order. As a result of the registration, the word transitions to a Registered state in the private cache, which is equivalent to a Modified state in the conventional MESI protocol. A read that misses in the cache simply goes to the registry (shared last-level cache) to find the up-to-date value.

Example
Figure 2 provides an example to illustrate how DeNovoND uses the Bloom filters. The code snippet on the left depicts three variables, a, b, and c, in atomic region xR. It then shows a critical section protected by lock x with atomic read and write effects on region xR. The right side of the figure shows an execution with two cores, C1 and C2. It also shows the signatures at each core, assuming a perfect hash function.

When a core performs an atomic write, it inserts the accessed address into its Bloom filter. Thus, at the end of a critical section, all addresses modified in the section are recorded in the core's filter; that is, their entries are nonzero. In Figure 2, on each store request to a, b, and c in the lightly shaded critical sections, the Bloom filters on C1 or C2 are updated. The second critical-section phase on C2 does not update C2's Bloom filter because it does not have atomic writes.

On an acquire, the access signature at the releaser is transferred to the acquirer together with the lock. As a result, all modifications preceding the release associated with the acquire are made visible to the acquirer. The acquirer, on receiving the Bloom filter,


[Figure 2. An example of propagating atomic writes using access signatures: a code snippet and an execution with two cores, C1 and C2, whose caches and filters are initially empty. Assume a and b are in the same cache line. The code snippet declares 'Atomic region xR; a, b, c in atomic xR;' and a critical section 'Lock (x) reads atomic xR writes atomic xR { ... }'. The execution shows each core's acquires and releases of lock x, its atomic loads and stores to a, b, and c, the lock transfers between cores, and the resulting signature insertions, unions, self-invalidations, and prefetch hits.]


updates its own Bloom filter by taking the union of its local Bloom filter and the releaser's Bloom filter. Figure 2 shows the resulting Bloom filters at the beginning of each critical section, of which the lightly shaded entries come from the union operation. Note that the releaser only sends the signature, not the actual data.

Atomic reads need to consult the signatures obtained from remote releasers to determine if cache data is valid or stale. If the accessed word is in the Registered or Invalid state, then regardless of the signature state, the read is always a cache hit or a miss, respectively. If the word is in the Valid state, then it is also up-to-date if its address does not appear in the access signature. If the word is in the Valid state and its address hits in the access signature, then it may or may not be up-to-date, depending on whether it has been previously read or prefetched in this critical section. We use an additional touched-atomic bit per word to identify (1) if the word has already been read in this critical section (that is, the previous read brought up-to-date data that is still valid), and (2) if the word was prefetched from the L2 cache, memory, or a remote L1 cache in the Registered or touched-atomic state. If the touched-atomic bit for a word is set from either of these cases, the read will be a cache hit; otherwise, the word is considered stale and a read request is sent to the L2 cache.

In Figure 2, assume that variables a and b are in the same cache line. Then C1's load b will be a hit, since C1's load a will bring in b as well and set its touched-atomic bit. On the other hand, load b in C2's second critical section is a miss. This is because the preceding load a will read a in its own cache in the Registered state and so will not prefetch b, which is registered at C1.
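The hit/miss decision just described can be condensed into a small C++ sketch. The per-word state, the touched-atomic bit, and the signature check follow the text; the function and field names are our own illustrative choices.

enum class WordState { Invalid, Valid, Registered };

struct AtomicWord {
    WordState state = WordState::Invalid;
    bool touchedAtomic = false;  // set by a prior atomic read in this critical section,
                                 // or by a qualifying prefetch
};

// Returns true if an atomic read can be serviced from the private cache;
// false means the word is treated as stale and a request goes to the L2/registry.
bool atomicReadHits(const AtomicWord& w, bool addrHitsReceivedSignature) {
    switch (w.state) {
    case WordState::Registered: return true;          // this core wrote it: always a hit
    case WordState::Invalid:    return false;         // always a miss
    case WordState::Valid:
        if (!addrHitsReceivedSignature) return true;  // not written under this lock: still up to date
        return w.touchedAtomic;                       // otherwise a hit only if already touched
    }
    return false;
}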

Distributed queue-based locks


Our distributed queue-based lock
design is modeled after Queue-On-SyncBit
(QOSB),14 where the identities of the cores
waiting for a lock are maintained in a


queue of pointers distributed across the waiting cores' L1 caches and the L2 cache (assuming, without loss of generality, a two-level cache hierarchy with a shared last-level cache). All requests to a given lock are serialized at the corresponding shared L2 cache bank. The data portion of the L2 cache entry for a contended lock tracks the last requestor (that is, the tail of the queue of waiters). When the L2 cache receives the next request for the lock, it forwards it to the current tail's L1 cache. On receiving such a forwarded request, the L1 cache checks a bit in its copy of the lock word, called the Locked bit, to determine if the lock is still held or was unlocked. In the former case, the L1 cache stores the requestor's ID in another field of the lock word, referred to as nextPtr. In the latter case, the L1 cache responds to the requestor with its signature and transfers the lock, marking its own lock word Invalid. When a core releases a lock, its L1 cache checks its nextPtr: if not null, it transfers the lock (with the signature) to the nextPtr core; otherwise, it unsets its Locked bit. We allow eviction of lock words from the L1 and L2 caches by reusing the data portion of the lock words in the next level of the memory hierarchy to store lock-queue information. This approach relies on using L2 data banks to store (nondata) metadata, which is similar to DeNovo's tracking of registration information for the Registered state.
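The following C++ sketch captures this queueing discipline at a very high level: the Locked bit and nextPtr live in the waiting cores' L1 copies of the lock word, while the L2 entry tracks only the tail. Messaging, evictions, and concurrency are omitted, and all names are illustrative assumptions, so this is a sketch of the hand-off logic rather than the microarchitecture.

#include <optional>
#include <unordered_map>

// Per-core L1 view of one lock word.
struct L1LockWord {
    bool locked = false;           // Locked bit
    std::optional<int> nextPtr;    // ID of the next waiting core, if any
};

// Shared L2 bank entry for the lock: serializes requests and remembers the tail waiter.
struct L2LockEntry {
    std::optional<int> tailCore;
};

struct LockQueueSketch {
    std::unordered_map<int, L1LockWord> l1;  // core ID -> that core's copy of the lock word
    L2LockEntry l2;

    // An acquire request, serialized at the L2 bank.
    void acquire(int core) {
        if (!l2.tailCore) {
            l1[core].locked = true;          // uncontended: grant immediately
        } else {
            int tail = *l2.tailCore;         // forward the request to the current tail's L1
            if (l1[tail].locked)
                l1[tail].nextPtr = core;     // still held: enqueue the requestor
            else
                handOff(tail, core);         // already unlocked: transfer the lock now
        }
        l2.tailCore = core;                  // the requestor becomes the new tail
    }

    void release(int core) {
        auto& w = l1[core];
        if (w.nextPtr) handOff(core, *w.nextPtr);  // pass the lock to the next waiter
        else w.locked = false;                     // no waiter: just unset the Locked bit
    }

private:
    void handOff(int from, int to) {
        l1[from] = L1LockWord{};             // the releaser's copy becomes Invalid
        l1[to].locked = true;                // the acquirer now holds the lock
        // In DeNovoND, the releaser's access signature travels with this transfer.
    }
};

In the real design these actions are messages between cache controllers, and the 32-byte signature payload rides on the lock-transfer message.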
Overall, DeNovoND continues to have no transient states, sharer lists in directories, invalidations, or false sharing, and is still flexible enough to incorporate DeNovo's optimizations for communicating at granularities other than cache lines (possibly without indirection via a directory) without added states.

Evaluation
For our evaluations, we use the Simics full-system functional simulator with the Wisconsin GEMS (General Execution-driven Multiprocessor Simulator) memory timing simulator and the Princeton Garnet network simulator. Table 1 shows the key parameters of our simulated systems. For the
Table 1. Simulated system parameters.

Core frequency: 2 GHz
Number of cores: 16
L1 data cache: 64 Kbytes, 64 bytes per line
L2 cache (16 banks, Non-Uniform Cache Architecture): 16 Mbytes, 64 bytes per line
Memory: 4 Gbytes, 4 on-chip controllers
L1 hit latency: 1 cycle
L2 hit latency: 29 to 61 cycles (bank-dependent)
Remote L1 hit latency: 35 to 83 cycles
Memory hit latency: 197 to 261 cycles
Network parameters: 2D mesh, 16-bit flits
Bloom filter size: 256 bits (infinite for reference)
Hash function: 4 H3

signature transfer, we add a 256-bit (32-byte) payload to the lock transfer message and simulate network traffic and latency accordingly. This is a conservative estimate, since the signature could be compressed. For comparison, we also simulate an idealized Bloom filter with infinite length (with no impact on traffic).

We evaluated 11 benchmarks with lock synchronization, taken from various sources including the SPLASH-2, PARSEC 2.1, and STAMP benchmark suites, to represent a range of behaviors. We found that three of the six transactional applications (from STAMP) spent more than 70 percent of their execution time on lock acquires for all studied configurations. Because parallelization using lock synchronization is clearly inappropriate for these applications, for both MESI and DeNovoND, we focus our results here on the other eight applications. (See our ASPLOS paper for the complete results.9)

Results
Figure 3 shows the execution time and network traffic of our applications for MESI, DeNovoND with the idealized infinite Bloom filter (DInf), and DeNovoND with a 256-bit Bloom filter (D256). For MESI, we use the POSIX Threads mutex library for locks, because implementing distributed queue-based locks on MESI involves significant complexity to deal with numerous transient

[Figure 3. Performance of DeNovoND versus MESI: execution time (a) and network traffic (b). All bars are normalized to MESI. In (a), each bar is divided into compute time, stall time due to data memory accesses, barrier sync time, and lock acquire time. The bars in (b) are divided by message type: load, store, queue lock/unlock, writeback, and invalidation. The workloads are Barnes, Ocean, and Water from the SPLASH-2 suite; Fluidanimate and Streamcluster from PARSEC; TSP (Traveling-Salesman Problem); and K-means and SSCA2 from the STAMP benchmark suite. For each workload, bars are shown for MESI, DInf, and D256.]

states and race conditions in the protocol. We also studied MESI and DeNovoND with (the same) idealized lock implementation for a fair comparison;9 the results are consistent with or slightly better (for DeNovoND) than those shown in Figure 3.


Execution time. Figure 3a shows that DeNovoND incurs virtually no performance degradation over MESI, demonstrating the efficiency of Bloom-filter-based invalidations and distributed queue-based locks. DInf shows the same or slightly better (up to 5 percent) execution time compared to MESI. In particular, DInf shows either comparable or large benefits in terms of memory stall time owing to the lack of false sharing. Comparing DInf and D256, we see that restricting the Bloom filter size to 256 bits gives virtually the same execution times as the infinite-length filters for almost all applications.

Network traffic. Figure 3b shows that for all the applications, DeNovoND has lower traffic than MESI (33 percent lower on average, 67 percent maximum). This directly translates into energy reduction. The primary sources of these savings in DeNovoND are as follows:

• DeNovoND does not incur any traffic for invalidations, a significant effect in all applications.


- Store traffic is reduced in some applications owing to DeNovo's write-validate policy (a store miss does not bring in the cache line).
- The net reduction in load misses (memory time) due to the lack of false sharing directly leads to lower load traffic in several applications.
- Load traffic is further reduced because a load response contains only the valid or registered words of a cache line; because the coherence state is preserved per word, some words may be invalid at the servicing cache (a minimal sketch of this per-word bookkeeping follows the list).
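To make the per-word bookkeeping concrete, here is a deliberately simplified C++ sketch; the names (CacheLine, WordState) and the 16-word line are our own illustrative choices and do not mirror the DeNovoND hardware structures.

#include <array>
#include <cstdint>
#include <cstdio>

// Illustrative only: per-word coherence state for one cache line.
enum class WordState { Invalid, Valid, Registered };

struct CacheLine {
    static constexpr int kWords = 16;          // e.g., 64-byte line of 4-byte words
    std::array<uint32_t, kWords> data{};
    std::array<WordState, kWords> state{};     // value-initializes to Invalid
};

// Write-validate: a store marks only the touched word Registered and
// writes it locally; it does not fetch the rest of the line.
void store_word(CacheLine& line, int w, uint32_t value) {
    line.data[w] = value;
    line.state[w] = WordState::Registered;
}

// A load response carries only words that are Valid or Registered at the
// servicing cache; Invalid words are simply not supplied.
void build_load_response(const CacheLine& line) {
    for (int w = 0; w < CacheLine::kWords; ++w) {
        if (line.state[w] != WordState::Invalid)
            std::printf("word %d -> %u\n", w, static_cast<unsigned>(line.data[w]));
        else
            std::printf("word %d -> not supplied\n", w);
    }
}

int main() {
    CacheLine line;
    store_word(line, 3, 42);     // only word 3 becomes Registered
    build_load_response(line);   // only word 3 is supplied
}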

Overall, our results show that for these applications, the access signature mechanism allows DeNovoND to enjoy all the benefits of DeNovo even in the presence of lock-based synchronization. Furthermore, the signature size needed is small (32 bytes).
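A 32-byte signature is simply a 256-bit Bloom filter over the word addresses accessed in a critical section. The C++ sketch below shows the two operations such a signature needs, recording an access and conservatively testing membership; the class name and hash constants are our own choices, not taken from the paper, and false positives only cause extra self-invalidations, never missed ones.

#include <bitset>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Illustrative 256-bit Bloom-filter "access signature" over word addresses.
// The two hash functions are arbitrary choices for this sketch.
class AccessSignature {
    std::bitset<256> bits_;
    static unsigned h1(uint64_t a) { return static_cast<unsigned>((a * 0x9E3779B97F4A7C15ULL) >> 56); }
    static unsigned h2(uint64_t a) { return static_cast<unsigned>((a * 0xC2B2AE3D27D4EB4FULL) >> 56); }
public:
    void insert(uint64_t word_addr) {        // record an access made in a critical section
        bits_.set(h1(word_addr));
        bits_.set(h2(word_addr));
    }
    // May report false positives (extra self-invalidations), never false negatives.
    bool maybe_contains(uint64_t word_addr) const {
        return bits_.test(h1(word_addr)) && bits_.test(h2(word_addr));
    }
    std::size_t size_bytes() const { return bits_.size() / 8; }   // 32 bytes
};

int main() {
    AccessSignature sig;
    sig.insert(0x1000);                      // addresses touched under a lock
    sig.insert(0x1044);
    std::printf("0x1000 in signature? %d\n", sig.maybe_contains(0x1000));
    std::printf("0x2000 in signature? %d\n", sig.maybe_contains(0x2000));
    std::printf("signature size: %zu bytes\n", sig.size_bytes());
}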

The global address space provided by the shared-memory programming model simplifies many aspects of parallel programming. Unfortunately, the hardware coherence protocols and consistency models required to provide the illusion of a single global address space are becoming increasingly complex and inefficient. Our combined hardware-software approach, first proposed in the DeNovo project, holistically rethinks the memory hierarchy from the ground up, driven entirely by software requirements and eliminating needless inefficiencies.
For the DeNovo vision to succeed, however, the class of programs that can be supported must be broadened significantly. This article takes a significant step toward that goal by showing how DeNovoND can support disciplined nondeterministic codes without giving up the complexity, performance, and energy advantages of DeNovo. Although DeNovoND uses DPJ's model as a motivator, the annotations required are not language dependent, and designing a language-independent instruction set architecture for DeNovo coherence is an important part of our ongoing effort toward a complete interface for the DeNovo system.
The key insight underlying DeNovoND is the separation of concerns between racy and nonracy accesses, and between deterministic and nondeterministic nonracy accesses. We believe that the insights from this work and the mechanisms proposed are also applicable to other parallel patterns; for example, we are currently using these mechanisms to support more unstructured synchronization such as lock-free, nonblocking constructs. We believe there is significant momentum from the software community to embrace the types of discipline we seek. For example, there is significant recent work on more disciplined operating systems.15
We do not expect that all software will become disciplined or that legacy codes will not need to be supported. However, with our judicious use of hardware and software mechanisms, we believe we can support legacy codes, but they may not see the full benefits of this approach. For example, C, C++, and Java already require that programs be data-race-free and annotated with atomic/volatile annotations for a reasonable consistency model; such codes can be supported with self-invalidations on synchronization.
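As a small illustration of what data-race-free, annotation-based code looks like in practice, the C++11 fragment below (our example, not code from the article) confines all concurrent communication to one atomic flag; an acquire on that flag is the kind of synchronization point at which the self-invalidation mentioned above would be applied.

#include <atomic>
#include <cstdio>
#include <thread>

// Data-race-free by construction: the only concurrently accessed shared
// variable is the atomic flag; the plain int is published before the flag
// is set and read only after the flag is observed.
int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                   // ordinary write, no race
    ready.store(true, std::memory_order_release);   // annotated synchronization
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    std::printf("payload = %d\n", payload);         // guaranteed to see 42
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}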
The data-race-free consistency model now adopted for the most popular programming languages is an initial example of the discipline and hardware-software codesigned approach we advocate. It took multiple decades and many intermediate steps to convert from the first data-race-free proposals to current practice.8 We expect that if the DeNovo approach is going to succeed, it will also require many intermediate milestones. We believe that DeNovoND provides a significant milestone and hope it inspires further unconventional approaches for more efficient global address space systems. MICRO
Acknowledgments
This work was supported in part by Intel
and Microsoft through the Universal Parallel
Computing Research Center at Illinois, by
Intel through the Illinois-Intel Parallelism
Center, and by the National Science Foundation under grant CCF-1018796.

....................................................................
References

1. M.M.K. Martin, M.D. Hill, and D.J. Sorin, "Why On-Chip Cache Coherence Is Here to Stay," Comm. ACM, vol. 55, no. 7, 2012, pp. 78-89.


2. M. Zhang, A.R. Lebeck, and D.J. Sorin, "Fractal Coherence: Scalably Verifiable Cache Coherence," Proc. 43rd Ann. IEEE/ACM Intl Symp. Microarchitecture, 2010, pp. 471-482.
3. H. Zhao et al., "Protozoa: Adaptive Granularity Cache Coherence," Proc. 40th Ann. Intl Symp. Computer Architecture (ISCA 40), 2013, pp. 547-558.
4. Intel, "The SCC Platform Overview," May 2010.
5. N.P. Carter et al., "Runnemede: An Architecture for Ubiquitous High-Performance Computing," Proc. IEEE 19th Intl Symp. High-Performance Computer Architecture (HPCA 13), 2013, pp. 198-209.
6. S.W. Keckler et al., "GPUs and the Future of Parallel Computing," IEEE Micro, vol. 31, no. 5, 2011, pp. 7-17.
7. B. Choi et al., "DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism," Proc. Intl Conf. Parallel Architectures and Compilation Techniques (PACT 11), 2011, pp. 155-166.
8. S. Adve and H.-J. Boehm, "Memory Models: A Case for Rethinking Parallel Languages and Hardware," Comm. ACM, Aug. 2010, pp. 90-101.
9. H. Sung, R. Komuravelli, and S.V. Adve, "DeNovoND: Efficient Hardware Support for Disciplined Nondeterminism," Proc. 18th Intl Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 13), 2013, pp. 13-26.
10. R. Bocchino Jr. et al., "A Type and Effect System for Deterministic Parallel Java," Proc. 24th ACM SIGPLAN Conf. Object Oriented Programming, Systems, Languages and Applications (OOPSLA 09), 2009, pp. 97-116.
11. R.L. Bocchino Jr. et al., "Safe Nondeterminism in a Deterministic-by-Default Parallel Language," Proc. 38th Ann. ACM SIGPLAN-SIGACT Symp. Principles of Programming Languages (POPL 11), 2011, pp. 535-548.
12. B.N. Bershad, M.J. Zekauskas, and W.A. Sawdon, "The Midway Distributed Shared Memory System," Compcon Digest of Papers, 1993, pp. 528-537.
13. L. Iftode, J.P. Singh, and K. Li, "Scope Consistency: A Bridge between Release Consistency and Entry Consistency," Proc. 8th Ann. ACM Symp. Parallel Algorithms and Architectures (SPAA 96), 1996, pp. 277-287.
14. J.R. Goodman, M.K. Vernon, and P.J. Woest, "Efficient Synchronization Primitives for Large-Scale Cache-Coherent Multiprocessors," Proc. 3rd Intl Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 89), 1989, pp. 64-75.
15. A. Baumann et al., "The Multikernel: A New OS Architecture for Scalable Multicore Systems," Proc. ACM Symp. Operating Systems Principles (SOSP 09), 2009, pp. 29-44.

Hyojin Sung is a PhD candidate in the Department of Computer Science at the University of Illinois at Urbana-Champaign. Her research focuses on parallel computer architecture. Sung has an MS in parallelizing compilers from the University of California, San Diego.

Rakesh Komuravelli is a PhD candidate in the Department of Computer Science at the University of Illinois at Urbana-Champaign. His research interests include computer architecture and parallel computing. Komuravelli has an MS in protocol verification from the University of Illinois at Urbana-Champaign.

Sarita V. Adve is a professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign. Her research interests are in computer architecture and systems. Adve has a PhD in computer science from the University of Wisconsin. She is a fellow of IEEE and the ACM.

Direct questions and comments about this article to Hyojin Sung, 201 N. Goodwin Ave., Urbana, IL 61801; sung12@illinois.edu.


Awards

................................................................................................................................................................

Reflections from the 2013 Eckert-Mauchly Award Recipient
JAMES GOODMAN
University of Auckland


I thank Editor in Chief Erik


Altman and Associate Editor in Chief
Lieven Eeckhout for inviting me to share
my experiences leading to the Eckert-Mauchly Award. Nobody wins this award
on his own (sadly, this gender reference
is correct: it has been awarded 35 times,
all to men). Many people contributed to
the achievements that led to the award,
and my biggest fear in writing this is that
for lack of space I will leave too many
unrecognized.

A meandering path
Coming of age at the beginning of the
Vietnam War, I spent my twenties simultaneously protesting the war and supporting it by assisting the U.S. Navy in
deploying computers and navigation systems. This deep internal conflict had
many consequences and affected important decisions later in life. But at Texas
Instruments I learned a lot about how
real computers worked from Tom Stringfellow, Quitman Liner, and many others.
Working as an engineer while a graduate student in the 1970s, I was involved
in the development of third-party memory systems attaching to IBM mainframes. I was fortunate that upstart Intel
allowed me a 30-hour work week during
more than five years' employment. I
learned about memory systems by
examining detailed designs of third-party
memory systems for the IBM System/
370 family, struggling particularly with
the problems of how to hold off a


processor designed for core memory or


static RAM when a DRAM required
refresh and how to extend a translation
look-aside buffer to support a larger
memory than the manufacturer had
intended. I learned much about caches
from Mr. Cache, Alan J. Smith at the
University of California, Berkeley. And I
learned a lot about how to conduct
research from my advisor Al Despain,
along with Carlo Sequin and David Patterson, the last of whom has provided me
with sage advice throughout my career,
despite being three years my junior.
Near the end of my studies at Berkeley, as Intel shifted emphasis from memory chips to microprocessors, I moved to
the microprocessor group, where the P1
and P2 (80186 and 80286) were being
designed. The state of the art was the
single-board computer, with multiple
processors sharing their memory by
communicating through a backplane
Multibus. Although computers costing
more than a half-million dollars had
caches, the microprocessors did not, but
Intel understood Moore's law. Extending
the single-board model to computers
with caches led directly to the cache
coherence problem, which I discussed
with Jack Klebanoff, leading me to think
about using Multibus broadcast commands to keep the caches consistent.
Thus, I got an early start on the problems
of cache consistency for the coming generation of microprocessors capable of
shared-memory multiprocessing.

At the age of 36, with three degrees


in engineering but not yet having authored a published paper, I completed a
PhD and took a position as assistant professor in the Department of Computer
Sciences at the University of Wisconsin-Madison. Having investigated architectural support for databases in my
dissertation, I had listed research interests as computer architecture and databases. I was blessed with a colleague,
David DeWitt, who tactfully advised me
that I was more likely to succeed in architecture than databases. Over the next
two decades, I enjoyed a rich and productive environment with colleagues
Andy Pleszkun, Jim Smith, Guri Sohi,
Mark Hill, and David Wood. It was my
natural inclination to work on small
science projects, in part because big
science seemed inaccessible to one
determined to eschew support from military sources. My embrace of small science also persuaded me that not every
interesting idea was worth publishing,
and that I would best succeed by sifting
and winnowing ideas before publishing
them. Perhaps this was less about great
insight than simple laziness, but to this
day my list of publications over 45 years
is barely one per year.
In early collaboration with Jim Smith
and Andy Pleszkun on a decoupled
architecture, I learned a lot about compilers and simulation when Honesty
Young designed an early compiler for
PIPE, a decoupled architecture.1 Within


the same project, Wei-chung Hsu recognized that conflicts between register allocation and code scheduling could be handled best by doing both at once, resulting in work2 I'm very proud to claim despite my minimal contribution.
Working with Mary Vernon and me,
Steve Scott did some excellent work evaluating the Scalable Coherent Interface
(SCI) ring.3 Working with IEEE standards
committees (Futurebus and SCI) over
the next 10 years, I learned a lot about
cache coherence protocols, and with Stefanos Kaxiras, developed a strong belief
that such protocols could be extended
without falling back to a scalable but slow directory-based scheme. Gradually it emerged that caches were highly
effective for the sharing of data, particularly if things could get out of order, but
that locks and critical sections could
exhibit disastrous memory behavior for
cache-based memory systems.
Steve Scott also came up with the brilliant notion of pruning caches, exploring
the novel concept of a distributed directory (or cache) that remembered regions
of the network where a line was not
cached. The concept has since come up
repeatedly in my work toward scalable,
non-directory-based cache consistency,
but this work is rarely referenced, perhaps because it ended up in an IEEE journal4 after multiple conference rejections.
I delved into locks and memory ordering, working with Phil Woest and Mary Vernon to propose the concept of building hardware queues to avoid many of the problems associated with spinlocks. We initially called this Queue-on-Sync-Bit (QOSB, pronounced Cosby), but soon renamed it Queue-on-Lock-Bit (QOLB, pronounced Colby, after the Wisconsin town responsible for a common cheese). This work5 inspired Michael Scott and John Mellor-Crummey to propose the popular MCS lock, a software-built queue that captured much of the benefit of QOLB.6 Meanwhile, Alain Kagi and Doug Burger analyzed the potential for QOLB,7 concluding that it could be effective, but required sophisticated and disciplined programming, as if programming SMPs wasn't hard enough.
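For readers who have not seen one, the kind of software-built queue lock that MCS popularized can be sketched in a few lines of modern C++. The rendering below is our own textbook-style version (names such as MCSNode and MCSLock are ours, not the original MCS code); each waiter spins on a flag in its own queue node rather than on the shared lock word.

#include <atomic>
#include <cstdio>
#include <thread>

// Bare-bones MCS-style queue lock: waiters form a linked queue and each
// spins only on its own node's flag.
struct MCSNode {
    std::atomic<MCSNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

struct MCSLock {
    std::atomic<MCSNode*> tail{nullptr};

    void lock(MCSNode& me) {
        me.next.store(nullptr, std::memory_order_relaxed);
        me.locked.store(true, std::memory_order_relaxed);
        MCSNode* prev = tail.exchange(&me, std::memory_order_acq_rel);
        if (prev) {                                   // queue was non-empty
            prev->next.store(&me, std::memory_order_release);
            while (me.locked.load(std::memory_order_acquire)) { }  // local spin
        }
    }

    void unlock(MCSNode& me) {
        MCSNode* succ = me.next.load(std::memory_order_acquire);
        if (!succ) {
            MCSNode* expected = &me;
            if (tail.compare_exchange_strong(expected, nullptr,
                                             std::memory_order_acq_rel))
                return;                               // no successor: done
            while (!(succ = me.next.load(std::memory_order_acquire))) { }
        }
        succ->locked.store(false, std::memory_order_release);  // hand off
    }
};

int counter = 0;
MCSLock qlock;

void worker() {
    MCSNode node;                    // queue node reused by this thread
    for (int i = 0; i < 100000; ++i) {
        qlock.lock(node);
        ++counter;                   // critical section
        qlock.unlock(node);
    }
}

int main() {
    std::thread a(worker), b(worker);
    a.join(); b.join();
    std::printf("counter = %d\n", counter);   // expect 200000
}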


With Guri Sohi arguing that


speculation could be exploited in many
ways, Alain had the insight that hardware
could deduce when lock contention was
occurring, creating Implicit QOLB, a
queue similar to that of QOLB but without assistance from the programmer.8 A
key notion was that performance could
be improved by delaying a response to a
request for a cache line containing a lock,
allowing the thread holding the lock a
brief opportunity to complete execution
of the critical section.
Once we recognized the benefit of
speculation regarding critical sections,
other ideas quickly followed. For example, recognizing that certain memory
locations could be associated with a
given lock suggested the possibility of
speculative push, passing cache lines
modified within the CS at the time the
now-available lock was replaced.9
Ravi Rajwar extended the notion of
speculating about critical sections one
step further, recognizing that a CS without data conflicts need not acquire the
lock and therefore can share the cache
line containing the lock, permitting concurrent execution of critical sections protected by a common lock.10 I only
realized how counterintuitive this was
when I described it to knowledgeable
colleagues who initially insisted this was
impossible since the programmer explicitly invokes mutual exclusion.
I'm delighted to claim partial credit for
this breakthrough, though my primary
contributions were presenting Ravi with
the context and insisting, over his objections, on calling it Speculative Lock Elision (SLE). Like many brilliant insights,
this seems obvious in retrospect, and
after we disclosed the idea in 2000 I
expected it would soon appear in new
processors. It soon appeared in software
implementations with limited hardware
support in Azul Systems, and experimental software-only versions have been
widely discussed. Sun Microsystems
introduced support for software-hinted
lock elision in their experimental processor Rock, through instructions that
explicitly begin speculative execution

rather than acquire the lock. But only in the past two years, a full decade later, has this capability appeared in commercial products.
After a sabbatical in 2000 to 2001 at
Intel, I discovered the frustration of company secrecy preventing the disclosure
of interesting new ideas. Herbert Hum
and I conceived a novel cache-coherence
protocol intended to exploit the higher
bandwidth opportunities present with
the emerging transmitter equalization
(pre-emphasis clocking) technology, and
further increasing the speed advantage
of point-to-point networks over buses.
The starting point was the notion of a
broadcast coherence protocol, with the
introduced problem being event ordering, conveniently avoided by a bus.
Although our original MESIF Coherence
Protocol evolved some before becoming
a critical part of QPI source snooping in
Nehalem, we were initially prohibited by
Intel from publishing the idea, then had it
rejected twice by ISCA11,12 because of
limits on what we could disclose.
In 2003, after 23 years at Wisconsin
and with an empty nest, I took up a position in computer science at the University of Auckland in New Zealand. Fuad
Tabba and I collaborated extensively with
Mark Moir on transactional memory
issues, and Fuad experimented with a
Rock prototype, demonstrating that a
hybrid TM system (best-effort hardware support for transactions, falling back to software when necessary) is a promising approach to supporting transactional
memory.13

Considering the incredible advances made in computer architecture over my career, it is easy to suggest that architecture thrived because of Moore's law and dies with it; where could it possibly go from here? But we still don't know how to build general-purpose parallel computers that are easy to program. I believe there are yet many opportunities to contribute to the goal of truly scalable systems that can be programmed by unsophisticated users. Computer Architecture lives! MICRO


............................................................
References

1. H.C. Young and J.R. Goodman, "A Simulation Study of Architectural Data Queues and Prepare-To-Branch Instruction," Proc. IEEE Intl Conf. Computer Design (ICCD): VLSI in Computers, 1984, pp. 544-549.
2. J.R. Goodman and W.-C. Hsu, "Code Scheduling and Register Allocation in Large Basic Blocks," Proc. 2nd Intl Conf. Supercomputing, 1988, pp. 442-452.
3. S.L. Scott, J.R. Goodman, and M.K. Vernon, "Performance of the SCI Ring," Proc. 19th Ann. Intl Symp. Computer Architecture (ISCA 92), 1992, pp. 403-414.
4. S.L. Scott and J.R. Goodman, "Performance of Pruning-Cache Directories for Large-Scale Multiprocessors," IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 5, 1993, pp. 520-534.
5. J.R. Goodman, M.K. Vernon, and P.J. Woest, "A Set of Efficient Synchronization Primitives for a Large-Scale Shared-Memory Multiprocessor," Proc. 3rd Intl Conf. Architectural Support for Programming Languages and Operating Systems, 1989, pp. 64-75.
6. J.M. Mellor-Crummey and M.L. Scott, "Synchronization without Contention," Proc. 4th Intl Conf. Architectural Support for Programming Languages and Operating Systems, 1991, pp. 269-278.
7. A. Kagi, D. Burger, and J.R. Goodman, "Efficient Synchronization: Let Them Eat QOLB," Proc. 24th Ann. Intl Symp. Computer Architecture (ISCA 97), 1997, pp. 170-180.
8. R. Rajwar, A. Kagi, and J.R. Goodman, "Improving the Throughput of Synchronization by Insertion of Delays," Proc. 6th Intl Symp. High Performance Computer Architecture, 2000, pp. 168-179.
9. R. Rajwar, A. Kagi, and J.R. Goodman, "Inferential Queueing and Speculative Push," Intl J. Parallel Processing, vol. 32, no. 3, 2004, pp. 225-258.
10. R. Rajwar and J.R. Goodman, "Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution," Proc. 34th Ann. ACM/IEEE Intl Symp. Microarchitecture, 2001, pp. 294-305.
11. J.R. Goodman and H.H.J. Hum, "MESIF: A Two-Hop Cache Coherency Protocol for Point-to-Point Interconnects (2004)," tech. report, Univ. of Auckland, https://researchspace.auckland.ac.nz/bitstream/handle/2292/11593/MESIF-2004.pdf?sequence=7.
12. J.R. Goodman and H.H.J. Hum, "MESIF: A Two-Hop Cache Coherency Protocol for Point-to-Point Interconnects (2009)," tech. report, Univ. of Auckland, https://researchspace.auckland.ac.nz/bitstream/handle/2292/11594/MESIF-2009.pdf?sequence=6.
13. F. Tabba et al., "NZTM: Non-Blocking Zero-Indirection Transactional Memory," Proc. 21st Ann. Symp. Parallelism in Algorithms and Architectures (SPAA 09), 2009, pp. 204-213.

James Goodman is a professor in the Department of Computer Science at the University of Auckland and an Emeritus Professor in the Computer Sciences Department at the University of Wisconsin-Madison. Contact him at goodman@cs.wisc.edu.

Micro Economics

................................................................................................................................................................

The Academic and Business Marriage
SHANE GREENSTEIN
Northwestern University


There is no formula for a successful relationship, but a partnership


of convenience negotiated by others
tends not to work very well for either
participant.
I know of such a relationship, and
nobody calls it quits. The relationship renews itself again and again. This arrangement does not involve any civil ceremony,
wedding, or ring.
I am talking about the marriage between business and academics.
Many participants in high technology
take this relationship for granted, and
some live in denial of its presence. The
general partnership is worth a deeper
look. It shapes many of the key cornerstones of high-technology economics.

The superficial relations


Let's first focus on the visible aspects
of the relationship.
Universities provide degrees and
training for many participants in high-technology business. That matters. Technology businesses are a magnet for the
educated. It is no accident, for example,
that Silicon Valley has one of the highest
fractions of college-educated workers of
any economic cluster in the world.
That said, does the presence of a
first-rate university help a local high-tech industry grow? Lots of economic research suggests the official answer is yes, probably.
The answer starts with the mission
statements. Most state public universities

are required to favor applications from


students in the state. Rarely remarked
upon, those mandates help keep many
talented state citizens from moving away at one of the most mobile times in their lives.
The mandate is usually broader.
Some of the great engineering institutions of the United States arose at land-grant public universities; for example, Michigan, Illinois, Purdue, California, and
Cornell. Those state governments (with
a nudge from the federal government)
opened engineering schools, agricultural
extensions, and various other pragmatic
programs. Any history of technology
developments in those states finds
that those public institutions did, in fact,
play such a crucial role in economic
development.
Private universities can play a similar
role even though they have no mandate
to serve their local communities. For
example, Stanford and MIT today play an
important role in their local technology
industries, and Carnegie Mellon and Pitt
have helped Pittsburgh in that city's
recent renewal, and so on.
Most civic leaders recognize this relationship. The recently retired Mayor
Bloomberg was so deeply unsatisfied
with the situation in New York City that he
made the establishment of a new and
local engineering and technically oriented
university one of his major projects.
While there is plenty of evidence to
suggest a link between local economic


growth and an educated populace,


many states don't invest much in their
universities.
Let me explain part of that in a provocative way. In the last half-century, California's economy got more benefit from
the US university system than any other
state because more inventive people
moved there. For example, outside of
Massachusetts, the highest concentration of MIT alumni lives in California, and
I would bet every single engineering
school in the United States, public or private, finds the largest fraction of its
alumni in the golden state.
That has been noticed, and state governments have reacted in multiple ways.
Some have started campaigns to keep
their graduates local, taking a variety of
initiatives. Other state legislators have
simply given up, and calculated that the
state subsidy is not yielding a benefit of
much value to the local taxpayers. That
reasoning rationalizes treating a student's experience like a private investment. When budgets got tight, many
states raised tuition and fees, or cut
funding and subsidies, and, for a variety
of additional political reasons, other
expenditure became the priority for state
budgets.

More subtle interplay


Other more subtle issues arise when
inventions move between the academic
and business realms. Although the
two spheres share some overlapping


research interests, occasionally their different missions come into conflict.


Most pointedly, does money from
business corrupt independent research
at universities? From time to time, stories make the news where this seems to
have happened. For example, we hear of
medical experiments paid for by pharmaceutical companies that yielded suspiciously favorable results.
Though I have no general data to back
up the following assertion, I suspect that
these stories are more the aberration
than the rule. The majority of the time, business managers recognize the importance of academic independence. Most of the time, the business merely wants an efficient experiment.
Think of it this way. There is nothing
wrong with an integrated circuit firm making a donation to the department of material science at the local university.
That department might end up creating
an invention that makes its way to local
industry, where the graduate students
go to work. That looks like enlightened
self-interest.
Don't misinterpret the conclusion.
Any business seeking to make a direct
calculation on the return on investment
is unlikely to be able to precisely quantify
the benefits. The movement of inventions and inventors muddies any picture,
as noted.
More to the point, there is no deterministic chain of connection between
funding of research, invention, and local
economic growth. Lots of examples can
illustrate why and how benefits arise, but
each step involves risks, a distribution of
returns, and inherent unpredictability. Any
specific example might go well or poorly.
In my experience, businesses do get
their money's worth overall because
their funding changes the direction of
attention. Nobody should ever underestimate the capacity of a very smart scientist to toil away the hours on seemingly
useless puzzles, even when the businesses next door could benefit from a
short bit of their problem-solving skills. A
little money goes a long way in redirecting attention.


There are days when I wish these


connections were easier to trace. For
example, not too long ago I attended a
conference organized by a Silicon Valley
venture capital firm for some of their
institutional and limited partners (of
which I am a minor one). At one of the
lunches, a pugnacious older member of
the table launched into vigorous platitudes about the federal budget, calling
for major cuts, including funding of R&D.
He seemed to have no appreciation
that such R&D funds had led to invention
that had set the table for several investment opportunities for these VCs, and
that had made all the investors rich.
Maybe such fools never listen to anybody, but at that moment I wished I had
had a quick sound bite to shut him up.

Two-way relationship
A person's relationship with his or her university does not end with graduation. Some alumni retain their ties, spend time with other alumni, and define their
social existence around their experience.
Universities know about this behavior
in other ways. Some alumni with disposable income and wealth make major
donations to universities. Those funds
contribute to buildings (to which the
donors attach their names), as well
as research institutes inside buildings
(which also display eponymous labels).
Those funds can make an enormous difference to researchers, freeing their time
for experiments.
Private universities have much longer
track records of success tapping into
their alumni base for donations than
public schools, but savvy public schools
have adopted similar practices more
recently. Think of the public institutions
with large, loyal, and financially successful alumni such as the great state institutions in Berkeley, Los Angeles, Ann
Arbor, Champaign, or Austin.
Not that this is easy to actually pull
off. Business can have a hard time buying friends in academics, and both sides
can easily mess up the relationship. The
two partners seem not to have the same
perception of the universe.

I recently found myself taking part in


a discussion with a major high-tech firm
that made a small donation to a national
academic organization whose board I sit
on. The manager at the firm wanted to
shape the message at the organization,
which had remained independent over
its entire existence. It involved small
financial stakes, so both parties found it
cheap to be righteous and indignant. The
conversation got everyone into a huff,
and destroyed some goodwill.
Even when the importance of independence is acknowledged and accounted for, big money at both the public
and private institutions changes their
priorities in subtle and overt ways. For
example, professional schools (engineering, medicine, business, and law) tend to be far better at getting donations. Their
buildings grow, their student populations
do too, and so do their faculty. When
resources are scarce, the growth comes at the expense of the humanities.
Poets have rarely made out materially
well in any society in human history, so
you might reasonably ask why this time is
any different. Call me an economist with a
guilty conscience, but it just seemed easier to accept shorting the artists when it
could be blamed on society's bullies: its arrogant aristocrats and disconnected elites. It seems almost perverse to see
the modern universities cooperate in a
similar conspiracy for the indulgence of their
technical nerds and geeks.

The relationship between business and academics is one of those institutions, such as marriage, that rarely changes its character in the short run. In the long run, however, a bad partnership sours many aspects of economic life, whereas a good partnership can make enormous differences to happiness and welfare. It seems worthwhile to invest in getting it right. MICRO

Shane Greenstein is the Kellogg Chair in Information Technology at the Kellogg School of Management, Northwestern University. Contact him at greenstein@kellogg.northwestern.edu.
