
InfiniBand Architecture:
Bridge Over Troubled Waters

That is a bad bridge which is shorter than the stream.

German proverb
Research Note
David Pendery
Jonathan Eunice
27 April 2000

The PCI peripheral expansion bus has had a long and illustrious history. Since its inception in 1991, system vendors and users have embraced it like few technical standards before or since. PCI provides a substantial share of the I/O bandwidth and peripheral connectivity across the range of RISC to CISC; PC to enterprise server; proprietary to commodity. User requirements at the advent of the 21st century, however, have rapidly evolved. Not only has computer performance advanced enormously, the very landscape of IT use and connectivity has changed. The PCI standard we have converged on and relied upon for close to a decade is being rapidly outstripped by the demands of ever larger databases, transaction loads, and network user bases. The bridge is beginning to look shorter than the stream.

Fortunately, help is on the way. The InfiniBand Architecture is the industry's answer to the growing I/O problem. InfiniBand replaces the bus-based PCI with a high-bandwidth (multiple gigabytes per second) switched network topology, and shifts I/O control responsibility from processors to intelligent I/O engines commonly known as channels. These approaches have long powered the world's largest servers; InfiniBand now brings them down to virtually every server. InfiniBand is not yet a product, nor even really a standard. The first full specification won't be available until this summer, with the first products appearing in 2001. Initial indications, however, are greatly encouraging. InfiniBand is the right technological advance, emerging at the right time and for the right reasons. To employ a bit of adolescent patois, InfiniBand rocks.


Copyright 2000 Illuminata, Inc.


Illuminata, Inc. • 187 Main Street • Nashua, NH 03060 • 603.598.0099 • 603.598.0199 fax • www.illuminata.com

Presto, Change-O

When PCI established itself in the early 1990s, 66 MHz processors and 10 Mbps networks were fast. 0.8 micron CMOS semiconductor fabrication was state of the art. Early transaction processing benchmarks churned out a whopping 54 transactions per minute.¹ Data warehousing had just been invented. Client-server applications and deployments were increasing, but only the digerati had email, and the Internet as we know it was still years distant.

What a difference a decade makes! Today, multi-terabyte databases running on clustered servers, if not exactly commonplace, are a reality in many shops. Storage has been decoupled from the server, and often extended over a storage-optimized network (SAN). Intel's Pentium III Xeons, now the workhorse of servers and not just PCs, are fabbed at 0.18 micron and run at 800 MHz; 0.13 micron, 1 GHz chips are on the way. The top TPC-C server does 135,815 transactions per minute, and the Internet is now the workshop of IT.

These are the new reality, driving ever-higher user expectations. Fast, unencumbered I/O is the lifeblood of this evolving corpus. Never has such variety of I/O been required to link such scale of hardware and software in such transparent and accelerated ways. And yet, never before have the incumbent I/O technologies been so outstripped by processor capabilities.

PCI = Problematic Computing Interface?

Introduced in response to a morass of incompatible peripheral connectivity and I/O options a decade ago, PCI has been a blessing. Over time it expunged the alphabet soup that was AT/ISA, EISA, HP-PB, MCA, VME, NuBus, SBus, and TurboChannel, among others. It ushered in a long period of wide industry acceptance of a single standard, and thus a stability and predictability that made both product development and selection pleasingly straightforward.

PCI not only standardized I/O attributes, it enabled high bandwidth. Its initial 133 MBps² may seem modest today, but it greatly outpaced the then-standard ISA's maximum 10 MBps and EISA's 33 MBps data transfer rates. As a commodity standard, it minimizes cost to achieve high shipment volumes. Even so, PCI has neatly outperformed virtually all alternatives, including those quite proprietary and specialized.

Then, in a classic case of volume sales providing the investment dollars needed to move a heretofore commodity product upmarket, PCI has dramatically extended its reach. Enhanced versions have doubled both clock speed and bus width, making 264 MBps easily achieved today, with 500+ MBps options available. The HotPlug PCI extension made PCI suitable for high availability servers, and its CompactPCI derivative has driven into embedded systems and telco gear. But although its attributes promised it a long life, PCI's very architecture is ultimately limiting.
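Those nominal figures fall straight out of bus width times clock rate; peak rates only, per the caveat in footnote 2. A quick back-of-the-envelope sketch:

    # Nominal (peak) parallel-bus bandwidth: bus width in bytes times clock rate.
    # These are theoretical maxima; contention and protocol overhead reduce
    # real-world throughput (see footnote 2).
    def peak_mbps(width_bits, clock_mhz):
        return (width_bits / 8) * clock_mhz

    print(peak_mbps(32, 33))    # original PCI: ~132 (commonly quoted as 133 MBps)
    print(peak_mbps(64, 33))    # wider PCI:    264 MBps
    print(peak_mbps(64, 66))    # faster PCI:   528 MBps (the "500+ MBps" options)
    print(peak_mbps(64, 133))   # PCI-X:        1,064 MBps, roughly 1 GBps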
PCI is built upon that simple connectivity structure, the parallel bus. The simple, economical bus structure has been at the base of so many electronic products for so long that it's virtually taken for granted. Yet busses have inherent drawbacks:

• Disorderly contention for resources by peripherals, memory, and CPUs. Disorder breeds inefficiency and suboptimal performance.

• Vexing failure modes. Not only is the bus a potential single point of system failure, failure isolation is difficult or impossible. If one attached card fails, it can cause the entire system to fail. Worse, discovering which card caused the failure is at best a hit-or-miss proposition, a misery in a world needing high availability.

• Severe physical stipulations and limitations. As bus length increases to accommodate more, or more widely dispersed, expansion devices, signaling properties become less stable. The same thing is true for clock rates: the faster the modulation, the shorter the feasible bus, and the fewer peripheral interconnects are possible. In the extreme case, the 133 MHz defined for PCI-X, there can be only a single connector per bus!

PCI's shared structure cannot keep up on a performance basis, nor are its manageability and availability attributes acceptable. As next-generation computing platforms are planned and implemented, PCI will gradually be left behind, as antiquated as 66 MHz microprocessors and 40 MB disk drives.

1. The first TPC-C result, published in 1992.
2. Megabytes per second. Bandwidth figures are nominal, not typical. Such naive peak rates don't consider practical slowdowns such as contention and protocol overhead.

Incremental Upgrades

One could continue to improve PCI a bit, or work around its limitations. Servers needing both high bandwidth and large numbers of expansion slots, for example, are often outfitted with multiple, independent PCI buses. This comes at a cost, of course, but averts an immediate capacity crisis.

The latest PCI-X revision goes further, cleaning up the electrical signal definitions to drive towards 1 GBps (133 MHz x 64 bits).³ It's a significant and promising extension that will extend PCI's life by several years. Even improvements as extensive as PCI-X, however, have ever diminishing returns. The writing is on the wall. Despite PCI's notable run of success, and the fact that it will remain with us for years to come, its ultimate headroom is limited. Bus architectures are fundamentally outpaced by our users' and applications' voracious need for data, and thus for high rates of I/O. Rather than more patches, what we now need is a jump as dramatic as PCI was when it was first introduced. As Mitch Shults, Intel's point man on I/O strategies, says, "the industry has got to move to some fundamentally new architecture." Enter InfiniBand.

On the Way to IBTA

The road to a future I/O standard has been rocky. Even for PCI, vendors were reluctant to give up their favored proprietary options. Sun, for example, while it has supported PCI, to this day favors its own SBus design in its premium servers. But the vastly better economics of a single standard, both for IT producers and consumers, has won the day.

The once-divergent groups such as NGIO (Next Generation I/O, led by Intel) and Future I/O (led by IBM, Compaq, and HP) cast their fates together in August 1999, a move that led to the foundation of the InfiniBand Trade Association (IBTA). The IBTA is largely based on the successful PCI SIG model. Even after formation there have been some disagreements about how quickly InfiniBand should appear, and how encompassing it should be when it does appear. There have also been tensions between IBTA members and external constituencies such as the embedded systems community. But this is to be expected. The free market is contentious by nature. And, as they say, you can't make an omelette without breaking a few eggs. At the end of the day, these participants know that the customer uptake rate for their next-generation servers depends on solving I/O bottlenecks, and on not creating a divisive standards war. Thus, whatever disagreements they may have, they are all highly motivated to find a common and standard solution.

The IBTA leaders (officially, Steering Members), namely IBM, Intel, Compaq, Hewlett-Packard, Dell, Microsoft, and Sun Microsystems, reason that it's better to have a smaller group get something practical and effective out the door than to hear everyone's wishlist. In addition to the Steering Members, Sponsoring Members include 3Com, Adaptec, Cisco, Fujitsu-Siemens, Hitachi, Lucent, NEC, and Nortel Networks. It's a potent brain trust, among them the owners of the best I/O technologies and intellectual property in the industry.

The Goods

InfiniBand is the cavalry to the rescue, the I/O standard and workhorse emerging for the new generation. So what exactly is it?

InfiniBand is a network approach to I/O. A system connects to the I/O fabric with one or more Host Channel Adapters (HCAs). Devices, such as storage and network controllers, attach to the fabric with a Target Channel Adapter (TCA). InfiniBand adapters (generically, CAs) are addressed by IPv6 addresses, just as any other network node might be.

The fabric concept may seem abstract to someone who's used to fitting a card in a slot, but it's exactly what happens on any other network, whether of the traditional LAN/WAN/Internet variety, or the storage area networks (SANs) now rapidly entering data centers. The physical fabric combines connectors, cables, and switches. Current specifications call for one-, four-, and twelve-wide link options, corresponding to 500 MBps, 2 GBps, and 6 GBps bandwidths.⁴,⁵ Whereas PCI distances can easily be measured in inches or centimeters, InfiniBand links are designed to run ~17 meters (data center distances) using copper cabling, or to 100 meters (intra-building or small-campus distances) with fibre optic links. The connectors will resemble today's Ethernet (RJ45) and Fibre Channel ports. Extenders, protocol switchers, and fibre cabling may increase this a bit, say to 1 km (with multimode fibre) or a few km (with single mode fibre). The 1,000 km spans common to WAN links are impractical given the need to minimize end-to-end latency.

[Figure: Anatomy of an InfiniBand link. A host channel adapter and a target channel adapter exchange messages, carried as packets over virtual lanes; the physical medium is differential wire pairs, four wires for a one-wide link.]

3. A speed that was aggressive for even the best system busses just five years back.
4. Serial links are conventionally described in bits/sec, not the bytes/sec of parallel links. Each InfiniBand width unit drives 2.5 Gbps (250 MBps) in each direction. Doing the math, 4-wide = 10 Gbps (1 GBps/direction), 12-wide = 30 Gbps (3 GBps/direction).
5. When implemented in copper links, each width unit uses two wire pairs for differential signalling, resulting in 4-, 16-, and 48-wire copper connections.
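The width arithmetic in footnotes 4 and 5 tallies the same way. A small sketch using the per-lane figures cited there (2.5 Gbps, i.e. 250 MBps, per width unit in each direction; two differential wire pairs per width unit in copper):

    # Link-width math per footnotes 4 and 5: each width unit carries 2.5 Gbps
    # (250 MBps) in each direction; in copper, each unit uses two differential
    # wire pairs (4 wires).
    for width in (1, 4, 12):
        gbps_each_way = width * 2.5          # 2.5, 10, 30 Gbps per direction
        mbps_both_ways = width * 250 * 2     # 500, 2,000, 6,000 MBps combined
        copper_wires = width * 4             # 4, 16, 48 wires
        print(width, gbps_each_way, mbps_both_ways, copper_wires)

The combined two-direction figures are what the text quotes as 500 MBps, 2 GBps, and 6 GBps.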


Regardless of the number of wires (i.e., bandwidth grade) or physical dispersion, InfiniBand uses a single set of logical structures for how nodes are addressed (IPv6), what protocols and APIs are used, and how the components are pieced together.

Time-to-market issues will make early TCA implementations equivalent to HCAs, but later refined implementations will rapidly cost- and space-minimize TCAs to enable high-volume sales and inclusion in denser and more embedded configurations. Card-sized CAs will give way to multi-chip semiconductor implementations, then single-chip ones, and finally to modules that can be optionally included in CPUs and ASICs. As with most I/O options, high-end servers, storage arrays, and peripherals will be first to implement and deploy InfiniBand. These are the units that most need the added performance, and for which the higher initial costs will be most easily absorbed.

InfiniBand Everywhere

Ultimately, "InfiniBand everywhere" will be the rallying cry, just as PCI expanded its purview to both larger and smaller systems. Within a few years, we predict that InfiniBand will be the default way of connecting servers to other servers (in clusters and MPP systems), to storage (somewhat displacing Fibre Channel SANs, especially in rack- and room-area fabrics), and to network adapters and infrastructure (including directly into Internet routers and switches).

Despite these high-end ambitions, as the successor to PCI, InfiniBand is still about in-chassis I/O, shipped in high volume and at low price points. This deployment, which makes a switched network a rack- and motherboard-level feature, will remake system form factors. 2U and 1U rack-and-stack servers may seem like dense computing today, but InfiniBand's small connectors, flexible cabling, and network approach will fundamentally compress computing complexes. Within a few years, expect today's 0.5-1U per CPU densities to fall well under 0.5U/processor, perhaps beneath 0.2U per. Density isn't everything (cost and high availability are also key), but ISPs, ASPs, and other service providers will be particularly glad to further minimize IT footprints.

Though further out than server and workstation deployments, embedded computing is another area of InfiniBand opportunity. Network switches, telco gear, wireless hubs, industrial automation, and telemetry units are all eventual targets.⁶

Part and parcel of the InfiniBand transformation will be the leveraging of switch design skills and investments at system, network, and storage OEMs to support generalized InfiniBand fabrics. In great measure, InfiniBand will work because it brings so many strong players to the table.

6. Albeit in competition with the still-viable CompactPCI and Motorola's emerging RapidIO initiative.


Changing the Guard

Saying that InfiniBand is a networked I/O standard is true, but hardly scratches the surface of the design. It is, for example, also a channel-based approach.

Instead of the memory-mapped load/store paradigm of PCI, InfiniBand uses a message-passing send/receive model. This, in concert with the endpoint addressability, is essential in ensuring utterly robust, reliable operations. Transmissions are demarcated into distinct work queue pairs, with packets distributed and disseminated throughout the InfiniBand network. Adapters take on the responsibility for handling transmission protocols, and InfiniBand switches take on responsibility for making sure packets get where they're supposed to be. This distribution of work is common, for example, in S/390 mainframes.⁷

Tom Bradicich, IBM's Intel server technologist and a prime mover behind InfiniBand, is fond of using a mailman metaphor. CPUs and hosts pass data into memory for use by targets, and then move along to other tasks, just as a mailman drops your messages and moves along to the next house. This functioning is fundamental to the specification. Wrenching I/O off the PCI bus, imbuing it with higher-order organization schema, and pressing it into better-managed and more tightly controlled service inside and outside the box is InfiniBand's raison d'être.

7. Perhaps there really is nothing new under the sun!
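To make the send/receive model concrete, here is a deliberately simplified sketch; the names are ours, not the IBTA verbs interface still being specified. Software posts work requests to a queue pair and moves on; the channel adapter later drains the queues and reports completions, the mailman model in miniature.

    # Illustrative toy model of the work-queue, send/receive idea.
    # Class and method names are ours, not InfiniBand's.
    from collections import deque

    class QueuePair:
        """One send queue and one receive queue owned by a channel adapter."""
        def __init__(self):
            self.send_queue = deque()      # outbound work requests
            self.recv_queue = deque()      # pre-posted receive buffers
            self.completions = deque()     # what the host polls later

        def post_send(self, message):
            # The host posts work and returns immediately (the "mailman" model).
            self.send_queue.append(message)

        def post_receive(self, buffer):
            # Pre-posted buffers tell the adapter where inbound data may land.
            self.recv_queue.append(buffer)

        def adapter_progress(self, deliver):
            # The adapter, not the CPU, moves the data and notes completions.
            while self.send_queue:
                deliver(self.send_queue.popleft())
                self.completions.append("send complete")

    qp = QueuePair()
    qp.post_send(b"a block of application data")
    qp.adapter_progress(deliver=print)     # 'print' stands in for the fabric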
The controlling mechanisms are quite sophisticated. Addressing nodes with IPv6, for example, will allow easy and direct linkage with Internet routers and gateways. And while physical layer implementations are organized around a given number of wires, the logical structure is very general. The links are bi-directional and composed of up to 16 virtual lanes, along any of which a given packet may travel.
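As an illustration of how a packet might be assigned to a lane (a sketch under our own simplifying assumptions, borrowing the credit-based flow control listed in the components table at the end of this note), each lane on a physical link is metered independently, so a stalled lane does not block the others:

    # Sketch only: per-lane, credit-based transmission over one physical link.
    # The 16-lane figure comes from the text; the rest is our simplification.
    class Link:
        def __init__(self, lanes=16, credits_per_lane=8):
            self.credits = {vl: credits_per_lane for vl in range(lanes)}

        def send(self, vl, packet):
            # A packet may travel on any virtual lane, but only if that lane
            # still has credits; otherwise it waits without blocking the others.
            if self.credits[vl] == 0:
                return False                  # lane stalled, others unaffected
            self.credits[vl] -= 1
            return True                       # packet accepted onto the wire

        def credit_returned(self, vl):
            # The receiver grants credits back as it frees buffer space.
            self.credits[vl] += 1

    link = Link()
    link.send(vl=0, packet=b"fabric management traffic")   # e.g., a reserved lane
    link.send(vl=5, packet=b"application data")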
The idea is getting server I/O onto the network, and ultimately the Internet. This goes well beyond the remote I/O found, for example, in a few of today's newest high-end servers. There are rich possibilities in this flexible methodology. The technology's switched design, message/packet basis, fat pipes, and extensive controlling mechanisms will underpin architectures and network schema for the next decade.

Thinking Outside The Box

Future IT will be largely dictated by Internet-style networked computing. In some ways, the Internet mindset is simply an extension of trends that had been developing for three decades. Over time, compute functions have become steadily more atomized and distributed, devices have become more intelligent, client-server has been integrated into the Web, and the local network has extended into a global network. In short, data have steadily been cast farther away from their home bases. I/O, by definition the movement of data, has of necessity had to be integrated across wider spans.

This dilative phenomenon is the externalization of I/O. External I/O requires common protocols to link the traffic between the computing devices and the controlling mechanisms referred to above.

The application is king. But oftentimes, applications and databases are starved for data. There are bottlenecks, latencies, and congestion that simply arrest performance. Here, InfiniBand will make a difference. Very large databases are now in the terabyte-and-above range, with some spanning to 50 TB. Very large, distributed engines are needed to process such information. These massively parallel systems or compute farms are essentially clustered systems. Whether the DBMS provider is IBM or Oracle, NCR or Compaq's Tandem division, these clusters require high-bandwidth, low-latency interconnects. Specialized proprietary designs are the common result. IBM's SP switch, NCR's BYNET, and Compaq's ServerNet and Memory Channel are commonly used in their largest distributed engines.


[Figure: Example InfiniBand fabrics. Hosts (CPUs, memory, memory controller, DMA, and an HCA on the system interconnect) connect through InfiniBand switches over 1-, 4-, or 12-wide links to TCAs and their DMA engines fronting target controllers (a compression engine, a Fibre Channel controller), to network routers and other network protocols, and by way of IB switches or directly to the device functions being controlled (disks, etc.).]

As InfiniBand moves into its more refined switched generation (in 2002-2003), it will provide exactly the sort of high-speed, low-latency packet switching needed by these clusters. Its support for cascaded switches, fabric partitions (also called zones), and inherent multicasting is well suited to large clusters. This prowess combines with its industry standardization to drive what have been high-end cluster technologies into the mainstream.

Once fully formed, InfiniBand will enable massive horizontal scalability and transparent I/O sharing among cluster nodes. Indeed, just as InfiniBand brings the intelligent channel idea down from the S/390, it will enable the intelligent-everything (CPU module, disk, network controller) model of Compaq's Himalayas. This is the key not only to huge performance scalability, but to achieving it in a highly available, even fault tolerant, way.

Three Precious Words

Reliability, availability, and serviceability may not compare to "I love you," but CIOs can't say "I love you" to any server technology that doesn't have RAS at the heart of its design. InfiniBand again connects with the horsehide. Not only does InfiniBand directly support the RAS attributes inherent in multi-system clustering, its physical, electronic, and logical design are RAS-friendly. Also remember that the NCR, Tandem, and IBM parallel/cluster systems discussed above comprise some of the highest-quality hardware and software technologies in IT. InfiniBand will take its place as the backbone of these clusters.

Further, the InfiniBand protocols support channel-to-channel I/O failover across InfiniBand links, should a Host detect a Target failure. Redundant InfiniBand links will of course be required for this functionality.
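That failover behavior is simple to picture. A minimal sketch, under our own assumptions (hypothetical names; the actual detection and recovery mechanics belong to the specification), of a host redirecting traffic to a surviving redundant link:

    # Sketch of channel-to-channel failover across redundant links.
    # The detection and switch-over details here are illustrative only.
    def send_with_failover(message, links, is_alive):
        """Try each redundant link in turn until one reaches a live target."""
        for link in links:
            if is_alive(link):          # e.g., the host detects a target failure
                link.send(message)
                return link             # traffic continues on the surviving path
        raise RuntimeError("all redundant links have failed")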
Note also that clustering is based on a distributed-memory model that improves availability by diffusing points of failure. In a similar vein, InfiniBand's concept of creating myriad I/O controllers, most of them located outside the server chassis, enables component separation and redundancy, eliminating the PCI bus's single domain of failure. Finally, InfiniBand's message-passing paradigm and protocols incorporate layers of error management. The technology is also being designed for device hot-addability, including device look-up and registration, which will aid IT professionals in dynamically managing, modifying, and augmenting their networks.

Clusters by definition increase scaling: that's how vendors get those 64-, 128-, 256-, and 1,024-way systems. InfiniBand's cascadeable switching will stretch clustering in a big way, dramatically accentuating horizontal scalability. Using InfiniBand switching, partitioning, and fabric management, combined with memory management and control, servers will be configured into first- and second-order networks with many hosts and I/O end nodes. Memory and other resources will be shared in these overlapping subnets of physical and virtual servers and their supporting components, all playing roles in clustered environments. Additionally, these subnets can be separated for functional isolation, increasing management control, availability, and performance.

You Can Take I/O Out of the Network,
But You Can't Take the Network Out of I/O

We have referred to the Internet and its impact on enterprise computing. The clustered servers and storage that control inter- and intra-application communications will be at the eye of the Internet computing whirlwind for years to come. InfiniBand's physical and logical attributes will extend server I/O far and wide: out of the box, out of the data center, out of the network, and onto the Internet. The IBTA is basing the addressability of the InfiniBand Architecture in large part on IPv6, enabling not only efficient local manageability, but also prepping InfiniBand for its ventures out into the Internet. Source (HCA) and destination (TCA) IPv6 addresses are embodied in the InfiniBand Global Route Header, which is used to route packets between HCAs and TCAs, across linked subnets, and out into the world at large. It's a very well-thought-out design, simultaneously accommodating local-, wide-, and global-area fabric management.
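To illustrate that addressing scheme in miniature (our own simplification; the names and the lookup below are not the specification's), forwarding reduces to matching the Global Route Header's destination address against subnet routes:

    # Simplified illustration of routing on IPv6-style global addresses.
    # Structures and names are ours, not the specification's.
    from ipaddress import IPv6Address, IPv6Network

    class Router:
        """Forwards packets toward the subnet that holds the destination CA."""
        def __init__(self, routes):
            self.routes = routes                # {IPv6Network: next-hop label}

        def next_hop(self, grh_destination):
            # The Global Route Header's destination address decides forwarding.
            address = IPv6Address(grh_destination)
            for subnet, hop in self.routes.items():
                if address in subnet:
                    return hop                  # toward the destination's subnet
            return "default uplink"             # out into the world at large

    router = Router({IPv6Network("fe80::/64"): "local InfiniBand subnet"})
    print(router.next_hop("fe80::1"))           # stays on the local fabric
    print(router.next_hop("2001:db8::7"))       # routed off toward remote subnets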
Conclusion

"The old order changeth, yielding place to new," wrote Tennyson. IT professionals live these words. For system designers, InfiniBand starts this cycle anew with a generational change in computing architectures.

Bus-based I/O is giving way to serial communication links, and processor-driven I/O to intelligent I/O engines, or channels. InfiniBand both enables this change and provides a standard for it, one that unifies interconnectivity across servers, storage, and networking as few technologies have done before.

Five IBTA working groups are busily designing InfiniBand: working out its protocols, electrical signaling, register models, data structures, verbs, memory and semantic operations, software, management, and physical/mechanical specifications. Version 0.9 runs some 900 detailed pages. Version 1 is due in the summer. We can hardly wait for the technology's improved bandwidth and grand-scale architectural possibilities!

That InfiniBand will eventually meet its own maturity and demise hardly quells our enthusiasm; that endgame is another ten years distant. We like InfiniBand because we like the idea of server and workstation data flowing along communication lines across lattices of clustered, process-sharing hardware and software. InfiniBand is the right technology at the right time for the right reasons to realize this dream. InfiniBand rocks!


Components and Features of InfiniBand

Host and Target Channel Adapters
Attributes: Multi-port; tool-free, single-axis insertion; commodity form factors; IPv6 addressable.
Advantages and uses: Allow direct access to memory by applications and other I/O; reduce CPU, OS kernel, and peripheral traffic to memory; allow for remote direct memory access; translate and validate messages; enable load-balancing and redundancy.

Switches/Routers
Attributes: IPv6 addressable and routeable; cascadeable switches; partitionable traffic zoning; QoS enablement.
Advantages and uses: Relieve bus contention; enable partitioning and subnet creation for clustering and manageability; enable multicasting; allow for scalable cascading of multiple switches; routers dispatch data across switched subnets.

Links
Attributes: One-, four-, and 12-wide bandwidths; copper (differential signalling) and fibre physical links; 1-16 bi-directional, independently assigned channels/lanes per link (one reserved for fabric management, one for application usage); credit-based flow control; static rate control; auto-negotiation/mapping algorithm.
Advantages and uses: Multiplexing and logically connected address spaces allow for refined manageability and arbitration; connect hosts and targets of different speeds and widths; multiple speed grades match varying price points.

Messages/Packets
Attributes: Routing at the packet level; message segmentation and re-assembly; Cyclical Redundancy Checking; interleaved packets across channels; IPv6 addressing headers; memory protection; remote DMA.
Advantages and uses: Allow for granularity in identifying and controlling IPC and I/O processes; enable error detection and correction; allow for refined application and I/O management and control; allow for IP-compatible fabric management; enable processor-independent and serverless I/O.
