
Interconnect-Limited VLSI Architecture

William J. Dally Computer Systems Laboratory Stanford University


Abstract

As semiconductor technology scales, wires are becoming the dominant factor in determining system performance and power dissipation. By 2008, it is expected that chip traversal will require 16 clocks. Modern superscalar architectures that depend on global register files, global bypass structures, and global instruction issue logic are poorly matched to tomorrow's semiconductor technology. This technology demands architectures that exploit locality and minimize global communication. In this paper we describe three approaches to developing architectures that are well matched to interconnect-limited technology. These architectures reduce the use of global communication by clustering execution resources with their data and instruction storage and extending the storage hierarchy to the level of individual ALUs. They also make more efficient use of global interconnect by organizing it as a regular network, rather than a collection of ad-hoc dedicated wires.

Introduction

Contemporary computer architectures evolved in a logic-centric era when wires were considered free in terms of delay and power. As a result, these architectures depend heavily on implicit global communication to distribute data and instructions to execution units. These global architectures were well matched to a logic-limited technology; they use profligate communication to optimize the use of execution units. However, they are poorly matched to emerging wire-limited semiconductor technologies in which traversing a chip takes 16 clocks and dissipates considerable power. In these new technologies, it is more important to optimize the use of the wires than the use of the execution units.

There is tremendous opportunity in designing architectures for wire-limited technologies. Early efforts have already shown that by eliminating gratuitous global communication, these architectures reduce global bandwidth demands by an order of magnitude or more. By making global communication explicit, and then optimizing it, these architectures can reduce average global wire delays by a factor of 3 or more. Also, by casting global communication as a packet routing rather than a wire routing problem, these architectures can greatly reduce the cost and complexity of designing future chips.

This paper describes three complementary approaches to developing architectures that are well suited for interconnect-limited VLSI technology. These approaches are all designed to make global communication explicit, thus exposing it to the programmer and compiler for optimization. The first approach is to use a clustered architecture in which the execution resources are partitioned into clusters, each with its own data and instruction storage. With this approach, most data and control communication is local to the cluster and does not require global resources. Second, within each cluster we extend the storage hierarchy by adding register files local to each execution unit input. Using these local register files, data can be routed directly from execution unit to execution unit using a minimum of wiring resources. Third, all remaining global communication is performed over a packet-switching network rather than on dedicated wires. This approach makes more efficient use of the global wires and greatly simplifies the design.

The remainder of this paper introduces the problem of interconnect in contemporary architectures and describes each of our three approaches in more detail.
Superscalar architectures depend on global communication

Figure 1: Simplified view of a superscalar architecture (blocks: Global Register File; Global Instruction Fetch & Issue)

A modern superscalar architecture such as the Intel Pentium II (1) uses considerable global communication for both control and data. As illustrated in Figure 1, a typical superscalar architecture is organized around a global register file, from which data is distributed, and a global instruction unit, from which control is distributed. The result is that global communication is required for every instruction, both to route data to and from arithmetic units, and to dispatch the control to the selected arithmetic unit.

The communication problem is greater than the figure indicates. To achieve good performance, full bypass paths are provided from the output of every ALU to the input of every other ALU, resulting in a full crossbar switch between the ALUs. A similar crossbar exists in the register file, and additional crossbars are needed to check instructions for dependencies and to route instructions from instruction-cache output ports to available execution units.
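The growth of the full bypass crossbar described above can be illustrated with a small count. This is only a sketch: it assumes two operand inputs per ALU (a hypothetical parameter, not stated in the paper) and counts one point-to-point path from each ALU output to each input of every other ALU. The function name is likewise invented for illustration.

```python
def full_bypass_paths(num_alus: int, inputs_per_alu: int = 2) -> int:
    """Count point-to-point bypass paths in a full ALU-to-ALU crossbar.

    Assumes every ALU output is wired to each operand input of every
    other ALU, and that each ALU has `inputs_per_alu` operand inputs
    (an illustrative assumption).
    """
    if num_alus < 1 or inputs_per_alu < 1:
        raise ValueError("counts must be positive")
    return num_alus * (num_alus - 1) * inputs_per_alu


if __name__ == "__main__":
    # The path count grows roughly with the square of the ALU count.
    for n in (2, 4, 8, 16):
        print(f"{n:2d} ALUs -> {full_bypass_paths(n)} bypass paths")
```

Under these assumptions, doubling the number of ALUs roughly quadruples the number of bypass paths, which is one way to see why adding execution units makes the wiring problem worse.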

0-7803-5174-6/99/$10.00 © 1999 IEEE

IITC 99-15

This global organization was appropriate back in 1995 when the number of execution units was relatively small and global wires could be traversed in less than a clock cycle. Then the wires were largely ignored. Today the wires dominate the delay, area, and power of our chips and can no longer be ignored. At the same time the number of execution units is increasing, making the problem worse. Because the communication in a typical superscalar architecture is implicit, hidden in the mechanics of instruction execution, it cannot be optimized by the programmer or compiler. As wires come to dominate our architectures we need to make this communication explicit so it can be managed and optimized.
Clustered architectures exploit register and instruction locality

Figure 2: A clustered architecture makes communication explicit.

The communication requirements of a multiple-issue processor can be greatly reduced by dividing the machine into clusters as illustrated in Figure 2. The ALUs are partitioned into clusters of 2-4 ALUs each (each cluster is denoted by a single ALU symbol in the figure). The registers are then partitioned into local register files, one for each cluster. Each cluster is controlled by its own instruction unit. To communicate data values, ALUs in different clusters exchange data via a packet switch. In a similar manner, the local instruction units synchronize with one another when required. The MIT Multi-ALU Processor pioneered this style of clustered architecture (2,3,4). It has recently been applied to the Alpha 21264 microprocessor for data only (5), but without exposing the clustering to the compiler.

A clustered architecture makes global communication explicit and hence exposed for optimization. A compiler can group instructions that communicate on the same cluster, while placing instructions that should run in parallel on different clusters. The compiler can also reduce latency sensitivity by placing instructions that are not dependent on a potentially high-latency instruction (e.g., a load) in a separate thread. This eliminates false control dependencies without the need for expensive reordering hardware. This compile-time instruction placement is not possible on conventional architectures.

Keckler and others have shown that, for a class of benchmarks, replacing the global registers and crossbars of a conventional architecture with the local registers and a lower bandwidth, explicit switch of a clustered architecture results in a 72% reduction in wiring area with little impact on clocks per instruction (2). This study shows that there is significant register locality in programs that can be exploited by making communication explicit.

A bandwidth hierarchy keeps most communication on short wires

Figure 3: An extended register hierarchy results in most communication occurring on local wires connecting ALUs.

Extending the storage hierarchy to include several levels of registers, as illustrated in Figure 3, further reduces communication. In a conventional processor, there are several levels to the memory hierarchy, but only a single level of registers (all global). Clustered processors extend the register hierarchy one level to include a set of local registers for each cluster. This hierarchy can be extended one step further to include registers associated with each ALU input port, as in the Imagine architecture (6). A similar arrangement of local registers, in the form of FIFOs, is found in the Cydrome architecture (7), but without the upper levels.

The use of such local registers further reduces communication. With this arrangement, the input operands of each instruction are always available locally, at the input of the ALU. The only communication required is to route the result of each instruction to the register file(s) of the consuming ALU(s). This direct ALU to ALU routing is the minimum required to pass the result between the two ALUs. In contrast, a conventional organization would require three cluster-wide communications for each instruction to communicate with the cluster registers. In a series of experiments on graphics, multimedia, and signal processing workloads, Rixner and others have found that this approach reduces global register bandwidth demand by a factor of 10-20 (6). In effect, slow global register (or even cluster register) accesses over long wires have been replaced in most cases by direct ALU to ALU communication.
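The contrast between the two organizations can be sketched by counting transfers for a toy dataflow fragment. The program below, the function names, and the counting rules are illustrative assumptions rather than the methodology of (6): the conventional count charges two operand reads and one result write per instruction, while the direct count charges one route for each use of a previously computed result and ignores live-in operands and live-out results.

```python
# Each instruction is (result, operand_a, operand_b). The names a, b, c, d
# are live-in values assumed to already sit in local input registers.
PROGRAM = [
    ("t0", "a", "b"),
    ("t1", "c", "d"),
    ("t2", "t0", "t1"),
    ("t3", "t2", "t0"),
]


def cluster_register_transfers(program):
    """Cluster-wide transfers with a single cluster register file:
    two operand reads plus one result write per instruction."""
    return 3 * len(program)


def direct_alu_transfers(program):
    """Transfers with a register file at every ALU input: each use of a
    computed result is one route from the producing ALU to the consuming
    input's register file."""
    produced = {result for result, _, _ in program}
    transfers = 0
    for _, operand_a, operand_b in program:
        for operand in (operand_a, operand_b):
            if operand in produced:
                transfers += 1
    return transfers


if __name__ == "__main__":
    print("cluster register file:", cluster_register_transfers(PROGRAM))
    print("direct ALU to ALU:    ", direct_alu_transfers(PROGRAM))
```

The ratio obtained from such a toy fragment says nothing about the measured factor of 10-20; the sketch only shows where the savings come from, namely that operand reads disappear and results travel once per consumer over local wires.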


Route packets, not wires


Today, global connections on chips are made by routing wires from one point to another. This is true of the data paths that carry operands between memory arrays, register files, and operation units and also for the control paths that carry instructions and sequencing information. In the near future we envision that most global connections on chips, both data and control, will be made not by dedicated wires, but rather over on-chip communication networks (as was suggested for circuit boards in (8)). There are three compelling reasons to expect this change.

First, the need to put repeaters into long wires allows us to add the switching needed to implement a network at little additional cost. Today, in a 0.25 µm process, repeaters are required about every 4 mm for performance-critical signals. By 2008, it is expected that this repeater spacing will be less than 1 mm. Every 1 mm, even a dedicated wire will need to be connected down to a buffer fabricated on the substrate. With the addition of just a few transistors, a multiplexer can be included with the buffer to implement the data path of a network switch. Some logic is also needed to control this switch. However, only one copy of this logic is needed for a multi-bit switch, and this logic can be pipelined ahead of the data so as not to slow operation.

Second, using a network rather than dedicated wiring makes more efficient use of critical global wiring resources by allowing these resources to be shared by different senders and receivers. With dedicated wiring, when a module is idle, the dedicated wires it is attached to go unused. With a network, these wires are available to route other traffic.

Finally, restricting global wiring to a regular network greatly simplifies the design process. In doing so, it enables high-performance circuit techniques and facilitates design re-use. With a network, only one set of global interconnect need be designed for all chips implemented in a given process. Regardless of the function of the chip, its global communication is performed by routing packets over this interconnect. Further, the interconnect is a regular structure: a single tile containing a basic routing element and associated wires that is repeated across the chip in both dimensions. As with a memory cell, considerable effort can be expended on optimizing and verifying this basic element, yielding a highly reliable, high-performance design. For example, it is possible to employ special low-energy signaling techniques, to use optimized transmission lines, and to carefully analyze cross talk in such a regular structure in a way that is not practical for random, dedicated wiring. Once such a network is in use, it also provides a convenient interface to connect to semiconductor IP of all varieties.
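The idea of a tiled network whose wires are shared by different senders and receivers can be sketched as follows. The paper does not specify a routing algorithm; this example assumes a two-dimensional mesh of tiles addressed by (x, y) coordinates and uses simple dimension-order routing (x first, then y) purely for illustration. The function names and the sample traffic are hypothetical.

```python
from collections import Counter


def xy_route(src, dst):
    """Return the tiles visited from src to dst, moving along x first
    and then along y (dimension-order routing, assumed here)."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def link_usage(packets):
    """Count how many packets cross each directed tile-to-tile link."""
    usage = Counter()
    for src, dst in packets:
        path = xy_route(src, dst)
        for link in zip(path, path[1:]):
            usage[link] += 1
    return usage


if __name__ == "__main__":
    # Two unrelated sender/receiver pairs share the link (1,0) -> (2,0).
    traffic = [((0, 0), (2, 1)), ((1, 0), (2, 0))]
    for link, count in sorted(link_usage(traffic).items()):
        print(link, count)
```

With dedicated wiring, each of these sender and receiver pairs would need its own wires; in the sketch both packets traverse the same link between neighboring tiles, which is the sharing effect described in the second reason above. Contention and buffering, which a real switch must handle, are not modeled.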

Conclusion

New architectures that expose and reduce global communication are required to make effective use of emerging semiconductor technology. Contemporary superscalar architectures with their global registers, global bypass, and global instruction logic are poorly suited for these technologies. In this paper we have introduced three approaches to building interconnect-oriented architectures that are being pursued in ongoing projects at Stanford: clustering, register hierarchy, and global networks. By clustering execution units with their associated data and instruction storage, most communication can be kept local to the cluster and the demands on global communication greatly reduced. Within each cluster, the storage hierarchy is extended to the level of individual execution unit inputs, reducing intra-cluster communication to a minimum. When global communication is required, performing this communication over a shared packet network rather than dedicated wires makes more efficient use of the communication resource and reduces design complexity.

Designing an architecture to optimize the use of interconnect has already resulted in an order of magnitude reduction in global bandwidth and a factor of 3 reduction in latency. This is in contrast to the factor of 2-3 improvement that can be expected from better materials for conductors and insulators. We have only begun to investigate this very promising area and many challenges remain. Instruction sets that expose communication, yet hide implementation details, are needed. Compilers and run-time software that optimize the use of interconnect must be developed. We also need to discover the best way to organize on-chip interconnection networks.

References

(1) Gwennap, Linley, "P6 Underscores Intel's Lead," Microprocessor Report, 9(2), February 1995, pp. 1, 6-15.
(2) Keckler, Stephen and Dally, William, "Processor Coupling: Integrating Compile-Time and Run-Time Parallelism," Proceedings of the Annual International Symposium on Computer Architecture, ISCA-19, 1992, pp. 202-213.
(3) Fillo, Marco, Keckler, Stephen, Dally, William, Carter, Nicholas, Chang, Andrew, Gurevich, Yevgeny, and Lee, Whay, "The M-Machine Multicomputer," International Journal of Parallel Programming, 25(3), 1997, pp. 183-212.
(4) Keckler, Stephen, Dally, William, Maskit, Daniel, Carter, Nicholas, Chang, Andrew, and Lee, Whay, "Exploiting Fine-Grain Thread-Level Parallelism on the MIT Multi-ALU Processor," 25th Annual International Symposium on Computer Architecture, ISCA-25, 1998, pp. 306-317.
(5) Gwennap, Linley, "Digital 21264 Sets New Standard," Microprocessor Report, October 1998.
(6) Rixner, Scott, Dally, William, Kapasi, Ujval, Khailany, Brucek, Lopez-Lagunas, Abelardo, Mattson, Peter, and Owens, John, "A Bandwidth-Efficient Architecture for Media Processing," Proceedings of the 31st Annual International Symposium on Microarchitecture, 1998, pp. 3-13.
(7) Rau, B., Yen, David, Yen, Wei, and Towle, Ross, "The Cydra 5 Departmental Supercomputer: Design Philosophies, Decisions, and Trade-offs," Computer, 22(1), January 1989, pp. 12-35.
(8) Seitz, C., "Let's Route Packets Instead of Wires," Proceedings of the Sixth MIT Conference on Advanced Research in VLSI, W. Dally, Ed., MIT Press, 1990, pp. 133-138.
