exposing it to the programmer and compiler for optimization. The first approach is to use a clustered architecture in which the execution resources are partitioned into clusters, each with its own data and instruction storage. With this approach, most data and control communication is local to the cluster and does not require global resources. Second, within each cluster we extend the storage hierarchy by adding register files local to each execution unit input. Using these local register files, data can be routed directly from execution unit to execution unit using a minimum of wiring resources. Third, all remaining global communication is performed over a packet-switched network rather than on dedicated wires. This approach makes more efficient use of the global wires and greatly simplifies the design. The remainder of this paper introduces the problem of interconnect in contemporary architectures and describes each of our three approaches in more detail.
Superscalar architectures depend on global communication
Figure 1: Simplified view of a superscalar architecture

A modern superscalar architecture such as the Intel Pentium II (1) uses considerable global communication for both control and data. As illustrated in Figure 1, a typical superscalar architecture is organized around a global register file, from which data is distributed, and a global instruction unit, from which control is distributed. The result is that global communication is required for every instruction, both to route data to and from arithmetic units, and to dispatch control to the selected arithmetic unit. The communication problem is greater than the figure indicates. To achieve good performance, full bypass paths are provided from the output of every ALU to the input of every other ALU, resulting in a full crossbar switch between the ALUs. A similar crossbar exists in the register file, and additional crossbars are needed to check instructions for dependencies and to route instructions from instruction-cache output ports to available execution units.
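The quadratic cost of full bypassing can be made concrete with a small calculation. The sketch below counts the output-to-input bypass paths a full crossbar implies; the ALU counts are illustrative, not taken from any particular processor:

```python
# Sketch: count the global bypass paths implied by a full crossbar
# between ALUs, as in the superscalar organization described above.
# ALU counts are illustrative only.

def bypass_paths(n_alus: int) -> int:
    """Full bypass: every ALU output wired to every other ALU input."""
    return n_alus * (n_alus - 1)

for n in (2, 4, 8, 16):
    print(f"{n:2d} ALUs -> {bypass_paths(n):3d} bypass paths")
# Doubling the number of ALUs roughly quadruples the bypass wiring.
```

This quadratic growth is why adding execution units makes the global-wiring problem worse, not just proportionally larger.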
IITC 99-15
This global organization was appropriate back in 1995, when the number of execution units was relatively small and global wires could be traversed in less than a clock cycle; then the wires could largely be ignored. Today the wires dominate the delay, area, and power of our chips and can no longer be ignored. At the same time, the number of execution units is increasing, making the problem worse. Because the communication in a typical superscalar architecture is implicit, hidden in the mechanics of instruction execution, it cannot be optimized by the programmer or compiler. As wires come to dominate our architectures, we need to make this communication explicit so it can be managed and optimized.
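The shift from gate-limited to wire-limited delay follows from simple scaling: the RC delay of an unrepeatered wire grows with the square of its length. The sketch below uses the standard Elmore approximation with illustrative unit resistance and capacitance per millimeter, not values for any particular process:

```python
# Sketch: distributed RC (Elmore) delay of an unrepeatered wire.
# The per-mm resistance and capacitance here are illustrative
# placeholders, not measurements of any real process.

def wire_delay(length_mm: float, r_per_mm: float = 1.0,
               c_per_mm: float = 1.0) -> float:
    """Elmore delay of a distributed RC line: 0.5 * R * C * L^2."""
    return 0.5 * (r_per_mm * length_mm) * (c_per_mm * length_mm)

for length in (1, 2, 4, 8):
    print(f"{length} mm -> relative delay {wire_delay(length):.1f}")
# Doubling a wire's length quadruples its delay, while gate delay
# shrinks with each process generation.
```

The takeaway matches the argument above: as chips grow relative to feature size, cross-chip wires become the dominant delay term.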
Clustered architectures exploit register and instruction locality
Figure 2: A clustered architecture makes communication explicit.

The communication requirements of a multiple-issue processor can be greatly reduced by dividing the machine into clusters, as illustrated in Figure 2. The ALUs are partitioned into clusters of 2-4 ALUs each (each cluster is denoted by a single ALU symbol in the figure). The registers are then partitioned into local register files, one for each cluster, and each cluster is controlled by its own instruction unit. To communicate data values, ALUs in different clusters exchange data via a packet switch. In a similar manner, the local instruction units synchronize with one another when required. The MIT Multi-ALU Processor pioneered this style of clustered architecture (2,3,4). It has recently been applied to the Alpha 21264 microprocessor for data only (5), but without exposing the clustering to the compiler.

A clustered architecture makes global communication explicit and hence exposed for optimization. A compiler can group instructions that communicate on the same cluster, while placing instructions that should run in parallel on different clusters. The compiler can also reduce latency sensitivity by placing instructions that are not dependent on a potentially high-latency instruction (e.g., a load) in a separate thread. This eliminates false control dependencies without the need for expensive reordering hardware. This compile-time instruction placement is not possible on conventional architectures. Keckler and others have shown that, for a class of benchmarks, replacing the global registers and crossbars of a conventional architecture with the local registers and lower-bandwidth, explicit switch of a clustered architecture results in a 72% reduction in wiring area with little impact on clocks per instruction (2). This study shows that there is significant register locality in programs that can be exploited by making communication explicit.
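As a rough illustration of such compile-time placement, the sketch below uses a simple greedy heuristic, an assumption for illustration rather than the actual MAP compiler algorithm: put each instruction on the cluster that already holds its producers (so operands stay in local registers), and spread independent instructions across clusters so they can issue in parallel.

```python
# Sketch: greedy compile-time placement of instructions onto clusters.
# The instruction format and the greedy policy are illustrative
# assumptions, not the placement algorithm of any real compiler.

def place(instructions, n_clusters):
    """instructions: list of (name, operand_names) in program order.
    Returns a dict mapping instruction name -> cluster index."""
    placement = {}
    load = [0] * n_clusters  # instructions assigned to each cluster
    for name, operands in instructions:
        producers = [placement[op] for op in operands if op in placement]
        if producers:
            # Keep communicating instructions together: pick the cluster
            # holding the most producers of this instruction's operands.
            cluster = max(set(producers), key=producers.count)
        else:
            # Independent instruction: least-loaded cluster, for parallelism.
            cluster = load.index(min(load))
        placement[name] = cluster
        load[cluster] += 1
    return placement

prog = [("a", []), ("b", []), ("c", ["a"]), ("d", ["b"]), ("e", ["c", "a"])]
print(place(prog, 2))  # two independent chains land on two clusters
```

Here the a->c->e chain stays on one cluster and the b->d chain on the other, so no data crosses the inter-cluster switch.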
Figure 3: An extended register hierarchy results in most communication occurring on local wires connecting ALUs.

Extending the storage hierarchy to include several levels of registers, as illustrated in Figure 3, further reduces communication. In a conventional processor there are several levels to the memory hierarchy, but only a single level of registers (all global). Clustered processors extend the register hierarchy one level to include a set of local registers for each cluster. This hierarchy can be extended one step further to include registers associated with each ALU input port, as in the Imagine architecture (6). (A similar arrangement of local registers, in the form of FIFOs, is found in the Cydrome architecture (7), but without the upper levels.) The use of such local registers further reduces communication. With this arrangement, the input operands of each instruction are always available locally, at the input of the ALU. The only communication required is to route the result of each instruction to the register file(s) of the consuming ALU(s). This direct ALU-to-ALU routing is the minimum required to pass the result between the two ALUs. In contrast, a conventional organization would require three cluster-wide communications for each instruction to communicate with the cluster registers. In a series of experiments on graphics, multimedia, and signal processing workloads, Rixner and others have found that this approach reduces global register bandwidth demand by a factor of 10-20 (6). In effect, slow global register (or even cluster register) accesses over long wires have been replaced in most cases by direct ALU-to-ALU communication.
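The savings can be illustrated by counting transfers for a short dependence chain. The sketch below compares a single cluster-wide register file (three cluster-wide transfers per instruction, as described above) against per-ALU-input local registers with direct result forwarding; the chain and the counting model are illustrative assumptions:

```python
# Sketch: count register-file transfers for a dependence chain under
# (a) a single cluster-wide register file and (b) local registers at
# each ALU input with direct result forwarding. The counting model
# is an illustrative simplification.

def cluster_rf_transfers(chain):
    """(a) Each instruction reads two operands from and writes one
    result to the cluster register file: three cluster-wide transfers."""
    return 3 * len(chain)

def local_rf_transfers(chain):
    """(b) Operands already sit in the local files at the ALU inputs;
    the only transfers route each result to its consumers' local files."""
    return sum(len(consumers) for _, consumers in chain)

# A chain of 4 dependent instructions, each feeding one consumer.
chain = [("i0", ["i1"]), ("i1", ["i2"]), ("i2", ["i3"]), ("i3", [])]
print(cluster_rf_transfers(chain))  # cluster-wide transfers
print(local_rf_transfers(chain))    # direct ALU-to-ALU transfers
```

Even this tiny example cuts cluster-wide traffic by 4x; with wider fan-in and reuse of operands, larger reductions like the 10-20x reported for Imagine become plausible.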
Conclusion

New architectures that expose and reduce global communication are required to make effective use of emerging semiconductor technology. Contemporary superscalar architectures, with their global registers, global bypass, and global instruction logic, are poorly suited for these technologies. In this paper we have introduced three approaches to building interconnect-oriented architectures that are being pursued in ongoing projects at Stanford: clustering, register hierarchy, and global networks. By clustering execution units with their associated data and instruction storage, most communication can be kept local to the cluster and the demands on global communication greatly reduced. Within each cluster, the storage hierarchy is extended to the level of individual execution unit inputs, reducing intra-cluster communication to a minimum. When global communication is required, performing it over a shared packet network rather than dedicated wires makes more efficient use of the communication resources and simplifies the design. Designing an architecture to optimize the use of interconnect has already resulted in an order of magnitude reduction in global bandwidth and a factor of 3 reduction in latency. This is in contrast to the factor of 2-3 improvement that can be expected from better materials for conductors and insulators. We have only begun to investigate this very promising area, and many challenges remain. Instruction sets that expose communication yet hide implementation details are needed. Compilers and run-time software that optimize the use of interconnect must be developed. We also need to discover the best way to organize on-chip interconnection networks.
References
(1) Gwennap, Linley, "P6 Underscores Intel's Lead," Microprocessor Report, 9(2), February 1995, pp. 1, 6-15.
(2) Keckler, Stephen and Dally, William, "Processor Coupling: Integrating Compile-Time and Run-Time Parallelism," Proceedings of the 19th Annual International Symposium on Computer Architecture (ISCA-19), 1992, pp. 202-213.
(3) Fillo, Marco, Keckler, Stephen, Dally, William, Carter, Nicholas, Chang, Andrew, Gurevich, Yevgeny, and Lee, Whay, "The M-Machine Multicomputer," International Journal of Parallel Programming, 25(3), 1997, pp. 183-212.
(4) Keckler, Stephen, Dally, William, Maskit, Daniel, Carter, Nicholas, Chang, Andrew, and Lee, Whay, "Exploiting Fine-Grain Thread-Level Parallelism on the MIT Multi-ALU Processor," 25th Annual International Symposium on Computer Architecture (ISCA-25), 1998, pp. 306-317.
(5) Gwennap, Linley, "Digital 21264 Sets New Standard," Microprocessor Report, October 1998.
(6) Rixner, Scott, Dally, William, Kapasi, Ujval, Khailany, Brucek, Lopez-Lagunas, Abelardo, Mattson, Peter, and Owens, John, "A Bandwidth-Efficient Architecture for Media Processing," Proceedings of the 31st Annual International Symposium on Microarchitecture, 1998, pp. 3-13.
(7) Rau, B., Yen, David, Yen, Wei, and Towle, Ross, "The Cydra 5 Departmental Supercomputer: Design Philosophies, Decisions, and Trade-offs," Computer, 22(1), January 1989, pp. 12-35.
(8) Seitz, C., "Let's Route Packets Instead of Wires," Proceedings of the Sixth MIT Conference on Advanced Research in VLSI, W. Dally, Ed., MIT Press, 1990, pp. 133-138.