
SUPERSCALAR AND VLIW

TEACHER: SIR AQEEL


ASSIGNMENT: COMPUTER ARCHITECTURE
MUHAMMAD HAYAT KARAM
4/30/2012

SUPERSCALAR PROCESSING
Superscalar processors are designed to exploit more instruction-level parallelism in user programs than a scalar processor can. Only independent instructions can be executed in parallel without causing a wait state. The amount of instruction-level parallelism varies widely depending on the type of code being executed.

Superscalar processing, the ability to initiate multiple instructions during the same clock cycle, is the latest in a long series of architectural innovations aimed at producing ever faster microprocessors. Introduced in the early 1990s, superscalar microprocessors are now being designed and produced by all the microprocessor vendors for high-end products. Although viewed by many as an extension of the Reduced Instruction Set Computer (RISC) movement of the 1980s, superscalar implementations are in fact heading toward increasing complexity, and superscalar methods have been applied to a spectrum of instruction sets, ranging from the DEC Alpha, the "newest" RISC instruction set, to the decidedly non-RISC Intel x86 instruction set.

A typical superscalar processor fetches and decodes the incoming instruction stream several instructions at a time. As part of the instruction fetching process, the outcomes of conditional branch instructions are usually predicted in advance to ensure an uninterrupted stream of instructions. The incoming instruction stream is then analyzed for data dependences, and instructions are distributed to functional units, often according to instruction type. Next, instructions are initiated for execution in parallel, based primarily on the availability of operand data rather than their original program sequence. This important feature, present in many superscalar implementations, is referred to as dynamic instruction scheduling. Upon completion, instruction results are re-sequenced so that they can be used to update the process state in the correct (original) program order in the event that an interrupt condition occurs. Because individual instructions are the entities being executed in parallel, the parallelism that superscalar processors exploit is referred to as instruction-level parallelism (ILP).
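To make dynamic instruction scheduling concrete, here is a minimal, illustrative C sketch, not any real processor's issue logic: each cycle, a small window of decoded instructions is scanned, any instruction no longer waiting on an older unfinished writer of its source registers is issued regardless of program order, and results are committed strictly in original order. The names (Inst, can_issue, WINDOW) and the one-cycle latency model are invented for this sketch.

    #include <stdbool.h>
    #include <stdio.h>

    #define WINDOW 4   /* instruction window size (illustrative) */

    /* A toy instruction: one destination and two source registers. */
    typedef struct {
        int dst, src1, src2;
        bool issued;   /* sent to a functional unit yet? */
        bool done;     /* result produced yet?           */
    } Inst;

    /* An instruction may issue when it has not issued already and no
     * older, unfinished instruction in the window still has to write
     * one of its sources (a true read-after-write dependence). */
    static bool can_issue(const Inst *w, int i) {
        if (w[i].issued)
            return false;
        for (int j = 0; j < i; j++)
            if (!w[j].done &&
                (w[j].dst == w[i].src1 || w[j].dst == w[i].src2))
                return false;
        return true;
    }

    int main(void) {
        /* i1: r1 = r2 op r3
         * i2: r4 = r1 op r5   (needs i1's result)
         * i3: r6 = r2 op r7   (independent of i1 and i2)
         * i4: r0 = r6 op r4   (needs i3 and i2) */
        Inst w[WINDOW] = {
            {1, 2, 3, false, false},
            {4, 1, 5, false, false},
            {6, 2, 7, false, false},
            {0, 6, 4, false, false},
        };

        int committed = 0, cycle = 0;
        while (committed < WINDOW) {
            cycle++;
            /* Issue every ready instruction, regardless of program order. */
            for (int i = 0; i < WINDOW; i++)
                if (can_issue(w, i)) {
                    w[i].issued = true;
                    printf("cycle %d: issue  i%d\n", cycle, i + 1);
                }
            /* One-cycle latency model: everything issued finishes now. */
            for (int i = 0; i < WINDOW; i++)
                if (w[i].issued)
                    w[i].done = true;
            /* Commit strictly in original program order. */
            while (committed < WINDOW && w[committed].done) {
                printf("cycle %d: commit i%d\n", cycle, committed + 1);
                committed++;
            }
        }
        return 0;
    }

Running it shows i1 and i3 issuing together in cycle 1 even though i2 sits between them in program order, while commits still occur as i1, i2, i3, i4.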

Instruction level parallelism in the form of pipelining has been around for decades. A pipeline acts like an assembly line, with instructions being processed in phases as they pass down the pipeline. With simple pipelining, only one instruction at a time is initiated into the pipeline, but multiple instructions may be in some phase of execution concurrently. Pipelining was initially developed in the late 1950s [8] and became a mainstay of large scale computers during the 1960s. The CDC 6600 [61] used a degree of pipelining, but achieved most of its ILP through parallel functional units. Although it was capable of sustained execution of only a single instruction per cycle, the 6600's instruction set, parallel processing units, and dynamic instruction scheduling are similar to those of the superscalar microprocessors of today. Another remarkable processor of the 1960s was the IBM 360/91 [3]. The 360/91 was heavily pipelined and provided a dynamic instruction issuing mechanism known as Tomasulo's algorithm [63], after its inventor. As with the CDC 6600, the IBM 360/91 could sustain only one instruction per cycle and was not superscalar, but the strong influence of Tomasulo's algorithm is evident in many of today's superscalar processors. The pipeline initiation rate remained at one instruction per cycle for many years and was often perceived to be a serious practical bottleneck. Meanwhile, other avenues for improving performance via parallelism were developed, such as vector processing [28, 49] and multiprocessing [5, 6]. Although some processors capable of multiple instruction initiation were considered during the '60s and '70s [50, 62], none were delivered to the market. Then, in the mid-to-late 1980s, superscalar processors began to appear [21, 43, 54]. By initiating more than one instruction at a time into multiple pipelines, superscalar processors break the single-instruction-per-cycle bottleneck. In the years since its introduction, the superscalar approach has become the standard method for implementing high performance microprocessors.

Because hardware and software evolve, it is rare for a processor architect to start with a clean slate; most processor designs inherit a legacy from their predecessors. Modern superscalar processors are no different. A major component of this legacy is binary compatibility, the ability to execute a machine program written for an earlier generation processor. When the very first computers were developed, each had its own instruction set that reflected specific hardware constraints and design decisions at the time of the instruction set's development. Then, software was developed for each instruction set. It did not take long, however, until it became apparent that there were significant advantages to designing instruction sets that were compatible with previous generations and with different models of the same generation [2]. For a number of very practical reasons, the instruction set architecture, or binary machine language level, was chosen as the level for maintaining software compatibility.

VLIW: Very Long Instruction Word

EPIC: Explicitly Parallel Instruction Computing

In order to fully utilise a superscalar processor of degree m, m instructions must be executable in parallel. This condition may not hold in every clock cycle; when it does not, some of the pipelines stall in a wait state. In a superscalar processor, the simple operation latency should require only one cycle, as in the base scalar processor.

A superscalar implementation requires:
- Simultaneous fetching of multiple instructions
- Logic to determine true dependencies involving register values
- Mechanisms to communicate these values
- Mechanisms to initiate multiple instructions in parallel
- Resources for parallel execution of multiple instructions
- Mechanisms for committing process state in correct order

PowerPC 604: six independent execution units:
- Branch execution unit
- Load/Store unit
- 3 Integer units
- Floating-point unit

The 604 provides in-order issue and register renaming (sketched below).
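The following is a minimal sketch of the register-renaming idea, not the 604's actual mechanism; the rename_table array and the ever-increasing next_phys counter (standing in for a real free list) are invented for illustration. Renaming gives every new destination a fresh physical register, so writes to the same architectural register no longer have to complete in order.

    #include <stdio.h>

    #define NARCH 8    /* architectural registers (illustrative) */

    /* Current architectural -> physical register mapping. */
    static int rename_table[NARCH];
    static int next_phys = NARCH;   /* crude stand-in for a free list */

    /* Rename one instruction "rd = rs1 op rs2": the sources read the
     * current mapping, the destination gets a fresh physical register.
     * A later write to the same architectural register then goes to a
     * different physical register, so WAW and WAR hazards disappear. */
    static void rename_inst(int rd, int rs1, int rs2) {
        int p1 = rename_table[rs1];
        int p2 = rename_table[rs2];
        int pd = next_phys++;
        rename_table[rd] = pd;
        printf("p%d = p%d op p%d\n", pd, p1, p2);
    }

    int main(void) {
        for (int r = 0; r < NARCH; r++)
            rename_table[r] = r;        /* identity mapping at start */

        /* Both of the first two instructions write r1, but after
         * renaming they write different physical registers and may
         * complete in any order. */
        rename_inst(1, 2, 3);  /* r1 = r2 op r3 ->  p8  = p2 op p3 */
        rename_inst(1, 4, 5);  /* r1 = r4 op r5 ->  p9  = p4 op p5 */
        rename_inst(6, 1, 1);  /* r6 = r1 op r1 ->  p10 = p9 op p9 (newest r1) */
        return 0;
    }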

The PowerPC 620 provides, in addition to the 604's features, out-of-order issue.

Pentium: three independent execution units:
- 2 Integer units
- Floating-point unit

The Pentium provides in-order issue.

- Binary code compatibility holds among scalar and superscalar processors of the same family
- The same compiler works for all processors (scalars and superscalars) of the same family
- Assembly programming of VLIWs is tedious
- Code density in VLIWs is very poor, which puts pressure on the instruction encoding schemes

A typical VLIW (very long instruction word) machine has instruction words hundreds of bits in length. Multiple functional units are used concurrently in a VLIW processor. All functional units share the use of a common large register file.
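As a concrete (and entirely hypothetical) picture of such an instruction word, the C struct below packs eight 32-bit operation slots into one 256-bit word, one slot per functional unit, with all slots addressing the shared register file. Field names and widths are invented; no real VLIW machine is being described.

    #include <stdio.h>
    #include <stdint.h>

    /* One operation slot: opcode plus three register fields.
     * Field widths are invented for illustration. */
    typedef struct {
        uint32_t opcode : 8;
        uint32_t dst    : 8;
        uint32_t src1   : 8;
        uint32_t src2   : 8;
    } Op;                      /* 32 bits per slot */

    /* A hypothetical 256-bit VLIW instruction word: one fixed slot per
     * functional unit, all operations in the word issued together, all
     * slots addressing one shared register file. A slot the compiler
     * cannot fill holds an explicit nop opcode. */
    typedef struct {
        Op int_unit0;     /* slot 0: first integer ALU   */
        Op int_unit1;     /* slot 1: second integer ALU  */
        Op fp_unit;       /* slot 2: floating-point unit */
        Op load_store0;   /* slot 3: first memory port   */
        Op load_store1;   /* slot 4: second memory port  */
        Op branch_unit;   /* slot 5: branch unit         */
        Op spare0;        /* slots 6-7: pad to 256 bits  */
        Op spare1;
    } VliwWord;

    int main(void) {
        /* 32 bytes = 256 bits on typical ABIs. */
        printf("one instruction word: %zu bytes\n", sizeof(VliwWord));
        return 0;
    }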

Comparison: RISC vs. CISC vs. VLIW

The compiler prepares fixed packets of multiple operations that give the full "plan of execution":
- dependencies are determined by the compiler and used to schedule according to function unit latencies
- function units are assigned by the compiler and correspond to the position within the instruction packet ("slotting")
- the compiler produces fully-scheduled, hazard-free code => the hardware doesn't have to "rediscover" dependencies or schedule, as the sketch below illustrates
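To illustrate slotting, here is a hypothetical 3-wide schedule laid out as data: slot 0 always feeds the integer unit, slot 1 the load/store unit, slot 2 the floating-point unit, and the compiler has already placed every operation (or an explicit nop) so that nothing has to be checked at run time. The opcode numbers and the 3-slot shape are invented for this sketch.

    #include <stdio.h>

    #define NOP 0   /* explicit no-operation encoding (illustrative) */

    /* One 3-wide packet: slot position == functional unit assignment.
     * slot 0 -> integer ALU, slot 1 -> load/store, slot 2 -> FP unit. */
    typedef struct { int slot[3]; } Packet;

    int main(void) {
        /* A compiler-produced, fully scheduled stream. Opcode numbers
         * are arbitrary stand-ins; what matters is that each operation
         * sits in the slot of the unit that will execute it, and empty
         * slots carry explicit nops. */
        Packet program[] = {
            { { 11, 21, NOP } },   /* int op + load, no FP work    */
            { { 12, NOP, 31 } },   /* int op + FP op               */
            { { NOP, 22, NOP } },  /* only a store fits this cycle */
        };
        int n = sizeof program / sizeof program[0];

        /* The "hardware" loop: no dependence checks, no scheduling.
         * Each cycle, slot i is handed directly to unit i. */
        for (int cycle = 0; cycle < n; cycle++)
            for (int unit = 0; unit < 3; unit++)
                if (program[cycle].slot[unit] != NOP)
                    printf("cycle %d: unit %d runs op %d\n",
                           cycle, unit, program[cycle].slot[unit]);
        return 0;
    }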

Compatibility across implementations is a major problem:
- VLIW code won't run properly on an implementation with a different number of function units or different latencies
- unscheduled events (e.g., a cache miss) stall the entire processor

Code density is another problem:
- slot utilization is low (mostly nops)
- nops can be reduced by compression ("flexible VLIW", "variable-length VLIW"), as sketched below
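One way such compression can work, sketched below with invented details, is to store a small mask saying which slots are occupied and then store only those operations; an empty slot costs one mask bit instead of a full 32-bit nop. Real variable-length VLIW encodings differ, but the idea is similar.

    #include <stdio.h>
    #include <stdint.h>

    /* Compressed packet: a bitmask records which of the 8 slots hold a
     * real operation; only those operations are stored. (In a real
     * encoding the ops array would be variable length; C needs a fixed
     * upper bound here, so this only sketches the idea.) */
    typedef struct {
        uint8_t  slot_mask;   /* bit i set => slot i has a real op  */
        uint32_t ops[8];      /* only the occupied slots are filled */
    } PackedWord;

    /* Expand a compressed packet back to 8 fixed slots, filling the
     * unused ones with an explicit nop encoding (all zeros here). */
    static void unpack(const PackedWord *p, uint32_t slots[8]) {
        int next = 0;
        for (int i = 0; i < 8; i++)
            slots[i] = (p->slot_mask & (1u << i)) ? p->ops[next++] : 0;
    }

    int main(void) {
        /* A packet using only slots 0 and 3: two stored ops, six nops
         * recovered for free from the mask. */
        PackedWord p = { 0x09, { 0xA1, 0xB2 } };
        uint32_t slots[8];
        unpack(&p, slots);
        printf("slot 3 holds op 0x%X\n", (unsigned)slots[3]);  /* 0xB2 */
        return 0;
    }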

One of the great debates in computer architecture is static vs. dynamic. "Static" typically means "let's make our compiler take care of this", while "dynamic" typically means "let's build some hardware that takes care of this". Each side has its advantages and disadvantages. The compiler approach has the benefit of time: a compiler can spend all day analyzing the heck out of a piece of code. However, the conclusions that a compiler can reach are limited, because it doesn't know what the values of all the variables will be when the program is actually run. As you can imagine, if we go for the hardware approach, we get the other end of the stick. There is a limit on the amount of analysis we can do in hardware, because our resources are much more limited. On the other hand, we can analyze the program when it actually runs, so we have complete knowledge of all the program's variables. VLIW approaches typically fall under the "static" category, where the compiler does all the work. Superscalar approaches typically fall under the "dynamic" category, where special hardware on the processor does all the work. Consider the following code sequence:

    sw $7, 4($2)
    lw $1, 8($5)

Suppose we have a dual pipeline where we can run two memory operations in parallel (but only if they have no dependencies, of course). Are there dependencies between these two instructions? Well, it depends on the values of $5 and $2. If $5 is 0 and $2 is 4, then both instructions touch address 8, so they depend on each other: we must run the store before the load. In a VLIW approach, our compiler decides which instructions are safe to run in parallel. There's no way our compiler can tell for sure whether there is a dependence here, so we must stay on the safe side and dictate that the store must always run before the load. If this were a bigger piece of code, we could analyze the code and try to build a proof that shows there is no dependence. If we decide on a superscalar approach, we have a piece of hardware on our processor that decides whether we can run instructions in parallel. The problem is easier, because this dependence check happens in hardware as the code is run, so we will know what the values of $2 and $5 are. This means that we will always know whether it is safe to run these two instructions in parallel. Hopefully you see some of the tradeoffs involved. Dynamic approaches have more program information available to them, but the amount of resources available for analysis is very limited; for example, if we want our superscalar processor to search the code for independent instructions, things start to get really hairy. Static approaches have less program information available to them, but they can spend lots of resources on analysis; for example, it's relatively easy for a compiler to search the code for independent instructions.
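To make the contrast concrete, here is a minimal C sketch of the runtime check the superscalar hardware can perform and the compiler cannot: once the base registers are known, the two effective addresses can simply be compared. The function name and the use of exact equality (rather than a full byte-range overlap test) are simplifications invented for this example.

    #include <stdbool.h>
    #include <stdio.h>

    /* The runtime check for the sw/lw pair above: once $2 and $5 are
     * known, just compare the effective addresses. A real unit would
     * compare the full byte ranges of the two accesses; equality is
     * enough for this word-aligned example. */
    static bool may_run_in_parallel(long r2, long r5) {
        long store_addr = r2 + 4;   /* sw $7, 4($2) writes here */
        long load_addr  = r5 + 8;   /* lw $1, 8($5) reads here  */
        return store_addr != load_addr;
    }

    int main(void) {
        /* The dependent case from the text: $2 = 4, $5 = 0 (both hit 8). */
        printf("$2=4,   $5=0   -> parallel? %d\n", may_run_in_parallel(4, 0));
        /* A typical independent case. */
        printf("$2=100, $5=200 -> parallel? %d\n", may_run_in_parallel(100, 200));
        return 0;
    }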
