
Collision Analysis

Assume we could implement an on-chip cache and get the cache access time down to 1 clock, but implement it as a unified cache. Our new pipeline is:

Instruction Fetch -- 5 ns
Instruction Decode -- 5 ns
Address Generate -- 5 ns
Operand Fetch -- 5 ns
Execute -- 5 ns
Operand Store -- 5 ns
Update Program Counter -- 5 ns

Our new reservation table is:

Clock        1  2  3  4  5  6  7
Memory Op    X        X     X
Inst Dec.       X
Addr Gen           X
Execute               X
Update PC                      X

And the serial execution time is 7 x 5 ns = 35 ns. How often can we initiate an instruction with this configuration?


The Collision Vector

As the pipeline becomes more complicated, we can use a collision vector to analyze the pipeline and control initiation of execution. The collision vector is a method of analyzing how often we can initiate a new operation into the pipeline and maintain synchronous flow without collisions. We construct the collision vector by overlaying two copies of the reservation table, successively shifting one copy one clock to the right, and recording whether or not a collision occurs at each shift. If a collision occurs, record a 1 bit; if a collision does not occur, record a 0 bit. For example, our reservation table would result in the following collision vector:

Collision vector = 011010
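As a minimal sketch of this construction (our own illustration in Python, not part of the original notes), the reservation table above can be encoded as a mapping from each stage to the clock periods it occupies, and the overlay-and-shift rule follows directly:

    # Reservation table for the unified-cache pipeline above: each stage
    # maps to the set of clock periods in which it is busy.
    reservation = {
        "Memory Op": {1, 4, 6},   # instruction fetch, operand fetch, operand store
        "Inst Dec.": {2},
        "Addr Gen":  {3},
        "Execute":   {5},
        "Update PC": {7},
    }

    def collision_vector(table, length):
        # Bit i (1-based, leftmost) is 1 if initiating a new operation
        # i clocks after the previous one collides in some stage.
        bits = []
        for shift in range(1, length):          # shifts 1 .. length-1
            collide = any(cycle + shift in cycles
                          for cycles in table.values()
                          for cycle in cycles)
            bits.append("1" if collide else "0")
        return "".join(bits)

    print(collision_vector(reservation, 7))     # -> 011010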

Using the collision vector, we construct a reduced state diagram to tell us when we can initiate new operations.

The Reduced State Diagram

The reduced state diagram is a way to determine when we can initiate a new operation into the pipeline and avoid collisions when some operations are already in process in the pipeline.


Steps to create the reduced state diagram:

1. Shift the collision vector left one position, filling in a 0 at the right end.
2. If the left-most bit shifted out is a 1, you cannot initiate a new operation into the pipeline. If the left-most bit shifted out is a 0, you can initiate a new operation.
3. Create a new state with a collision vector that is the shifted collision vector ORed with the original pipeline collision vector.
4. Draw an arc to the new collision vector and label it with the number of shifts from the previous vector.

Following is the resulting reduced state diagram:

    011010  --1-->  111110
    011010  --4-->  111010
    011010  --6-->  011010  (back to the initial state)
    111110  --6-->  011010
    111010  --4-->  111010  (repeat initiation every 4 clocks)
    111010  --6-->  011010

(Each arc is labeled with the number of clocks between initiations.)

Note: Some texts reverse this notation, build the collision vector from right to left, and shift the vector right in order to determine when to initiate a new operation.
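The shift-and-OR rule translates directly into a short program. The following sketch (ours, not from the notes) builds the complete reduced state diagram for any initial collision vector:

    def successors(state, initial_cv):
        # Yield (shift, next_state) for every permitted initiation: a shift
        # is permitted when the bit shifted out is 0, and the next state is
        # the shifted vector ORed with the initial collision vector.
        n = len(state)
        for shift in range(1, n + 1):
            if state[shift - 1] == "0":
                shifted = state[shift:] + "0" * shift
                nxt = "".join("1" if a == "1" or b == "1" else "0"
                              for a, b in zip(shifted, initial_cv))
                yield shift, nxt

    def build_diagram(initial_cv):
        diagram, todo = {}, [initial_cv]
        while todo:
            state = todo.pop()
            if state not in diagram:
                diagram[state] = dict(successors(state, initial_cv))
                todo.extend(diagram[state].values())
        return diagram

    for state, arcs in build_diagram("011010").items():
        print(state, arcs)
    # 011010 {1: '111110', 4: '111010', 6: '011010'}
    # 111010 {4: '111010', 6: '011010'}
    # 111110 {6: '011010'}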


The reduced state diagram tells us that we can initiate a new operation into the pipeline one cycle after we initiated one in an empty pipe. However, this brings us to a state (111110) where we cannot safely initiate another operation for 6 more clock periods.

Since we can initiate a second instruction on the next clock period but must then wait six clock periods before we can initiate another, we can initiate only two instructions every seven clock periods. That is 2/7 of the ideal one-instruction-per-clock rate (which would be a speedup of 7), so our speedup is only 2 for the seven-stage pipeline.

An alternative would be to wait for 4 cycles after the initial initiation, and then initiate a new operation every 4 cycles. But this would give us a speedup of only 7 x 0.25 = 1.75.


Improving the speedup

One way to improve this situation is to insert delays at appropriate points in the pipeline. Stone goes to great lengths to analyze where to insert the delays. As an example, if we add a delay in the pipeline after the Execute stage, we get:

Clock        1  2  3  4  5  6  7  8
Memory Op    X        X        X
Inst Dec.       X
Addr Gen           X
Execute               X
Delay                    X
Update PC                         X

And our new collision vector is:

Collision vector = 0010010
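Reusing the collision_vector sketch from the beginning of this section, the delay stage is just one more row in the table, with Operand Store and Update PC each slipping one clock:

    # Hypothetical encoding of the delayed pipeline (delay after Execute).
    delayed = {
        "Memory Op": {1, 4, 7},   # operand store moves from clock 6 to 7
        "Inst Dec.": {2},
        "Addr Gen":  {3},
        "Execute":   {5},
        "Delay":     {6},
        "Update PC": {8},
    }
    print(collision_vector(delayed, 8))   # -> 0010010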


The new reduced state diagram follows.

Initial state: 0010010. Arcs are labeled with the number of clocks until the next initiation:

    0010010:  1 -> 0110110   2 -> 1011010   4 -> 0110010   5 -> 1010010
    0110110:  1 -> 1111110   4 -> 1110010
    1011010:  2 -> 1111010   5 -> 1010010
    0110010:  1 -> 1110110   4 -> 0110010   5 -> 1010010
    1010010:  2 -> 1011010   4 -> 0110010   5 -> 1010010
    1110010:  4 -> 0110010   5 -> 1010010
    1111010:  5 -> 1010010
    1110110:  4 -> 1110010
    1111110:  (only the return arc to the initial state)
Note that all states have an arc back to the beginning state with 7 clocks, in addition to those noted. We can now look for movements from state to state that would improve our pipeline speedup. If we took the greedy cycle (1, 1, 7), we could initiate 3 operations out of every 9 cycles, for a speedup of (3/9) x 7 = 2.33. However, if we did not take the first possible initiation and waited for 2 cycles, we would get into the (2, 5, 2) cycle and likewise initiate an operation 3 times out of every 9 cycles. There appears to be one other 3-out-of-9 cycle, but none better.
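The greedy strategy is easy to check mechanically. This sketch (again ours, reusing build_diagram from the earlier sketch) always takes the smallest permitted shift and reports the cycle it settles into:

    def greedy_cycle(initial_cv):
        diagram = build_diagram(initial_cv)
        state, visited, shifts = initial_cv, [], []
        while state not in visited:
            visited.append(state)
            shift = min(diagram[state])        # greedy: earliest permitted initiation
            shifts.append(shift)
            state = diagram[state][shift]
        start = visited.index(state)           # where the loop closes on itself
        cycle = shifts[start:]
        return cycle, len(cycle) / sum(cycle)  # initiations per clock

    print(greedy_cycle("0010010"))   # -> ([1, 1, 7], 0.333...): 3 every 9 clocks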


Other Pipeline Hazards

Pipeline collisions occur when there is contention for shared hardware that is needed by more than one stage of a pipeline. Potential collisions prevent us from initiating (and thus completing) a new operation every clock period, and so slow down the effective execution rate of a processor. Other hazards that can prevent us from completing an instruction every clock period are:

Conditional branches
Data dependencies

Conditional Branches (Jumps)

A conditional branch changes the location where we are fetching instructions. A conditional branch instruction must execute before we know which location to fetch subsequent instructions from.

Example instruction stream:

        -----          ; Instruction
        -----          ; Instruction
        -----          ; Instruction
        Cmp A, B       ; Compare A to B
        BE NewLoc      ; Branch on condition code = 0 to NewLoc
        -----          ; Next Sequential Instruction (NSI)
        -----          ; Instruction
          ...
NewLoc  -----          ; Instruction
        -----          ; Instruction


Reservation Table Analysis

Assume we have the following reservation table, in which each stage takes two clock periods, so a new instruction can be initiated every two clocks:

Clock        1    2    3    4    5    6    7    8    9    10   11   12
Inst Fetch   X    X
Inst Dec.              X    X
Addr Gen                    X    X
Data Fetch                            X    X
Execute                                         X    X
Op Store                                                  X    X

We can show successive instruction execution through the pipeline by indicating the instruction in each cell. Here, I will use:

CC  to indicate the instruction that sets the condition code.
BR  to indicate the branch condition instruction.
NSI to indicate the next sequential instruction after the branch.
2SI to indicate the 2nd sequential instruction after the branch, etc.
BT  to indicate the branch target instruction.

Following would be the instruction sequence for a branch not taken:

Clock        1    2    3    4    5    6    7    8    9    10   11   12
Inst Fetch   CC   CC   BR   BR   NSI  NSI  2SI  2SI  3SI  3SI  4SI  4SI
Inst Dec.              CC   CC   BR   BR   NSI  NSI  2SI  2SI  3SI  3SI
Addr Gen                    CC   CC   BR   BR   NSI  NSI  2SI  2SI
Data Fetch                            CC   CC   BR   BR   NSI  NSI
Execute                                         CC   CC   BR   BR
Op Store                                                  CC   CC


Following would be the instruction sequence for a branch taken:

Clock        1    2    3    4    5    6    7    8    9    10   11   12
Inst Fetch   CC   CC   BR   BR   NSI  NSI  2SI  2SI  3SI  3SI  BT   BT
Inst Dec.              CC   CC   BR   BR   NSI  NSI  2SI  2SI  wait wait
Addr Gen                    CC   CC   BR   BR   NSI  NSI  wait wait
Data Fetch                            CC   CC   BR   BR   wait wait
Execute                                         CC   CC   BR   BR
Op Store                                                  CC   CC

We have taken a penalty of 6 clock cycles because we assumed that we were going to be executing sequential instructions. We started these instructions into the pipeline, only to find that we had to abort executing them because the conditional branch was taken. The assumption here is that we know the outcome of the branch instruction at the end of its execute cycle, and so we can stop further execution of sequential instructions following the branch. The new program counter gets sent to the Instruction Fetch unit during the Operand Store cycle of the branch instruction, so it can begin to fetch the branch target instruction and succeeding instructions on the next cycle.

Reducing Branch Penalties

We can use several methods to reduce the effects of branching:

Delayed branch instruction
Multiple condition codes (discussed with data dependencies)
Branch prediction, with and without branch history
Speculative execution


Delayed Branch Instruction

We can push some of the problem back on the programmer (or compiler) by designing a new branch instruction that telegraphs an intent to branch: Branch Condition after executing the Next Sequential Instruction. The instruction sequence for a branch not taken, using this new branch instruction (BA), is identical to a normal branch:

Clock        1    2    3    4    5    6    7    8    9    10   11   12
Inst Fetch   CC   CC   BA   BA   NSI  NSI  2SI  2SI  3SI  3SI  4SI  4SI
Inst Dec.              CC   CC   BA   BA   NSI  NSI  2SI  2SI  3SI  3SI
Addr Gen                    CC   CC   BA   BA   NSI  NSI  2SI  2SI
Data Fetch                            CC   CC   BA   BA   NSI  NSI
Execute                                         CC   CC   BA   BA
Op Store                                                  CC   CC

However, the instruction sequence for a branch taken, using the new branch-condition-after-next-instruction, would save us two clocks:

Clock        1    2    3    4    5    6    7    8    9    10   11   12
Inst Fetch   CC   CC   BA   BA   NSI  NSI  2SI  2SI  3SI  3SI  BT   BT
Inst Dec.              CC   CC   BA   BA   NSI  NSI  2SI  2SI  wait wait
Addr Gen                    CC   CC   BA   BA   NSI  NSI  wait wait
Data Fetch                            CC   CC   BA   BA   NSI  NSI
Execute                                         CC   CC   BA   BA
Op Store                                                  CC   CC

Our penalty is now only 4 clock cycles instead of 6, because we followed through and completed execution of NSI (per definition of the delayed branch instruction). We had to abort executing only 2SI and beyond as a result of the conditional branch taken.


Branch Prediction

We can make a better guess about whether or not a branch will be taken, rather than just always assuming it will not be taken:

Assume that a special end-of-loop branch instruction is usually taken.
Assume that a branch to a location earlier in the code will usually be taken.
Keep a history table of how this particular branch instruction behaved in the recent past.

Some processors define special instructions to be used to terminate a loop. For example, BXLE (branch on index low or equal) combines decrementing an index register with a branch on condition. The processor can safely assume that whenever it fetches a BXLE instruction, the branch will normally be taken. This can be determined back at the Instruction Decode step. Note that the unconditional branch is a special case of this, in that it will always be taken. A conditional branch to an earlier address can be detected at the Address Generate stage.

However, note that we are making an educated guess. Even when we guess correctly, we are taking some penalty. The instruction sequence for a branch taken, when we predict (at Instruction Decode) that it will be taken:

Clock        1    2    3    4    5    6    7    8    9    10   11   12
Inst Fetch   CC   CC   BR   BR   NSI  NSI  BT   BT   NT   NT   2T   2T
Inst Dec.              CC   CC   BR   BR   wait wait BT   BT   NT   NT
Addr Gen                    CC   CC   BR   BR   wait wait BT   BT
Data Fetch                            CC   CC   BR   BR   wait wait
Execute                                         CC   CC   BR   BR
Op Store                                                  CC   CC

(BT is the branch target instruction; NT and 2T are the first and second sequential instructions after the target.)
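These static heuristics amount to a few lines of logic. A sketch (ours; the mnemonics BXLE and B are illustrative, not a particular instruction set):

    def predict_taken(opcode, pc, target):
        # Static branch prediction per the heuristics above.
        if opcode in ("BXLE", "B"):   # loop-closing or unconditional: assume taken
            return True
        if target is not None and target < pc:
            return True               # backward branch: probably a loop, assume taken
        return False                  # forward conditional: assume not taken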


Branch History

Rather than depending on special instructions and branch target locations, we can keep a history of how this particular branch instruction behaved in the recent past, and assume that it will continue to behave that way in the future. Some implementations are:

The branch-history table (Stone page 196): The instruction fetch unit searches a branch-history table (BHT), similar to a TLB, on every instruction fetch. If we have a hit, use the corresponding address in the BHT for the next fetch instead of the real NSI. At the execute stage of the branch, update the BHT with the actual target (NSI or BT). Of course, we need to keep track of which way we predicted, and abort instructions on mispredictions.

The decode-history table (similar to Stone page 196): The instruction decode unit searches a decode-history table (DHT) when it encounters a conditional branch instruction. If we have a hit, redirect the instruction fetch unit to abort NSI and give it the BT address from the branch instruction. At the execute stage of the branch, add (or keep) a DHT entry for this branch when the branch is taken, and delete the DHT entry (if it exists) when the branch is not taken. Note that we always abort the prefetch of NSI on predicted branches taken.
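A branch-history table reduces to a small lookup structure. The sketch below is ours (not Stone's implementation); it keys the table by the branch's fetch address and stores the predicted next fetch address:

    class BranchHistoryTable:
        def __init__(self):
            self.table = {}                    # fetch address -> predicted next PC

        def predict(self, pc, nsi):
            # Consulted on every instruction fetch; fall through to NSI on a miss.
            return self.table.get(pc, nsi)

        def update(self, pc, taken, target, nsi):
            # Called at the branch's execute stage with the actual outcome.
            self.table[pc] = target if taken else nsi

    bht = BranchHistoryTable()
    next_pc = bht.predict(pc=0x1000, nsi=0x1004)   # miss: predict fall-through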


Extra Bits in the Instruction Cache

For a processor with a fixed-length instruction set and a Harvard cache, we can organize the instruction cache such that we add an extra bit or two to each instruction (in the cache) and use them to keep a history on branch instructions. This works the same as the decode-history table, without the time and logic for the lookup.
Instruction cache             Legend (branch indicator bits)
00  Other instruction         00  Strongly not taken
00  Other instruction         01  Weakly not taken
01  Branch                    10  Weakly taken
00  Other instruction         11  Strongly taken
00  Other instruction
11  Branch
00  Other instruction
00  Branch
00  Other instruction
10  Branch

When a cache line is loaded from main memory, all branch indicator bits (BIB) for the line are set to 00.
When a branch is taken, increment the corresponding BIB.
When a branch is not taken, decrement the corresponding BIB.
When the instruction is fetched: use NSI if the BIB is 00 or 01; use BT if the BIB is 10 or 11.
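The BIB update rule is a 2-bit saturating counter. A sketch of one counter (our illustration, not from the notes):

    class TwoBitPredictor:
        def __init__(self):
            self.bib = 0b00               # all BIBs start at 00 on a line load

        def predict_taken(self):
            return self.bib >= 0b10       # use BT if the BIB is 10 or 11

        def record(self, taken):
            if taken:
                self.bib = min(self.bib + 1, 0b11)   # saturating increment
            else:
                self.bib = max(self.bib - 1, 0b00)   # saturating decrement

    p = TwoBitPredictor()
    p.record(True)
    p.record(True)                 # taken twice: 00 -> 01 -> 10
    print(p.predict_taken())       # -> True (weakly taken)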


Speculative Execution

The brute-force approach: provide enough logic in the processor to:

Replicate the first several stages of the pipeline.
Always follow both paths of execution (branch taken and branch not taken).
When the outcome of the branch is known, discard the intermediate results of the wrong path(s) and continue execution with the correct path.

For deep pipelines, the processor must be prepared to follow several paths in order to keep things moving along.

Stone (page 197) says that these mechanisms had not been widely used in practice as of 1986. In fact, they have since become very popular as a way to speed up execution of modern processors. Note: some literature defines speculative execution to mean performing any processing steps before you know the outcome of a conditional branch. That is, if there is any chance that you may need to discard the intermediate results of an instruction, it is defined as speculative execution. We will not use this definition.


Data Dependencies

An instruction may be stalled in the pipeline because it needs data that has not yet been produced by a prior instruction that is still in the pipeline. The data dependencies among instructions can take the following forms:

READ/READ -- one instruction reads a data item and a following instruction reads the same data item.
READ/WRITE -- one instruction reads a data item and a following instruction writes that same data item.
WRITE/READ -- one instruction writes a data item and a following instruction reads that same data item.
WRITE/WRITE -- one instruction writes a data item and a following instruction writes that same data item.

The READ/READ combination is not a problem with pipelines because the data item does not change. However, the other three combinations can all produce invalid results unless we detect and interlock on them. We deal first with the WRITE/READ combination, and defer the others to a later discussion on superpipelined machines.

WRITE/READ

Consider the following sequence of instructions:

-----            ; Instruction
-----            ; Instruction
R2 <- R3 + R4    ; Store Register 2
R5 <- R2 + R4    ; Use Register 2
-----            ; Instruction


The reservation table, where S2 is the instruction that stores a new value into register 2, and U2 is the instruction that uses the new value in register 2:

Clock        1    2    3    4    5    6    7    8    9    10   11   12
Inst Fetch   S2   S2   U2   U2   NSI  NSI  2SI  2SI  2SI  2SI  3SI  3SI
Inst Dec.              S2   S2   U2   U2   NSI  NSI  NSI  NSI  2SI  2SI
Addr Gen                    S2   S2   U2   U2   U2   U2   NSI  NSI
Data Fetch                            S2   S2   wait wait U2   U2
Execute                                         S2   S2
Op Store                                                  S2   S2

The data fetch unit must detect that the value in register 2 that it needs is pending update from a prior instruction that has not yet completed. It must wait until the new value has been stored into register 2 by the Operand Store unit. The penalty is the 2 cycles we had to stall the pipeline.

Internal Forwarding and Register Renaming

A way to reduce the penalty due to data dependencies is to forward the results of a computation directly to the data fetch unit or to the execute unit, and not wait for the data to be stored into the proper register. If we forward the results of the addition in instruction S2 to the data fetch unit, we reduce the data interlock penalty to one cycle. If we forward the results directly to the execute unit, we can eliminate the penalty altogether. The data is really available when we need it; it is just not in the right place. We rename the input register for the next operation from register R2 to the register where the computation results will appear. Note that the Operand Store unit still needs to put the results into register 2 as well.
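A toy sketch of the forwarding idea (ours; a real bypass network is wiring, not a lookup): results are published as soon as the execute stage produces them, so a dependent instruction can read the value before the Operand Store unit writes it back.

    class ForwardingNetwork:
        def __init__(self):
            self.pending = {}     # destination register -> value still in flight

        def publish(self, reg, value):
            # The execute stage announces a result before write-back.
            self.pending[reg] = value

        def read(self, reg, register_file):
            # Prefer a forwarded value over the stale register file.
            return self.pending.get(reg, register_file[reg])

        def retire(self, reg):
            # Operand Store has written the register; forwarding is done.
            self.pending.pop(reg, None)

    regs = {"R2": 0, "R3": 3, "R4": 4}
    net = ForwardingNetwork()
    net.publish("R2", regs["R3"] + regs["R4"])   # S2 in Execute: R2 <- R3 + R4
    print(net.read("R2", regs) + regs["R4"])     # U2 reads R2 early: prints 11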


The new reservation table if we forward the results to the data fetch unit:

Clock        1    2    3    4    5    6    7    8    9    10   11   12
Inst Fetch   S2   S2   U2   U2   NSI  NSI  2SI  2SI  2SI  3SI  3SI  4SI
Inst Dec.              S2   S2   U2   U2   NSI  NSI  NSI  2SI  2SI  3SI
Addr Gen                    S2   S2   U2   U2   U2   NSI  NSI  2SI
Data Fetch                            S2   S2   wait U2   U2   NSI
Execute                                         S2   S2        U2
Op Store                                                  S2   S2

The reservation table if we forward the results directly to the Execute unit:

Clock        1    2    3    4    5    6    7    8    9    10   11   12
Inst Fetch   S2   S2   U2   U2   NSI  NSI  2SI  2SI  3SI  3SI  4SI  4SI
Inst Dec.              S2   S2   U2   U2   NSI  NSI  2SI  2SI  3SI  3SI
Addr Gen                    S2   S2   U2   U2   NSI  NSI  2SI  2SI
Data Fetch                            S2   S2   U2   U2   NSI  NSI
Execute                                         S2   S2   U2   U2
Op Store                                                  S2   S2

The Condition Code Dependency

Another type of data dependency is that between an instruction that generates a condition code setting and the branch instruction that uses the condition code. Internal forwarding can again be used to reduce or eliminate delays. Another variant of the branch-after-NSI approach, called multiple condition codes, puts the problem back on the programmer (or compiler). Multiple condition codes make it easier for the programmer to have intervening instructions between the instruction that generates the CC and the branch instruction that uses it.


Superscalar Architectures

Up to now, we have been discussing computer architectures with a single pipeline for processing instructions. The objective was to complete one instruction per clock period by breaking the instructions into (approximately) equal pieces of work, and pipelining them through the process in a serial fashion. However, all of the hazards prevent us from ever achieving a processing rate of 1 instruction per clock. Given the circuit density we have today, we can replicate many of the pipeline units and process instructions in parallel, so long as we ensure that we produce results that are indistinguishable from those obtained if we executed the code in a strictly sequential fashion.
[Figure: a superscalar organization. A stream of instructions feeds the I-cache, which dispatches instructions to four parallel pipelines of Decode, Op Fetch, Execute, and Store Results stages; three of the pipelines have fixed-point execute units and the fourth has a floating-point execute unit.]

This brings us back to data dependencies. We must now consider the READ/WRITE and WRITE/WRITE sequences, because one instruction may get ahead of another through the parallel pipelines.


Consider the following sequence of instructions:

-----            ; Instruction
-----            ; Instruction
R3 <- R2 + R4    ; Use Register 2
R2 <- R5 + R4    ; Store Register 2
-----            ; Instruction

We must interlock on register 2 to ensure that the new value (R5 + R4) does not get stored into it before we obtain the old value to add to R4.
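Detecting such conflicts is a set intersection over the registers each instruction reads and writes. A sketch (ours), using the two instructions above:

    def conflicts(older, younger):
        # Registers on which the younger instruction must interlock.
        write_read  = older["writes"] & younger["reads"]    # WRITE/READ
        read_write  = older["reads"]  & younger["writes"]   # READ/WRITE
        write_write = older["writes"] & younger["writes"]   # WRITE/WRITE
        return write_read | read_write | write_write

    use_r2   = {"reads": {"R2", "R4"}, "writes": {"R3"}}    # R3 <- R2 + R4
    store_r2 = {"reads": {"R5", "R4"}, "writes": {"R2"}}    # R2 <- R5 + R4
    print(conflicts(use_r2, store_r2))    # -> {'R2'}: interlock on register 2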

And the following sequence of instructions:

-----            ; Instruction
-----            ; Instruction
Cmp A, B         ; Compare A to B
R2 <- R3 + R4    ; Store Register 2
BE NewLoc        ; Possible branch
R2 <- R5 + R4    ; Store Register 2
-----            ; Instruction

We must ensure that the second value of R2 gets stored if the branch is not taken.


Extra Internal Registers

When we have multiple pipelines and speculative execution in the processor, it is beneficial to have several extra sets of registers to keep intermediate results:

Several paths are being followed due to speculative execution.
Parallel execution is proceeding along each serial path.
Many intermediate results are being forwarded to other instructions.
Many tentative final results must be held until the final outcome is known.

Retiring Instructions

When the final outcome of a series of branches and data dependencies is known, the winning instruction is retired. Its tentative results are marked final. Any data in a renamed register is stored into the real named register. All other tentative instructions and results (the losers) are discarded, and any resources they held are made available for processing new instructions. Only the retired instructions count toward the processing rate (the MIPS) of the processor. The objective of the computer architect is to retire more than one instruction per clock period.


CISC versus RISC (Stone page 210)

CISC -- Complex Instruction Set Computer
RISC -- Reduced Instruction Set Computer

CISC Architectures

Traditional processor architectures (e.g., the IBM S/360 and Intel 8086) use variable-length instructions and provide variations on basic instructions with several addressing modes. 8086 example:

Instructions can vary in length from 1 to 12 bytes.
There are 14 variations of the integer ADD instruction.
There are 14 variations of the integer ADD with Carry instruction.
There are 14 variations of the integer SUB (subtract) instruction.
There are about 100 different instructions.
There are four different prefixes that can modify instructions.

This gives a lot of flexibility to the programmer and compiler writer, but causes many problems for the computer architect.


RISC Architectures

RISC architectures attempted to make life easy on the computer architect by drastically simplifying the instruction set. John Cocke (IBM) reasoned that only compilers generate machine code, and so making life easier for the assembly language programmer should not be an objective. For example:

Make all instructions four bytes long and aligned on a word boundary.
Provide many general-purpose registers so that most intermediate data can be held in fast processor storage.
Make all arithmetic instructions use register-to-register addressing only.
Make all instructions execute in a single clock.
Add instructions to help the CPU architect make a fast processor.

Over time, the CISC architectures have adopted RISC techniques and the RISC architectures have added CISC instructions. Today, the only real difference between the two is that CISC processors still have variable-length instructions and RISC processors have fixed-length instructions.


Superpipelined Architecture (Stone page 218)

In the discussion on superscalar architectures, Stone describes a superpipelined architecture as one where the internal clock for issuing instructions is N times faster than the main clock. Virtually all processors today are superpipelined: the internal clock runs faster than the external bus clock.

VLIW -- Very Long Instruction Word Architecture (Stone page 219)

VLIW is typically called microcode, and the machine architectures are not general-purpose. They may be used in graphics processors, hard disk controllers, or other dedicated-function units. The advantage of a VLIW architecture is that the fields of the instruction directly control the hardware latches and gates, and thus can directly perform multiple functions in parallel. Normally, engineers program the microcontrollers, and the programs are relatively short. VLIW microcontrollers were formerly used to implement the complex instructions of CISC architecture machines.

1997, 1999 E.F. Gehringer, G.Q. Kenney. CSC 506, Summer 1999.