
CMPE 382 / ECE 510

Computer Organization & Architecture

Chapter 2 – Instruction Level Parallelism


based on text:
Computer Architecture: A Quantitative Approach
John L. Hennessy and David A. Patterson
Morgan Kaufmann, 4th edition, 2006

Many lecture slides are courtesy of or based on the work of


Drs. Asanovic, Patterson, Culler and Amaral
When is a Register not a Register?
• Results may be forwarded around a pipeline architecture
and may never be written back to the register file (e.g. if
written over by a different result)
• We can replace registers and memory with queues
(FIFOs)
– “Queue Machines”
– too expensive, cost(FIFO) >> cost(register)
– power hungry
• Need to challenge your perception of a register if the rest
of this chapter is going to make sense

2
Goal of this Chapter
Make Superscalar Processors Work
Evolution
1. Detect hazards in hardware and enforce pipeline stalls
• basic correctness
2. Out of order completion
3. Scoreboard – out of order execution
4. Superscalar – multiple instruction issue using available functional
units in parallel, hardware hazard check
5. Wider superscalar – add more functional units
6. Register renaming
7. Speculative execution - branch prediction and damage control
8. Giving up on further ILP (multithreaded, multicore processors)
Make CPI < 1, or its reciprocal IPC > 1

3
Remember
[H&P p.177]
“The goal of both our software and hardware techniques is
to exploit parallelism by preserving program order only
where it affects the outcome of the program. Detecting
and avoiding hazards ensures that necessary program
order is preserved.” (sequential consistency)

4
Recall from Pipelining

• Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls
– Ideal pipeline CPI: measure of the maximum performance
attainable by the implementation
– Structural hazards: HW cannot support this combination of
instructions
– Data hazards: Instruction depends on result of prior instruction still
in the pipeline
– Control hazards: Caused by delay between the fetching of
instructions and decisions about changes in control flow (branches
and jumps)
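The CPI decomposition above can be written down directly; a minimal sketch with illustrative (made-up) stall rates, not measured numbers:

```python
def pipeline_cpi(ideal_cpi, structural, data, control):
    """Pipeline CPI = ideal pipeline CPI plus the average stalls per
    instruction from structural, data, and control hazards."""
    return ideal_cpi + structural + data + control

# Illustrative numbers only: ideal CPI of 1.0 with modest stall rates.
print(pipeline_cpi(ideal_cpi=1.0, structural=0.05, data=0.25, control=0.15))  # about 1.45
```

Each technique in the table a few slides ahead attacks one of these terms.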

5
Recall Simple Data Hazard Resolution:
In-order issue, in-order
completion
[Pipeline diagram (time in clock cycles); each instruction flows through Ifetch, Reg, ALU, DMem, Reg:
  lw  r1, 0(r2)   proceeds without stalls
  sub r4, r1, r6  stalls (bubble) waiting for the lw result
  and r6, r2, r7  delayed behind the stalled sub
  or  r8, r2, r9  delayed likewise]
Extend to Multiple instruction issue?
What if load had longer delay? Can AND issue?
6
In-Order Issue, Out-of-order Completion

[Pipeline diagram: an instruction with a two-cycle memory stage (Ifetch, Reg, ALU, DMem, DMem’, Reg) alongside a short Add (Ifetch, Reg, Add, Reg); the Add can complete before the earlier instruction.]

•Which hazards are present? RAW? WAR? WAW?


load r3 <- r1, r2
add r1 <- r5, r2
sub r3 <- r3, r1 / r3 <- r2, r1
• Register Reservations
– when issuing, mark the destination register busy until the instruction completes
– check all register reservations before issue

7
Ideas to Reduce Stalls
 
Technique                                   Reduces
Dynamic ILP:
  Dynamic scheduling                        Data hazard stalls
  Dynamic branch prediction                 Control stalls
  Issuing multiple instructions per cycle   Ideal CPI
  Speculation                               Data and control stalls
  Dynamic memory disambiguation             Data hazard stalls involving memory
Static ILP:
  Loop unrolling                            Control hazard stalls
  Basic compiler pipeline scheduling        Data hazard stalls
  Compiler dependence analysis              Ideal CPI and data hazard stalls
  Software pipelining and trace scheduling  Ideal CPI and data hazard stalls
  Compiler speculation                      Ideal CPI, data and control stalls

8
Instruction-Level Parallelism (ILP)
• Basic Block (BB) ILP is quite small
– BB: a straight-line code sequence with no branches in except to the
entry and no branches out except at the exit
– average dynamic branch frequency 15% to 25%
=> 4 to 7 instructions execute between a pair of branches
– Plus instructions in BB likely to depend on each other
• To obtain substantial performance enhancements, we
must exploit ILP across multiple basic blocks
• Simplest: loop-level parallelism to exploit parallelism
among iterations of a loop
– Vector is one way
– If not vector, then either dynamic via branch prediction or static via
loop unrolling by compiler

9
Review Data Dependence and Hazards
Data Dependence Potential Hardware Hazard

True Dependence RAW read after write


b=a; c=b;
Anti-Dependence WAR write after read
b=a; a=c;
Output Dependence WAW write after write
b=a; b=c;
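The three dependence/hazard pairs above can be detected mechanically by comparing the destination and source registers of an earlier and a later instruction. A minimal sketch (the instruction encoding as a `(dest, sources)` tuple is an assumption for illustration):

```python
def hazards(earlier, later):
    """Classify the potential hazards between two instructions,
    each given as (dest, [sources]). Returns a set of hazard names."""
    d1, srcs1 = earlier
    d2, srcs2 = later
    found = set()
    if d1 in srcs2:   # later reads what earlier writes: true dependence
        found.add("RAW")
    if d2 in srcs1:   # later writes what earlier reads: anti-dependence
        found.add("WAR")
    if d1 == d2:      # both write the same register: output dependence
        found.add("WAW")
    return found

print(hazards(("b", ["a"]), ("c", ["b"])))  # {'RAW'}
print(hazards(("b", ["a"]), ("a", ["c"])))  # {'WAR'}
print(hazards(("b", ["a"]), ("b", ["c"])))  # {'WAW'}
```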

10
Data Dependence and Hazards
• Dependences are a property of programs
• Presence of dependence indicates potential for a
hazard, but actual hazard and length of any stall is a
property of the pipeline
• Importance of the data dependencies
1) indicates the possibility of a hazard
2) determines order in which results must be calculated
3) sets an upper bound on how much parallelism can
possibly be exploited
• Next, look at HW schemes to avoid hazard

11
ILP and Data Hazards
• program order: order instructions would execute in if
executed sequentially 1 at a time as determined by
original source program
• HW/SW goal: exploit parallelism by preserving
appearance of program order
– modify order in a manner that cannot be observed by the program
– must not affect the outcome of the program
• Ex: Instructions involved in a name dependence can
execute simultaneously if name used in instructions is
changed so instructions do not conflict
– Register renaming resolves name dependence for regs
– Either by compiler or by HW
– add r1, r2, r3
– sub r2, r4,r5
– and r3, r2, 1
12
Control Dependencies

• Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order
if p1 {
S1;
};
if p2 {
S2;
}
• S1 is control dependent on p1, and S2 is control
dependent on p2 but not on p1.

13
Control Dependence Ignored

• Control dependence need not always be preserved
– willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program
• Instead, 2 properties critical to program correctness
are exception behavior and data flow

14
Exception Behavior
• Preserving exception behavior => any changes in
instruction execution order must not change how
exceptions are raised in program (=> no new exceptions)
• Example:
DADDU R2,R3,R4
BEQZ R2,L1
NOP
LW R1,0(R2)
L1:
• Problem with moving LW before BEQZ?

15
Data Flow

• Data flow: the actual flow of data values among instructions that produce results and those that consume them
– branches make the flow dynamic; they determine which instruction is the supplier of the data
• Example:
DADDU R1,R2,R3
BEQZ R4,L
NOP
DSUBU R1,R5,R6
L: …
OR R7,R1,R8
• OR depends on DADDU or DSUBU?
Must preserve data flow on execution

16
Advantages of
Dynamic Scheduling
• Handles cases when dependences unknown at compile
time
– (e.g., because they may involve a memory reference)
• It simplifies the compiler
• Allows code that compiled for one pipeline to run
efficiently on a different pipeline
• Enables hardware speculation, a technique with significant performance advantages that builds on dynamic scheduling

17
HW Schemes: Instruction Parallelism
• Key idea: Allow instructions behind stall to proceed
DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F12,F8,F14
• Enables out-of-order execution
and allows out-of-order completion
• Will distinguish when an instruction begins execution and when it completes execution; between the 2 times, the instruction is in execution
• In a dynamically scheduled pipeline, all instructions pass
through issue stage in order (in-order issue)

18
Data Hazards: An Example
dest src1 src2
I1 DIVD f6, f6, f4

I2 LD f2, 45(r3)

I3 MULTD f0, f2, f4

I4 DIVD f8, f6, f2

I5 SUBD f10, f0, f6

I6 ADDD f6, f8, f2

RAW Hazards
WAR Hazards
WAW Hazards
20
Complex Pipelining

[Datapath diagram: IF → ID → Issue dispatches to ALU, Mem, Fadd, Fmul, and Fdiv units, which read the GPRs/FPRs and write back at WB.]

Pipelining becomes complex when we want high performance in the presence of:
• Long latency or partially pipelined floating-point units
• Multiple function and memory units
• Memory systems with variable access time
• Precise exceptions
21
Complex In-Order Pipeline

[Pipeline diagram: integer path PC → Inst. Mem → Decode (GPRs) → X1 → X2 + Data Mem → X3 → W; FP path reads FPRs into X1 → X2 (Fadd, Fmul, unpipelined FDiv divider) → X3 → W; the commit point sits before W.]

• Delay writeback so all operations have the same latency to the W stage
– Write ports never oversubscribed (one inst. in & one inst. out every cycle)
– Instructions commit in order, which simplifies precise exception implementation
• How do we prevent the increased writeback latency from slowing down single-cycle integer operations? Bypassing.

22
Complex Pipeline

[Datapath diagram: IF → ID → Issue dispatches to ALU, Mem, Fadd, Fmul, and Fdiv units, reading the GPRs/FPRs and writing back at WB.]

Can we solve write hazards without equalizing all pipeline depths and without bypassing?

23
When is it Safe to Issue an
Instruction?
Suppose a data structure keeps track of all the
instructions in all the functional units

The following checks need to be made before the Issue stage can dispatch an instruction:

• Is the required function unit available?

• Is the input data available?  RAW?

• Is it safe to write the destination? WAR? WAW?

• Is there a structural conflict at the WB stage?

24
Scoreboard for In-order Issues

Busy[FU#] : a bit-vector to indicate each FU’s availability
(FU = Int, Add, Mult, Div)
These bits are hardwired to the FUs.

WP[reg#] : a bit-vector to record the registers for which writes are pending.
These bits are set to true by the Issue stage and set to false by the WB stage.

Issue checks the instruction (opcode dest src1 src2) against the scoreboard (Busy & WP) to dispatch:
FU available?  Busy[FU#]
RAW?           WP[src1] or WP[src2]
WAR?           cannot arise
WAW?           WP[dest]
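These issue checks amount to a few bit tests; a minimal sketch of the scoreboard logic (the FU names and the instruction-field arguments are assumptions for illustration):

```python
busy = {"Int": False, "Add": False, "Mult": False, "Div": False}
wp = {}  # wp[reg] is True while a write to reg is pending

def can_issue(fu, dest, src1, src2):
    """Issue-stage check: structural (FU busy), RAW (pending write to a
    source), and WAW (pending write to the destination). WAR cannot
    arise because issue is in order and operands are read at issue."""
    if busy[fu]:                       # FU available?
        return False
    if wp.get(src1) or wp.get(src2):   # RAW?
        return False
    if wp.get(dest):                   # WAW?
        return False
    return True

def issue(fu, dest):
    busy[fu] = True    # claim the functional unit
    wp[dest] = True    # record the pending write

def writeback(fu, dest):
    busy[fu] = False   # free the functional unit
    wp[dest] = False   # write no longer pending

issue("Mult", "F6")
print(can_issue("Add", "F8", "F6", "F2"))  # False: RAW on F6
print(can_issue("Add", "F8", "F2", "F4"))  # True
```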

25
In-Order Issue Limitations: an example
    instruction           latency
1   LD    F2, 34(R2)      1
2   LD    F4, 45(R3)      long
3   MULTD F6, F4, F2      3
4   SUBD  F8, F2, F2      1
5   DIVD  F4, F2, F8      4
6   ADDD  F10, F6, F4     1

[Dataflow graph: 1 and 2 feed 3; 1 feeds 4; 1 and 4 feed 5; 3 and 5 feed 6.]

In-order: 1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6

The in-order restriction prevents instruction 4 from being dispatched.

(underline indicates the cycle when an instruction writes back)


26
Out-of-Order Issue

[Datapath diagram: IF → ID → Issue buffer dispatching to ALU, Mem, Fadd, and Fmul units, all writing back at WB.]

• The Issue-stage buffer holds multiple instructions waiting to issue.
• Decode adds the next instruction to the buffer if there is space and the instruction does not cause a WAR or WAW hazard.
• Any instruction in the buffer whose RAW hazards are satisfied can be issued (for now, at most one dispatch per cycle). On a write back (WB), new instructions may get enabled.
27
In-Order Issue Limitations: an example
    instruction           latency
1   LD    F2, 34(R2)      1
2   LD    F4, 45(R3)      long
3   MULTD F6, F4, F2      3
4   SUBD  F8, F2, F2      1
5   DIVD  F4, F2, F8      4
6   ADDD  F10, F6, F4     1

[Dataflow graph: 1 and 2 feed 3; 1 feeds 4; 1 and 4 feed 5; 3 and 5 feed 6.]

In-order:      1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6
Out-of-order:  1 (2,1) 4 4 . . . . 2 3 . . 3 5 . . . 5 6 6

Out-of-order execution did not allow any significant improvement!

28
How many instructions can be in
the pipeline?
Which features of an ISA limit the number of
instructions in the pipeline?
Number of Registers

Which features of a program limit the number of


instructions in the pipeline?
Control transfers

Out-of-order dispatch by itself does not provide


any significant performance improvement !

29
Overcoming the Lack of Register
Names
Floating-point pipelines often cannot be kept filled with a small number of registers.
IBM 360 had only 4 Floating Point Registers

Can a microarchitecture use more registers than


specified by the ISA without loss of ISA
compatibility ?

Robert Tomasulo of IBM suggested an ingenious


solution in 1967 based on on-the-fly register renaming

30
Little’s Law

Throughput (T) = Number in Flight (N) / Latency (L)

Issue Execution WB

Example:
• 4 floating point registers
• 8 cycles per floating point operation

⇒ maximum of ½ issue per cycle!
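Little's Law gives the ½-issue bound above directly; a one-line check:

```python
def throughput(in_flight, latency):
    """Little's Law: T = N / L, in operations issued per cycle."""
    return in_flight / latency

# 4 FP registers cap the number of ops in flight at 4;
# each FP op occupies its register name for 8 cycles:
print(throughput(4, 8))  # 0.5 -> at most one issue every other cycle
```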

31
Instruction-level Parallelism via Renaming

    instruction           latency
1   LD    F2, 34(R2)      1
2   LD    F4, 45(R3)      long
3   MULTD F6, F4, F2      3
4   SUBD  F8, F2, F2      1
5   DIVD  F4’, F2, F8     4
6   ADDD  F10, F6, F4’    1

[Dataflow graph: renaming the second write of F4 to F4’ (instructions 5 and 6) removes the antidependence, marked X in the slide’s graph.]

In-order:      1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6
Out-of-order:  1 (2,1) 4 4 5 . . . 2 (3,5) 3 6 6
Any antidependence can be eliminated by renaming.
(renaming ⇒ additional storage)
Can it be done in hardware? Yes!
32
Register Renaming

[Datapath diagram: IF → ID → Issue/ROB dispatching to ALU, Mem, Fadd, and Fmul units, all writing back at WB.]

• Decode does register renaming and adds instructions to the issue-stage reorder buffer (ROB)
⇒ renaming makes WAR or WAW hazards impossible

• Any instruction in the ROB whose RAW hazards have been satisfied can be dispatched.
⇒ Out-of-order or dataflow execution
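The renaming that decode performs can be sketched as a map from architectural to physical registers; allocating a fresh physical tag for every destination is exactly what makes WAR and WAW hazards impossible. A sketch only (register names and tag scheme are illustrative, not from the slides):

```python
from itertools import count

phys = count()   # endless supply of physical register tags
rename = {}      # architectural register -> current physical tag

def ptag(reg):
    """Look up (or create) the physical tag for an architectural register."""
    if reg not in rename:
        rename[reg] = next(phys)
    return rename[reg]

def decode(dest, srcs):
    """Rename sources through the map, then allocate a fresh tag for dest,
    so no later instruction can conflict on the old name (no WAR/WAW)."""
    renamed_srcs = [ptag(s) for s in srcs]
    rename[dest] = next(phys)
    return rename[dest], renamed_srcs

print(decode("F4", ["F2", "F8"]))  # (2, [0, 1])
print(decode("F4", ["F2", "F8"]))  # (3, [0, 1]) -- same dest, fresh tag
```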
33
Dynamic Scheduling Step 1
• Simple pipeline had 1 stage to check both structural
and data hazards: Instruction Decode (ID), also
called Instruction Issue
• Split the ID pipe stage of simple 5-stage pipeline into
2 stages:
• Issue—Decode instructions, check for structural
hazards
• Read operands—Wait until there are no data hazards, then read operands

34
A Dynamic Algorithm:
Tomasulo’s Algorithm
• For IBM 360/91 (before caches!)
• Goal: High Performance without special compilers
• Small number of floating point registers (4 in 360) prevented
interesting compiler scheduling of operations
– This led Tomasulo to try to figure out how to get more effective registers —
renaming in hardware!
• Why study a 1966 computer?
• The descendants of this machine have flourished!
– Alpha 21264, HP PA-8000, MIPS R10000, Pentium III/4,
PowerPC 604, …

35
Tomasulo Algorithm
• Control & buffers distributed with Function Units (FU)
– FU buffers called “reservation stations”; have pending operands
• Registers in instructions replaced by values or pointers to reservation
stations (RS);
– form of register renaming ;
– avoids WAR, WAW hazards
– More reservation stations than registers, so can do optimizations compilers
can’t
• Results to FU from RS, not through registers, over Common Data Bus
that broadcasts results to all FUs
• Load and Stores treated as FUs with RSs as well
• Integer instructions can get ahead & go past branches, allowing
FP ops beyond basic block in FP queue

36
Tomasulo Organization
[Block diagram: an FP Op Queue (fed from memory) and six Load Buffers (Load1–Load6) supply the FP Registers and the reservation stations: Add1–Add3 in front of the FP adders, Mult1–Mult2 in front of the FP multipliers; Store Buffers write to memory; all results are broadcast to the waiting units over the Common Data Bus (CDB).]


37
Reservation Station Components

Op: Operation to perform in the unit (e.g., + or –)


Vj, Vk: Value of the source operands
– Store buffers have a V field: the result to be stored
Qj, Qk: Reservation stations producing the source operands (value to be written)
– Note: Qj,Qk = 0 => ready
– Store buffers only have Qi, for the RS producing the result
Busy: Indicates reservation station or FU is busy

Register result status—Indicates which functional unit will


write each register, if one exists. Blank when no pending
instructions that will write that register.
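The fields listed above map onto a small record per reservation station; this is a sketch of the bookkeeping only, not the original hardware:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    busy: bool = False          # station (or its FU) in use
    op: Optional[str] = None    # operation to perform, e.g. "+" or "-"
    vj: Optional[float] = None  # value of first source operand
    vk: Optional[float] = None  # value of second source operand
    qj: Optional[str] = None    # RS producing Vj (None => Vj is ready)
    qk: Optional[str] = None    # RS producing Vk (None => Vk is ready)

    def ready(self):
        """Both operands present (Qj = Qk = 0 in the slide's notation)."""
        return self.busy and self.qj is None and self.qk is None

rs = ReservationStation(busy=True, op="+", vj=1.5, qk="Mult1")
print(rs.ready())  # False: still waiting on Mult1 to produce Vk
```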

38
Three Stages of Tomasulo Algorithm

1. Issue—get instruction from FP Op Queue


If reservation station free (no structural hazard),
control issues instr & sends operands (renames registers).
2. Execute—operate on operands (EX)
When both operands ready then execute;
if not ready, watch Common Data Bus for result
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting units;
mark reservation station available
• Normal data bus: data + destination (“go to” bus)
• Common data bus: data + source (“come from” bus)
– 64 bits of data + 4 bits of Functional Unit source address
– Write if matches expected Functional Unit (produces result)
– Does the broadcast
• Example speeds:
3 cycles for FP +,-; 10 for *; 40 for /
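Stage 3's broadcast can be sketched as every waiting station comparing the source tag on the bus against its Q fields (the "come from" bus); the station names here follow the example tables on the next slides, but the values are made up:

```python
# Two stations waiting on results; Add2 also waits on Add1 itself.
stations = {
    "Add1": {"qj": "Load1", "vj": None, "qk": None,   "vk": 2.0},
    "Add2": {"qj": "Load1", "vj": None, "qk": "Add1", "vk": None},
}

def cdb_broadcast(source, value):
    """Common Data Bus write: data plus the producing unit's tag.
    Every station waiting on `source` captures the value and clears
    the matching Q field (Q = None means the operand is ready)."""
    for rs in stations.values():
        if rs["qj"] == source:
            rs["vj"], rs["qj"] = value, None
        if rs["qk"] == source:
            rs["vk"], rs["qk"] = value, None

cdb_broadcast("Load1", 3.25)
print(stations["Add1"])        # both operands of Add1 now ready
print(stations["Add2"]["qk"])  # Add1 -- still waiting on the adder
```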

39
Tomasulo Example
Instruction stream
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 Load1 No
LD F2 45+ R3 Load2 No
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2 3 Load/Buffers
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
FU count Add2 No
3 FP Adder R.S.
Add3 No
down 2 FP Mult R.S.
Mult1 No
Mult2 No

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
0 FU

Clock cycle
counter

40
Tomasulo Example Cycle 1
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 Load1 Yes 34+R2
LD F2 45+ R3 Load2 No
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
1 FU Load1

41
Tomasulo Example Cycle 2
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 Load1 Yes 34+R2
LD F2 45+ R3 2 Load2 Yes 45+R3
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU Load2 Load1

Note: Can have multiple loads outstanding

42
Tomasulo Example Cycle 3
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 Load1 Yes 34+R2
LD F2 45+ R3 2 Load2 Yes 45+R3
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes MULTD R(F4) Load2
Mult2 No

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU Mult1 Load2 Load1

• Note: register names are removed (“renamed”) in the Reservation Stations; MULTD issued
• Load1 completing; what is waiting for Load1?
43
Tomasulo Example Cycle 4
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 Load2 Yes 45+R3
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4
DIVD F10 F0 F6
ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 Yes SUBD M(A1) Load2
Add2 No
Add3 No
Mult1 Yes MULTD R(F4) Load2
Mult2 No

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU Mult1 Load2 M(A1) Add1

• Load2 completing; what is waiting for Load2?


44
Tomasulo Example Cycle 5
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4
DIVD F10 F0 F6 5
ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
2 Add1 Yes SUBD M(A1) M(A2)
Add2 No
Add3 No
10 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU Mult1 M(A2) M(A1) Add1 Mult2

• Timer starts counting down for Add1, Mult1


45
Tomasulo Example Cycle 6
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
1 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD M(A2) Add1
Add3 No
9 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU Mult1 M(A2) Add2 Add1 Mult2

• Issue ADDD here despite name dependency on F6?


46
Tomasulo Example Cycle 7
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
0 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD M(A2) Add1
Add3 No
8 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU Mult1 M(A2) Add2 Add1 Mult2

• Add1 (SUBD) completing; what is waiting for it?


47
Tomasulo Example Cycle 8
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
2 Add2 Yes ADDD (M-M) M(A2)
Add3 No
7 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 M(A2) Add2 (M-M) Mult2

48
Tomasulo Example Cycle 9
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
1 Add2 Yes ADDD (M-M) M(A2)
Add3 No
6 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
9 FU Mult1 M(A2) Add2 (M-M) Mult2

49
Tomasulo Example Cycle 10
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
0 Add2 Yes ADDD (M-M) M(A2)
Add3 No
5 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
10 FU Mult1 M(A2) Add2 (M-M) Mult2

• Add2 (ADDD) completing; what is waiting for it?


50
Tomasulo Example Cycle 11
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
4 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
11 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

• Write result of ADDD here?


• All quick instructions complete in this cycle!
51
Tomasulo Example Cycle 12
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
3 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
12 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

52
Tomasulo Example Cycle 13
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
2 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
13 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

53
Tomasulo Example Cycle 14
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
1 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
14 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

54
Tomasulo Example Cycle 15
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
0 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
15 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

• Mult1 (MULTD) completing; what is waiting for it?


55
Tomasulo Example Cycle 16
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 16 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
40 Mult2 Yes DIVD M*F4 M(A1)

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
16 FU M*F4 M(A2) (M-M+M)(M-M) Mult2

• Just waiting for Mult2 (DIVD) to complete


56
Bored waiting for divide to
complete?
(skip a couple of cycles)

57
Tomasulo Example Cycle 55
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 16 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
1 Mult2 Yes DIVD M*F4 M(A1)

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
55 FU M*F4 M(A2) (M-M+M)(M-M) Mult2

58
Tomasulo Example Cycle 56
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 16 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5 56
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
0 Mult2 Yes DIVD M*F4 M(A1)

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
56 FU M*F4 M(A2) (M-M+M)(M-M) Mult2

• Mult2 (DIVD) is completing; what is waiting for it?


59
Tomasulo Example Cycle 57
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 16 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5 56 57
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
57 FU M*F4 M(A2) (M-M+M)(M-M) Result

• Once again: In-order issue, out-of-order execution


and out-of-order completion.
60
Tomasulo Drawbacks

• Complexity
– design delays of 360/91, MIPS R10000, Alpha 21264,
IBM PPC 620, but not in silicon!
• Many associative stores (CDB) at high speed
• Performance limited by Common Data Bus
– Each CDB must go to multiple functional units
⇒ high capacitance, high wiring density
– Number of functional units that can complete per cycle limited to one!
» Multiple CDBs ⇒ more FU logic for parallel associative stores
• Non-precise interrupts!
– We will address this later

61
Tomasulo Loop Example
Loop: L.D F0 0(R1)
MULT.D F4 F0 F2
S.D F4 0(R1)
DSUBI R1 R1 #8
BNE R1 R0 Loop

• This time assume Multiply takes 4 clocks


• Assume 1st load takes 8 clocks
(L1 cache miss), 2nd load takes 1 clock (hit)
• To be clear, will show clocks for SUBI, BNE
– Reality: integer instructions get ahead of Fl. Pt. Instructions
• BNEZ R1 is shorthand for BNE R1,R0
• Show 2 iterations

62
Loop Carried Name Dependencies
Two iterations shown – what the compiler generated
Loop: L.D F0 0(R1)
MULT.D F4 F0 F2
S.D F4 0(R1)
DSUBI R1 R1 #8
BNE R1 R0 Loop

for(j=max;j>0;j--) A[j] = A[j] * c;

63
Loop Carried Name Dependencies

Two iterations shown – what the EXE unit sees ...
Loop: L.D F0 0(R1)
MULT.D F4 F0 F2
S.D F4 0(R1)
DSUBI R1 R1 #8
BNE R1 R0 Loop
Loop: L.D F0 0(R1)
MULT.D F4 F0 F2
S.D F4 0(R1)
DSUBI R1 R1 #8
BNE R1 R0 Loop
...
64
[aside] Compiler Register Renaming
Requires Unrolling the Loop
Loop: L.D F0 0(R1)
MULT.D F4 F0 F2
S.D F4 0(R1)
DSUBI R1 R1 #8
Odd: L.D F10 0(R1)
MULT.D F14 F10 F2
S.D F14 0(R1)
DSUBI R1 R1 #8
BNE R1 R0 Loop

65
Loop Example
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 Load1 No
1 MULTD F4 F0 F2 Load2 No
1 SD F4 0 R1 Load3 No
2 LD F0 0 R1 Store1 No
2 MULTD F4 F0 F2 Store2 No
2 SD F4 0 R1 Store3 No
Reservation Stations: S1 S2 RS Added Store Buffers
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 No SUBI R1 R1 #8
Mult2 No BNEZ R1 Loop
Register result status Instruction Loop
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
0 80 Fu

Value of Register used for address, iteration control


66
Loop Example Cycle 1
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 Load1 Yes 80
Load2 No
Load3 No
Store1 No
Store2 No
Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 No SUBI R1 R1 #8
Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
1 80 Fu Load1

67
Loop Example Cycle 2
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 Load1 Yes 80
1 MULTD F4 F0 F2 2 Load2 No
Load3 No
Store1 No
Store2 No
Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8
Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
2 80 Fu Load1 Mult1

68
Loop Example Cycle 3
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 Load1 Yes 80
1 MULTD F4 F0 F2 2 Load2 No
1 SD F4 0 R1 3 Load3 No
Store1 Yes 80 Mult1
Store2 No
Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8
Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
3 80 Fu Load1 Mult1

• Implicit renaming sets up data flow graph


69
Loop Example Cycle 4
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 Load1 Yes 80
1 MULTD F4 F0 F2 2 Load2 No
1 SD F4 0 R1 3 Load3 No
Store1 Yes 80 Mult1
Store2 No
Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8
Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
4 80 Fu Load1 Mult1

• Dispatching SUBI Instruction (not in FP queue)


70
Loop Example Cycle 5
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 Load1 Yes 80
1 MULTD F4 F0 F2 2 Load2 No
1 SD F4 0 R1 3 Load3 No
Store1 Yes 80 Mult1
Store2 No
Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8
Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
5 72 Fu Load1 Mult1

• And, BNEZ instruction (not in FP queue)


71
Loop Example Cycle 6
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 Load1 Yes 80
1 MULTD F4 F0 F2 2 Load2 Yes 72
1 SD F4 0 R1 3 Load3 No
2 LD F0 0 R1 6 Store1 Yes 80 Mult1
Store2 No
Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8
Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
6 72 Fu Load2 Mult1

• Notice that F0 never sees Load from location 80


72
Loop Example Cycle 7
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 Load1 Yes 80
1 MULTD F4 F0 F2 2 Load2 Yes 72
1 SD F4 0 R1 3 Load3 No
2 LD F0 0 R1 6 Store1 Yes 80 Mult1
2 MULTD F4 F0 F2 7 Store2 No
Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8
Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
7 72 Fu Load2 Mult2
• Register file completely detached from computation
• First and Second iteration completely overlapped

73
Loop Example Cycle 8
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 Load1 Yes 80
1 MULTD F4 F0 F2 2 Load2 Yes 72
1 SD F4 0 R1 3 Load3 No
2 LD F0 0 R1 6 Store1 Yes 80 Mult1
2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2
2 SD F4 0 R1 8 Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8
Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
8 72 Fu Load2 Mult2

74
Loop Example Cycle 9
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 Load1 Yes 80
1 MULTD F4 F0 F2 2 Load2 Yes 72
1 SD F4 0 R1 3 Load3 No
2 LD F0 0 R1 6 Store1 Yes 80 Mult1
2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2
2 SD F4 0 R1 8 Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8
Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
9 72 Fu Load2 Mult2

• Load1 completing: who is waiting?


• Note: Dispatching SUBI

75
Loop Example Cycle 10
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No
1 MULTD F4 F0 F2 2 Load2 Yes 72
1 SD F4 0 R1 3 Load3 No
2 LD F0 0 R1 6 10 Store1 Yes 80 Mult1
2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2
2 SD F4 0 R1 8 Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
4 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8
Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
10 64 Fu Load2 Mult2

• Load2 completing: who is waiting?


• Note: Dispatching BNEZ

76
Loop Example Cycle 11
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No
1 MULTD F4 F0 F2 2 Load2 No
1 SD F4 0 R1 3 Load3 Yes 64
2 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult1
2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2
2 SD F4 0 R1 8 Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
3 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8
4 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
11 64 Fu Load3 Mult2

• Next load in sequence


77
Loop Example Cycle 12
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No
1 MULTD F4 F0 F2 2 Load2 No
1 SD F4 0 R1 3 Load3 Yes 64
2 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult1
2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2
2 SD F4 0 R1 8 Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
2 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8
3 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
12 64 Fu Load3 Mult2

• Why not issue third multiply?


78
Loop Example Cycle 13
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No
1 MULTD F4 F0 F2 2 Load2 No
1 SD F4 0 R1 3 Load3 Yes 64
2 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult1
2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2
2 SD F4 0 R1 8 Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
1 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8
2 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
13 64 Fu Load3 Mult2

• Why not issue third store?


79
Loop Example Cycle 14
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No
1 MULTD F4 F0 F2 2 14 Load2 No
1 SD F4 0 R1 3 Load3 Yes 64
2 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult1
2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2
2 SD F4 0 R1 8 Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
0 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8
1 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
14 64 Fu Load3 Mult2

• Mult1 completing. Who is waiting?


80
Loop Example Cycle 15
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No
1 MULTD F4 F0 F2 2 14 15 Load2 No
1 SD F4 0 R1 3 Load3 Yes 64
2 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R2
2 MULTD F4 F0 F2 7 15 Store2 Yes 72 Mult2
2 SD F4 0 R1 8 Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 No SUBI R1 R1 #8
0 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
15 64 Fu Load3 Mult2

• Mult2 completing. Who is waiting?


81
Loop Example Cycle 16
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No
1 MULTD F4 F0 F2 2 14 15 Load2 No
1 SD F4 0 R1 3 Load3 Yes 64
2 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R2
2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R2
2 SD F4 0 R1 8 Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
4 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8
Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
16 64 Fu Load3 Mult1

82
Loop Example Cycle 17
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No
1 MULTD F4 F0 F2 2 14 15 Load2 No
1 SD F4 0 R1 3 Load3 Yes 64
2 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R2
2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R2
2 SD F4 0 R1 8 Store3 Yes 64 Mult1
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8
Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
17 64 Fu Load3 Mult1

83
Loop Example Cycle 18
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No
1 MULTD F4 F0 F2 2 14 15 Load2 No
1 SD F4 0 R1 3 18 Load3 Yes 64
2 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R2
2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R2
2 SD F4 0 R1 8 Store3 Yes 64 Mult1
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8
Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
18 64 Fu Load3 Mult1

84
Loop Example Cycle 19
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No
1 MULTD F4 F0 F2 2 14 15 Load2 No
1 SD F4 0 R1 3 18 19 Load3 Yes 64
2 LD F0 0 R1 6 10 11 Store1 No
2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R2
2 SD F4 0 R1 8 19 Store3 Yes 64 Mult1
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8
Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
19 56 Fu Load3 Mult1

85
Loop Example Cycle 20
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 Yes 56
1 MULTD F4 F0 F2 2 14 15 Load2 No
1 SD F4 0 R1 3 18 19 Load3 Yes 64
2 LD F0 0 R1 6 10 11 Store1 No
2 MULTD F4 F0 F2 7 15 16 Store2 No
2 SD F4 0 R1 8 19 20 Store3 Yes 64 Mult1
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8
Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
20 56 Fu Load1 Mult1

• Once again: In-order issue, out-of-order execution


and out-of-order completion.
86
Memory Hazards
• With the Tomasulo algorithm, memory loads and stores
can complete in any order, provided that:
– Only one processor in the system is using those memory addresses
– All loads and stores are to different addresses
• To avoid RAW, WAR, and WAW hazards within a single
processor, memory operations to the same address must
occur in program order
• Simple resolution, check addresses of previously issued
loads and stores
– Loads from same memory address must wait for existing stores (RAW -
use bypassing – more in chapter 5)
– Stores must wait for existing stores and loads to same address

87
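A minimal sketch of the address check described above, in Python: a load may bypass earlier stores only if no prior in-flight store targets the same address. The function name and data structures are illustrative, not the 360/91 hardware.

```python
def load_may_execute(load_addr, pending_stores):
    """Return True if a load can execute ahead of queued stores.

    pending_stores: addresses of stores issued earlier in program
    order that have not yet written memory.
    """
    # RAW hazard: a prior store to the same address must complete
    # (or forward its data) before the load reads memory.
    return all(st_addr != load_addr for st_addr in pending_stores)

print(load_may_execute(80, [72, 64]))   # no conflict -> True
print(load_may_execute(72, [72, 64]))   # conflicts with store to 72 -> False
```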
Why can Tomasulo overlap iterations of
loops?

• Register renaming
– Multiple iterations use different physical destinations for registers
(dynamic loop unrolling).

• Reservation stations
– Permit instruction issue to advance past integer control flow operations
– Also buffer old values of registers - totally avoiding the WAR stall that we
saw in the scoreboard.

• Other perspective: Tomasulo building data flow


dependency graph on the fly.

88
Tomasulo’s scheme offers 2 major
advantages
(1) the distribution of the hazard detection logic
– distributed reservation stations and the CDB
– If multiple instructions are waiting on a single result, and each
already has its other operand, then all can be released
simultaneously by the broadcast on the CDB
"Multicast result forwarding"
– If a centralized register file were used, the units would have to read their
results from the registers when register buses are available.
(2) the elimination of stalls for WAW and WAR hazards

89
What about Precise Interrupts?
• State of machine looks as if no instruction beyond the faulting
instruction has issued

• Tomasulo had:

In-order issue, out-of-order execution, and out-of-order


completion

• Need to “fix” the out-of-order completion aspect so that


we can find precise breakpoint in instruction stream, and
provide single PC to return to.

90
Relationship between precise
interrupts and speculation:
• Speculation: guess and check
• Important for branch prediction:
– Need to “take our best shot” at predicting branch direction.
• If we speculate and are wrong, need to back up and
restart execution to point at which we predicted
incorrectly:
– This is exactly same as precise exceptions!
• Technique for both precise interrupts/exceptions and
speculation: in-order completion or commit

91
HW support for precise interrupts
• Need HW buffer for results of
uncommitted instructions:
reorder buffer
– 3 fields: instr, destination, value
– Use reorder buffer number instead of
reservation station when execution
completes
– Supplies operands between execution
complete & commit
– (Reorder buffer can be operand
source => more registers like RS)
– Instructions commit
– Once instruction commits,
result is put into register
– As a result, easy to undo speculated
instructions
on mispredicted branches
or exceptions
[Figure: FP Op Queue and FP Regs feed the reservation stations and FP
adders; results are held in the Reorder Buffer until commit]

92
Four Steps of Speculative
Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue
If reservation station and reorder buffer slot free, issue instr & send operands
& reorder buffer no. for destination (this stage sometimes called “dispatch”)
2. Execution—operate on operands (EX)
When both operands ready then execute; if not ready, watch CDB for result;
when both in reservation station, execute; checks RAW (sometimes called
“issue”)
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting FUs
& reorder buffer; mark reservation station available.
4. Commit—update register (or memory) with reorder result
When instr. at head of reorder buffer & result present, update register with
result (or store to memory) and remove instr from reorder buffer. Mispredicted
branch flushes reorder buffer (sometimes called “graduation”)

93
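Step 4 above (in-order commit) can be sketched in a few lines of Python. The ROB entries follow the slide's "instr, destination, value" description; the field names and the `ready` flag are illustrative.

```python
from collections import deque

def commit(rob, regfile):
    """Retire completed instructions from the ROB head, in order."""
    while rob and rob[0]["ready"]:
        entry = rob.popleft()
        regfile[entry["dest"]] = entry["value"]   # architectural update

rob = deque([
    {"dest": "F0", "value": 1.5, "ready": True},
    {"dest": "F4", "value": 4.5, "ready": True},
    {"dest": "F0", "value": None, "ready": False},  # still executing
])
regs = {}
commit(rob, regs)
print(regs)        # {'F0': 1.5, 'F4': 4.5}
print(len(rob))    # 1 -- the unfinished instruction blocks further commit
```

Note that the third write to F0 stays buffered: later instructions cannot update architectural state out of order, which is exactly what makes mispredicted branches and exceptions easy to undo.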
What are the hardware complexities with
reorder buffer (ROB)?

Compare network
Program Counter
Reorder
Buffer
Exceptions?
FP
Dest Reg

Op
Result

Queue
Valid
FP Regs

Reorder Table Res Stations Res Stations


FP Adder FP Adder

• How do you find the latest version of a register?


– (As specified by Smith paper) need associative comparison network
– Could use a future file, or just use the register result status buffer to track which reorder buffer entry holds the value
• Need as many ports on ROB as register file

94
Tomasulo Summary
• Reservations stations: implicit register renaming to larger
set of registers + buffering source operands
– Prevents registers as bottleneck
– Avoids WAR, WAW hazards of Scoreboard
– Allows loop unrolling in HW
• Not limited to basic blocks
(integer units gets ahead, beyond branches)
• Today, helps cache misses as well
– Don’t stall for L1 Data cache miss (insufficient ILP for L2 miss?)
• Lasting Contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation
• 360/91 descendants are Pentium III++; PowerPC 604;
MIPS R10000; HP-PA 8000; Alpha 21264

95
Tomasulo Algorithm and Branch
Prediction
• 360/91 predicted branches, but did not speculate: pipeline
stopped until the branch was resolved
– No speculation; only instructions that can complete
• Speculation with Reorder Buffer allows execution past
branch, and then discard if branch fails
– just need to hold instructions in buffer until branch can commit

96
Case for Branch Prediction when
Issue N instructions per clock cycle
1. Branches will arrive up to n times faster in an n-issue
processor
2. Amdahl’s Law => relative impact of the control stalls will
be larger with the lower potential CPI in an n-issue
processor

98
7 Branch Prediction Schemes
1. 1-bit Branch-Prediction Buffer
2. 2-bit Branch-Prediction Buffer
3. Correlating Branch Prediction Buffer
4. Tournament Branch Predictor

5. Branch Target Buffer


6. Integrated Instruction Fetch Units
7. Return Address Predictors

99
Dynamic Branch Prediction

• Performance = ƒ(accuracy, cost of misprediction)


• Branch History Table: Lower bits of PC address index
table of 1-bit values
– Says whether or not branch taken last time
– No address check (saves HW, but may not be right branch)
• Problem: in a loop, 1-bit BHT will cause
2 mispredictions (avg. is 9 iterations before exit):
– End of loop case, when it exits instead of looping as before
– First time through loop on next time through code, when it predicts exit
instead of looping
– Only 80% accuracy even if loop 90% of the time

100
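The two-mispredictions-per-loop behavior described above can be checked with a short simulation. This is an illustrative sketch: the trace of 9 taken iterations followed by one not-taken exit matches the slide's average, and the accuracy converges to the quoted 80%.

```python
def one_bit_accuracy(outcomes, init=True):
    """Simulate a 1-bit predictor over a branch outcome trace."""
    pred, correct = init, 0
    for taken in outcomes:
        correct += (pred == taken)
        pred = taken          # 1-bit: remember only the last outcome
    return correct / len(outcomes)

# 9 taken iterations, then the not-taken loop exit, repeated 1000 times
trace = ([True] * 9 + [False]) * 1000
print(round(one_bit_accuracy(trace), 3))  # ~0.8: two misses per loop visit
```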
Dynamic Branch Prediction
(Jim Smith, 1981)

• Solution: 2-bit scheme where change prediction only if get


misprediction twice:

[State diagram: four states — two "Predict Taken", two "Predict Not
Taken" — with T/NT transitions arranged so that the prediction changes
only after two consecutive mispredictions]
• Red: stop, not taken
• Green: go, taken
• Adds hysteresis to the decision-making process
101
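A sketch of the 2-bit scheme as a saturating counter (one common encoding of the state diagram above; the class name and state numbering are ours). On the loop branch from the previous slide, only the loop exits now mispredict.

```python
class TwoBitCounter:
    """2-bit saturating counter: 0-1 predict not-taken, 2-3 predict taken."""
    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward strongly-taken or strongly-not-taken
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

c = TwoBitCounter()
hits = 0
for taken in [True] * 9 + [False] + [True] * 9 + [False]:
    hits += (c.predict() == taken)
    c.update(taken)
print(hits)  # 18 of 20 -- only the two loop exits mispredict
```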
Dynamic Branch Prediction With
more Hysteresis fig 3.7

[State diagram (fig. 3.7): the 2-bit saturating-counter variant — the
same four states, but each taken/not-taken outcome moves one step
toward "strongly taken" or "strongly not taken"]

102
Correlating Branches
• Idea: the taken/not-taken history of recently executed branches is
related to the behavior of the next branch (as well as that
branch's own history)
– Behavior of recent branches then selects between, say,
4 predictions of the next branch,
updating just that prediction
– One of these 4 predictors is selected by the global branch
history
• (2,2) predictor: 2-bit global,
2-bit local
[Figure: the branch address (4 bits) selects a row of 2-bit local
predictors; a 2-bit global branch history (01 = not taken then taken)
selects which of the four columns supplies the prediction]
103
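A toy (2,2) correlating predictor in Python, following the structure above: a 2-bit global history selects one of four 2-bit counters per branch-address entry. Table size and indexing are illustrative, not the exact figure.

```python
class CorrelatingPredictor:
    def __init__(self, entries=16, hist_bits=2):
        self.hist_bits = hist_bits
        self.global_hist = 0
        # entries rows, each with 2^hist_bits 2-bit saturating counters
        self.table = [[2] * (1 << hist_bits) for _ in range(entries)]

    def _row(self, pc):
        return self.table[pc % len(self.table)]

    def predict(self, pc):
        return self._row(pc)[self.global_hist] >= 2

    def update(self, pc, taken):
        row = self._row(pc)
        row[self.global_hist] = (min(3, row[self.global_hist] + 1) if taken
                                 else max(0, row[self.global_hist] - 1))
        mask = (1 << self.hist_bits) - 1
        self.global_hist = ((self.global_hist << 1) | taken) & mask

p = CorrelatingPredictor()
# A branch whose outcome depends on the two previous outcomes:
pattern = [1, 1, 1, 0, 0] * 40
hits = 0
for taken in pattern:
    hits += (p.predict(4) == bool(taken))
    p.update(4, taken)
print(hits > 150)  # learns most of the correlated pattern -> True
```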
Accuracy of Different Schemes
(Figure 3.8, p. 200)

[Chart: frequency of mispredictions, 0%–18%, comparing a 4096-entry
2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT]

104
Re-evaluating Correlation

• Several of the SPEC benchmarks have less than


a dozen branches responsible for 90% of
executed branches:
program branch % static # = 90%
compress 14% 236 13
eqntott 25% 494 5
gcc 15% 9531 2020
mpeg 10% 5598 532
real gcc 13% 17361 3214
• Real programs + OS more like gcc
• Small benefits beyond benchmarks for
correlation? problems with branch aliases?

105
BHT Accuracy
• Mispredict because either:
– Wrong guess for that branch
– Got branch history of wrong branch when index the table
• With a 4096-entry table, misprediction rates vary from 1%
(nasa7, tomcatv) to 18% (eqntott), with spice at 9% and
gcc at 12%
• For SPEC92,
4096 about as good as infinite table

106
Tournament Predictors

• Motivation for correlating branch predictors is 2-bit


predictor failed on important branches; by adding
global information, performance improved
• Tournament predictors: use 2 predictors, 1 based on
global information and 1 based on local information,
and combine with a selector
• Hopes to select right predictor for right branch

108
Tournament Predictor in Alpha 21264

• 4K 2-bit counters to choose from among a global predictor and


a local predictor
• Global predictor also has 4K entries and is indexed by the
history of the last 12 branches; each entry in the global
predictor is a standard 2-bit predictor
– 12-bit pattern: ith bit 0 => ith prior branch not taken;
ith bit 1 => ith prior branch taken;
• Local predictor consists of a 2-level predictor:
– Top level: a local history table of 1024 10-bit entries;
each 10-bit entry records the most recent 10 branch
outcomes for that entry, allowing patterns spanning 10 branches
to be discovered and predicted
– Next level: the selected local history entry indexes a
table of 1K 3-bit saturating counters,
which provide the local prediction
• Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits!
(~180,000 transistors)
109
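The size arithmetic above can be checked directly (the component names follow the slide; `global_` avoids the Python keyword):

```python
choice  = 4096 * 2      # 4K 2-bit selection ("choice") counters
global_ = 4096 * 2      # 4K 2-bit global predictor entries
local_h = 1024 * 10     # 1K 10-bit local history entries
local_c = 1024 * 3      # 1K 3-bit local saturating counters

total = choice + global_ + local_h + local_c
print(total, total // 1024)   # 29696 bits = 29 Kbits, as the slide states
```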
% of predictions from local
predictor in Tournament Prediction
Scheme
0% 20% 40% 60% 80% 100%
nasa7 98%
matrix300 100%
tomcatv 94%
doduc 90%
spice 55%
fpppp 76%
gcc 72%
espresso 63%
eqntott 37%
li 69%

110
Accuracy of Branch Prediction
Branch prediction accuracy (profile-based vs. 2-bit counter vs.
tournament predictor):

benchmark   Profile-based   2-bit counter   Tournament
tomcatv          99%             99%           100%
doduc            95%             84%            97%
fpppp            86%             82%            98%
li               88%             77%            98%
espresso         86%             82%            96%
gcc              88%             70%            94%
• Profile: branch profile from last execution
(static in that in encoded in instruction, but profile)
111
Accuracy v. Size (SPEC89)
[Chart: conditional branch misprediction rate (0%–10%) versus total
predictor size (0–128 Kbits) for local, correlating, and tournament
predictors; at every size the tournament predictor mispredicts least
and the local predictor most]


112
Predicated Execution
no branches to worry about predicting
• Avoid branch prediction by turning branches into
conditionally executed instructions:
if (x) then A = B op C else NOP
– If false, then neither store result nor cause exception
– Expanded ISA of Alpha, MIPS, PowerPC, SPARC have
conditional move; PA-RISC can annul any following instr.
– IA-64: 64 1-bit condition fields selected,
so conditional execution of any instruction
– This transformation is called “if-conversion”
• Drawbacks to conditional instructions
– Still takes a clock even if “annulled”
– Stall if condition evaluated late
– Complex conditions reduce effectiveness;
condition becomes known late in pipeline

113
MIPS Conditional Move
Move Conditional on not Zero
MOVN rd, rs, rt # if rt != 0 then rd <= rs
Move Conditional on Zero
MOVZ rd, rs, rt # if rt = 0 then rd <= rs
also floating point versions

114
MAX function without branches
r1=MAX(r2,r3)
if (r2>r3) r1=r2; else r1=r3;

slt r4 <- r3,r2 #set on less than r3<r2


or r1 <- r3,r0 #unconditional move
movn r1 <- r2,r4 #conditional move (r4!=0)

Also fewer instructions than using branches

115
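The three-instruction MIPS sequence above can be emulated in Python to see that the result needs no branch on the data path: `slt` produces a flag, the move seeds the result, and `movn` overwrites it only when the flag is nonzero. The arithmetic select in the last step stands in for the hardware conditional move; it is a sketch, not how the ISA defines it.

```python
def max_branchless(r2, r3):
    r4 = int(r3 < r2)             # slt  r4 <- r3, r2
    r1 = r3                       # or   r1 <- r3, r0  (unconditional move)
    # movn r1 <- r2, r4 : select r2 iff r4 != 0, with no `if`
    r1 = r4 * r2 + (1 - r4) * r1
    return r1

print(max_branchless(7, 3))   # 7
print(max_branchless(2, 9))   # 9
```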
Pitfall: Sometimes bigger and
dumber is better
• 21264 uses tournament predictor (29 Kbits) with 1K local
predictors
• Earlier 21164 uses a simple 2-bit predictor with 2K entries
(or a total of 4 Kbits)
• SPEC95 benchmarks, 21264 outperforms
– 21264 avg. 11.5 mispredictions per 1000 instructions
– 21164 avg. 16.5 mispredictions per 1000 instructions
• Reversed for transaction processing (TP) !
– 21264 avg. 17 mispredictions per 1000 instructions
– 21164 avg. 15 mispredictions per 1000 instructions
• TP code much larger & the 21164 holds 2X branch predictions
based on local behavior (2K entries vs. the 1K local predictor in the
21264)

116
Limitations of BHTs
Only predicts branch direction. Therefore, cannot redirect
fetch stream until after branch target is determined.

Correctly predicted   A  PC Generation/Mux
taken branch          P  Instruction Fetch Stage 1
penalty               F  Instruction Fetch Stage 2
                      B  Branch Address Calc/Begin Decode
Jump Register         I  Complete Decode
penalty               J  Steer Instructions to Functional units
                      R  Register File Read
                      E  Integer Execute
                         (+ another 6 stages of execute pipeline)

UltraSPARC-III fetch pipeline


127
Branch Target Buffer
[Figure: the PC indexes a 2^k-entry Branch Target Buffer in parallel
with the IMEM; the BTB supplies the predicted target address and the
BP bits]
BP bits are stored with the predicted target address.

IF stage: If (BP=taken) then nPC=target else nPC=PC+4


later: check prediction, if wrong then kill the instruction
and update BTB & BPb else update BPb
128
Address Collisions

[Figure: instruction memory holds "Jump 100" at PC 132 and "Add ....."
at PC 1028; assuming a 128-entry BTB, both addresses map to the same
entry, whose fields read target = 236, BPb = take]

What will be fetched after the instruction at 1028?
BTB prediction = 236
Correct target = 1032

⇒ kill PC=236 and fetch PC=1032

Is this a common occurrence?


Can we avoid these bubbles?

129
BTB is only for Control Instructions

BTB contains useful information for branch and
jump instructions only
⇒ Do not update it for other instructions

For all other instructions the next PC is PC+4 !

How to achieve this effect without decoding the
instruction?

130
Branch Target Buffer (BTB)
2k-entry direct-mapped BTB
I-Cache PC (can also be associative)
Entry PC Valid predicted
target PC

match valid target


• Keep both the branch PC and target PC in the BTB
• PC+4 is fetched if match fails
• Only taken branches and jumps held in BTB
• Next PC determined before branch fetched and decoded
131
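The slides' direct-mapped BTB (and the 132/1028 collision from the earlier slide) can be sketched in Python. Entry format and sizes are illustrative; with 128 entries, both PC 132 and PC 1028 index entry 4, and the stored entry-PC tag is what prevents the false prediction.

```python
class BTB:
    def __init__(self, k=7):                 # 2^k = 128 entries
        self.mask = (1 << k) - 1
        self.entries = {}                    # index -> (entry_pc, target)

    def insert(self, pc, target):
        # Only taken branches and jumps are entered
        self.entries[pc & self.mask] = (pc, target)

    def next_pc(self, pc):
        e = self.entries.get(pc & self.mask)
        if e and e[0] == pc:                 # tag match: predict taken
            return e[1]
        return pc + 4                        # no match: sequential fetch

btb = BTB()
btb.insert(132, 100)       # jump at 132 -> 100
print(btb.next_pc(132))    # 100 (hit)
print(btb.next_pc(1028))   # 1032: same index, tag mismatch, falls through
```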
Consulting BTB Before Decoding

132 Jump 100

entry PC target BPb


132 236 take 1028 Add .....

• The match for PC=1028 fails and 1028+4 is fetched
⇒ eliminates false predictions after ALU instructions

• BTB contains entries only for control transfer instructions
⇒ more room to store branch targets

132
Combining BTB and BHT
• BTB entries are considerably more expensive than BHT, but can
redirect fetches at earlier stage in pipeline and can accelerate
indirect branches (JR)
• BHT can hold many more entries and is more accurate

                 A  PC Generation/Mux
BTB              P  Instruction Fetch Stage 1
                 F  Instruction Fetch Stage 2
BHT in later     B  Branch Address Calc/Begin Decode
pipeline stage   I  Complete Decode
corrects when    J  Steer Instructions to Functional units
BTB misses a     R  Register File Read
predicted        E  Integer Execute
taken branch

BTB/BHT only updated after branch resolves in E stage

133
Uses of Jump Register (JR)
• Switch statements (jump to address of matching case)
– BTB works well if same case used repeatedly

• Dynamic function call (jump to run-time function address)
– BTB works well if same function usually called, (e.g., in
C++ programming, when objects have same type in
virtual function call)

• Subroutine returns (jump to return address)
– BTB works well if usually return to the same place
⇒ Often one function called from many distinct call sites!

How well does BTB work for each of these cases?

134
Subroutine Return Stack
Small structure to accelerate JR for subroutine returns,
typically much more accurate than BTBs.
fa() { fb(); }
fb() { fc(); }
fc() { fd(); }
Push the call's return address when a function call executes;
pop it when a subroutine return is decoded.

[Figure: k-entry stack (typically k = 8–16) holding &fd(), &fc(), &fb()]
135
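A sketch of the k-entry return address stack: push on call, pop on return. Dropping the oldest entry on overflow is one common policy, not the only one; for the fa→fb→fc call chain above, fc's return address pops first.

```python
class ReturnStack:
    def __init__(self, k=8):
        self.k, self.stack = k, []

    def push(self, return_pc):               # on call execute
        if len(self.stack) == self.k:
            self.stack.pop(0)                # overwrite oldest on overflow
        self.stack.append(return_pc)

    def pop(self):                           # on return decode
        return self.stack.pop() if self.stack else None

ras = ReturnStack()
for ret in [0x1004, 0x2004, 0x3004]:         # fa calls fb calls fc
    ras.push(ret)
print(hex(ras.pop()))   # 0x3004: innermost call returns first
print(hex(ras.pop()))   # 0x2004
```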
Mispredict Recovery

In-order execution machines:


– Assume no instruction issued after branch can write-back before
branch resolves
– Kill all instructions in pipeline behind mispredicted branch

Out-of-order execution?

– Multiple instructions following branch in program


order can complete before branch resolves

136
In-Order Commit for Precise Exceptions
In-order Out-of-order In-order

Fetch Decode Reorder Buffer Commit

Kill Kill
Kill
Execute
Inject handler PC Exception?

• Instructions fetched and decoded into instruction


reorder buffer in-order
• Execution is out-of-order (⇒ out-of-order completion)
• Commit (write-back to architectural state, i.e., regfile &
memory) is in-order
Temporary storage needed in ROB to hold results before commit

137
Branch Misprediction in Pipeline
Inject correct PC

Branch Kill Branch


Prediction Resolution

Kill Kill

PC Fetch Decode Reorder Buffer Commit

Complete

Execute

• Can have multiple unresolved branches in ROB


• Can resolve branches out-of-order by killing all the
instructions in ROB that follow a mispredicted branch

138
Recovering ROB/Renaming Table

[Figure: rename table (r1, r2, ...) with snapshots; register file
(t1 ... tn); reorder buffer entries hold (Ins#, use, exec, op, p1, src1,
p2, src2, pd, dest, data), with Ptr2 marking next-to-commit and Ptr1
next-available; the FUs and load/store units broadcast <t, result>;
rollback restores an earlier snapshot]

Take snapshot of register rename table at each predicted


branch, recover earlier snapshot if branch mispredicted
139
Speculating Both Directions
An alternative to branch prediction is to execute
both directions of a branch speculatively
• resource requirement is proportional to the
number of concurrent speculative executions

• only half the resources engage in useful work


when both directions of a branch are executed
speculatively

• branch prediction takes less resources


than speculative execution of both paths
With accurate branch prediction, it is more cost
effective to dedicate all resources to the predicted
direction
140
“Data in ROB” Design
(HP PA8000, Pentium Pro, Core2Duo)
[Figure: the register file holds only committed state; reorder buffer
entries (use, exec, op, p1, src1, p2, src2, pd, dest, data) feed the
FUs and load/store units, which broadcast <t, result> back to the ROB]

• On dispatch into ROB, ready sources can be in regfile or in ROB


dest (copied into src1/src2 if ready before dispatch)
• On completion, write to dest field and broadcast to src fields.
• On issue, read from ROB src fields
141
Unified Physical Register File
(MIPS R10K, Alpha 21264, Pentium 4)

[Figure: the rename table maps r1, r2, ... to physical registers; a
unified register file (t1 ... tn), with snapshots for mispredict
recovery, feeds the FUs and load/store units, which broadcast
<t, result>; ROB not shown]

• One regfile for both committed and speculative values (no data in ROB)
• During decode, instruction result allocated new physical register, source
regs translated to physical regs through rename table
• Instruction reads data from regfile at start of execute (not in decode)
• Write-back updates reg. busy bits on instructions in ROB (assoc. search)
• Snapshots of rename table taken at every branch to recover mispredicts
• On exception, renaming undone in reverse order of issue (MIPS R10000)

142
Pipeline Design with Physical Regfile
[Figure: the in-order front end (PC, Fetch, Decode & Rename) feeds the
reorder buffer; the out-of-order execute stage reads the physical
register file into the branch unit, ALU, MEM/D$, and store buffer;
branch resolution kills wrong-path instructions and updates the
predictors; commit is in-order]
143
Lifetime of Physical Registers
• Physical regfile holds committed and speculative values
• Physical registers decoupled from ROB entries (no data in ROB)

ld r1, (r3) ld P1, (Px)


add r3, r1, #4 add P2, P1, #4
sub r6, r7, r9 sub P3, Py, Pz
add r3, r3, r6 Rename add P4, P2, P3
ld r6, (r1) ld P5, (P1)
add r6, r6, r3 add P6, P5, P4
st r6, (r1) st P6, (P1)
ld r6, (r11) ld P7, (Pw)
When can we reuse a physical register?
When next write of same architectural register commits
144
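A toy sketch of that lifetime rule: renaming a destination allocates a physical register from the free list, and the previous mapping (the LPRd recorded in the ROB entry) returns to the free list only when the new write commits. Names and structures are illustrative.

```python
free_list = ["P0", "P1", "P2"]
rename = {"r1": "P8"}          # current architectural-to-physical map

def rename_dest(arch_reg):
    """Allocate a new physical dest; return (PRd, LPRd) for the ROB."""
    old = rename[arch_reg]
    new = free_list.pop(0)
    rename[arch_reg] = new
    return new, old

def commit(lprd):
    # Safe to recycle: no in-flight instruction can still read the
    # old value once the overwriting instruction has committed.
    free_list.append(lprd)

prd, lprd = rename_dest("r1")
print(prd, lprd)               # P0 P8
commit(lprd)
print(free_list)               # ['P1', 'P2', 'P8']
```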
Physical Register Management
Rename Physical Regs Free List
Table P0 P0
R0 P1 P1
R1 P8 P2 P3 ld r1, 0(r3)
R2 P3 P2
R3 P7 P4 P4 add r3, r1, #4
R4 P5 <R6> p
R5 P6 <R7> p sub r6, r7, r6
R6 P5 P7 <R3> p add r3, r3, r6
R7 P6 P8 <R1> p
ld r6, 0(r1)
Pn
ROB
use ex op p1 PR1 p2 PR2 Rd LPRd PRd (LPRd requires
third read port
on Rename
Table for each
instruction)

145
Physical Register Management
Rename Physical Regs Free List
Table P0 P0
R0 P1 P1
R1 P8 P0 P2 P3 ld r1, 0(r3)
R2 P3 P2
R3 P7 P4 P4 add r3, r1, #4
R4 P5 <R6> p
R5 P6 <R7> p sub r6, r7, r6
R6 P5 P7 <R3> p add r3, r3, r6
R7 P6 P8 <R1> p
ld r6, 0(r1)
Pn
ROB
use ex op p1 PR1 p2 PR2 Rd LPRd PRd
x ld p P7 r1 P8 P0

146
Physical Register Management
Rename Physical Regs Free List
Table P0 P0
R0 P1 P1
R1 P8 P0 P2 P3 ld r1, 0(r3)
R2 P3 P2
R3 P7 P1 P4 P4 add r3, r1, #4
R4 P5 <R6> p
R5 P6 <R7> p sub r6, r7, r6
R6 P5 P7 <R3> p add r3, r3, r6
R7 P6 P8 <R1> p
ld r6, 0(r1)
Pn
ROB
use ex op p1 PR1 p2 PR2 Rd LPRd PRd
x ld p P7 r1 P8 P0
x add P0 r3 P7 P1

147
Physical Register Management
Rename Physical Regs Free List
Table P0 P0
R0 P1 P1
R1 P8 P0 P2 P3 ld r1, 0(r3)
R2 P3 P2
R3 P7 P1 P4 P4 add r3, r1, #4
R4 P5 <R6> p
R5 P6 <R7> p sub r6, r7, r6
R6 P5 P3 P7 <R3> p add r3, r3, r6
R7 P6 P8 <R1> p
ld r6, 0(r1)
Pn
ROB
use ex op p1 PR1 p2 PR2 Rd LPRd PRd
x ld p P7 r1 P8 P0
x add P0 r3 P7 P1
x sub p P6 p P5 r6 P5 P3

148
Physical Register Management
Rename Physical Regs Free List
Table P0 P0
R0 P1 P1
R1 P8 P0 P2 P3 ld r1, 0(r3)
R2 P3 P2
R3 P7 P1 P2 P4 P4 add r3, r1, #4
R4 P5 <R6> p
R5 P6 <R7> p sub r6, r7, r6
R6 P5 P3 P7 <R3> p add r3, r3, r6
R7 P6 P8 <R1> p
ld r6, 0(r1)
Pn
ROB
use ex op p1 PR1 p2 PR2 Rd LPRd PRd
x ld p P7 r1 P8 P0
x add P0 r3 P7 P1
x sub p P6 p P5 r6 P5 P3
x add P1 P3 r3 P1 P2

149
Physical Register Management
Rename Physical Regs Free List
Table P0 P0
R0 P1 P1
R1 P8 P0 P2 P3 ld r1, 0(r3)
R2 P3 P2
R3 P7 P1 P2 P4 P4 add r3, r1, #4
R4 P5 <R6> p
R5 P6 <R7> p sub r6, r7, r6
R6 P5 P3 P4 P7 <R3> p add r3, r3, r6
R7 P6 P8 <R1> p
ld r6, 0(r1)
Pn
ROB
use ex op p1 PR1 p2 PR2 Rd LPRd PRd
x ld p P7 r1 P8 P0
x add P0 r3 P7 P1
x sub p P6 p P5 r6 P5 P3
x add P1 P3 r3 P1 P2
x ld P0 r6 P3 P4

150
Physical Register Management
Rename Physical Regs Free List
Table P0 <R1> p P0
R0 P1 P1
R1 P8 P0 P2 P3 ld r1, 0(r3)
R2 P3 P2
R3 P7 P1 P2 P4 P4 add r3, r1, #4
R4 P5 <R6> p P8
R5 P6 <R7> p sub r6, r7, r6
R6 P5 P3 P4 P7 <R3> p add r3, r3, r6
R7 P6 P8 <R1> p
ld r6, 0(r1)
Pn
ROB
use ex op p1 PR1 p2 PR2 Rd LPRd PRd Execute &
x x ld p P7 r1 P8 P0 Commit
x add p P0 r3 P7 P1
x sub p P6 p P5 r6 P5 P3
x add P1 P3 r3 P1 P2
x ld p P0 r6 P3 P4

151
Physical Register Management
Rename Physical Regs Free List
Table P0 <R1> p P0
R0 P1 <R3> p P1
R1 P8 P0 P2 P3 ld r1, 0(r3)
R2 P3 P2
R3 P7 P1 P2 P4 P4 add r3, r1, #4
R4 P5 <R6> p P8
R5 P6 <R7> p P7 sub r6, r7, r6
R6 P5 P3 P4 P7 <R3> p add r3, r3, r6
R7 P6 P8
ld r6, 0(r1)
Pn
ROB
use ex op p1 PR1 p2 PR2 Rd LPRd PRd
x x ld p P7 r1 P8 P0 Execute &
x x add p P0 r3 P7 P1 Commit
x sub p P6 p P5 r6 P5 P3
x add p P1 P3 r3 P1 P2
x ld p P0 r6 P3 P4

152
Reorder Buffer Holds
Active Instruction Window
… (Older instructions) Commit …
ld r1, (r3) ld r1, (r3)
add r3, r1, r2 add r3, r1, r2
sub r6, r7, r9 Execute sub r6, r7, r9
add r3, r3, r6 add r3, r3, r6
ld r6, (r1) ld r6, (r1)
add r6, r6, r3 add r6, r6, r3
st r6, (r1) Fetch st r6, (r1)
ld r6, (r1) ld r6, (r1)
(Newer instructions)
… …

Cycle t Cycle t + 1
153
Superscalar Register Renaming
• During decode, instructions allocated new physical destination register
• Source operands renamed to physical register with newest value
• Execution unit only sees physical register numbers

[Figure: both instructions' (Op, Dest, Src1, Src2) fields index the
rename table (read addresses) and the free list in parallel; the
table's write ports record the new mappings, producing
(Op, PDest, PSrc1, PSrc2) for each instruction]
Does this work?


154
Superscalar Register Renaming
[Figure: the same rename datapath, with comparators (=?) between
inst 1's Dest field and inst 2's Src fields]

Must check for RAW hazards between instructions issuing in the same
cycle: if inst 2 reads inst 1's destination, its source must be mapped
to inst 1's newly allocated PDest rather than the stale table entry.
This check can be done in parallel with the rename lookup.
MIPS R10K renames 4 serially-RAW-dependent insts/cycle
155
Memory Dependencies

st r1, (r2)
ld r3, (r4)

When can we execute the load?

156
In-Order Memory Queue
• Execute all loads and stores in program order

=> Load and store cannot leave ROB for execution until
all previous loads and stores have completed
execution

• Can still execute loads and stores speculatively, and


out-of-order with respect to other instructions

157
Conservative O-o-O Load Execution

st r1, (r2)
ld r3, (r4)

• Split execution of store instruction into two phases: address


calculation and data write

• Can execute load before store, if addresses known and r4 != r2

• Each load address compared with addresses of all previous


uncommitted stores (can use partial conservative check i.e.,
bottom 12 bits of address)

• Don’t execute load if any previous store address not known

(MIPS R10K, 16 entry address queue)

158
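The partial, conservative address check above can be sketched as follows. The 12-bit width matches the slide; everything else (names, the `None` convention for a not-yet-computed store address) is illustrative:

```python
PARTIAL_BITS = 12  # compare only the bottom 12 address bits, as on the slide

def load_must_wait(load_addr, older_store_addrs):
    """Conservative check: True if the load may conflict with an older,
    uncommitted store and therefore cannot execute yet."""
    mask = (1 << PARTIAL_BITS) - 1
    for st in older_store_addrs:
        if st is None:                      # store address not yet known
            return True                     # don't execute the load
        if (st & mask) == (load_addr & mask):
            return True                     # partial match: assume conflict
    return False
```

Comparing only low-order bits keeps the comparators small but can delay independent loads whose addresses happen to share those bits.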
Address Speculation
st r1, (r2)
ld r3, (r4)
• Guess that r4 != r2

• Execute load before store address known

• Need to hold all completed but uncommitted load/store addresses in


program order

• If subsequently find r4==r2, squash load and all following


instructions

=> Large penalty for inaccurate address speculation

159
Memory Dependence Prediction
(Alpha 21264)

st r1, (r2)
ld r3, (r4)

• Guess that r4 != r2 and execute load before store

• If later find r4==r2, squash load and all following instructions, but
mark load instruction as store-wait

• Subsequent executions of the same load instruction will wait for


all previous stores to complete

• Periodically clear store-wait bits

160
Speculative Loads / Stores
Just like register updates, stores should not modify
the memory until after the instruction is committed

- A speculative store buffer is a structure introduced to hold


speculative store data.

161
Speculative Store Buffer
[Figure: speculative store buffer alongside the L1 data cache. Each
buffer entry holds a V (valid) bit, an S (speculative) bit, an
address tag, and data. A load presents its address to the buffer and
the cache tags in parallel; committed stores drain to the cache over
the store commit path.]

• On store execute:
– mark entry valid and speculative, and save data and tag of instruction.
• On store commit:
– clear speculative bit and eventually move data to cache
• On store abort:
– clear valid bit

162
Speculative Store Buffer
[Figure: the same speculative store buffer and L1 data cache,
illustrating a load whose address hits in both structures.]

• If data in both store buffer and cache, which should we use:


Speculative store buffer
• If same address in store buffer twice, which should we use:
Youngest store older than load

163
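The two slides above suggest a simple model: stores enter the buffer speculatively, commit drains the oldest entry to the cache, abort drops the youngest, and a load searches the buffer youngest-first before falling back to the cache. A sketch under those assumptions (the class and method names are invented):

```python
class SpeculativeStoreBuffer:
    def __init__(self):
        self.entries = []   # oldest first: [addr, data, speculative]

    def store_execute(self, addr, data):
        self.entries.append([addr, data, True])   # mark valid + speculative

    def store_commit(self):
        # Oldest entry commits: clear speculative bit, drain to cache.
        self.entries[0][2] = False
        return self.entries.pop(0)[:2]            # (addr, data) -> cache

    def store_abort(self):
        self.entries.pop()                        # squash youngest store

    def load_lookup(self, addr, cache):
        # Search youngest-to-oldest so the load sees the newest store
        # older than itself; miss falls through to the cache.
        for a, d, _spec in reversed(self.entries):
            if a == addr:
                return d
        return cache.get(addr)
```

The youngest-first search answers the second question on the slide: with the same address in the buffer twice, the load must take the youngest store that is older than it.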
Datapath: Branch Prediction
and Speculative Execution
[Figure: speculative out-of-order datapath. PC -> Fetch -> Decode &
Rename -> Reorder Buffer -> Commit, with the register file read after
rename. The execution units (Branch Unit, ALU, and MEM feeding the D$
through the Store Buffer) are fed from the reorder buffer. Branch
prediction steers fetch; branch resolution updates the predictors and
sends kill signals to every stage ahead of commit to squash
wrong-path instructions.]
164
Dynamic Branch Prediction Summary

• Prediction becoming important part of scalar execution


• Branch History Table: 2 bits for loop accuracy
• Correlation: Recently executed branches correlated with
next branch.
– Either different branches
– Or different executions of same branches
• Tournament Predictor: more resources to competitive
solutions and pick between them
• Branch Target Buffer: include branch address &
prediction
• Predicated Execution can reduce number of branches,
number of mispredicted branches
• Return address stack for prediction of indirect jump

167
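The "2 bits for loop accuracy" entry refers to a saturating counter per BHT entry: states 0-1 predict not-taken, states 2-3 predict taken. A minimal sketch, not any specific processor's predictor:

```python
def predict(counter):
    """2-bit saturating counter: states 2 and 3 predict taken."""
    return counter >= 2

def update(counter, taken):
    """Move one state toward the observed outcome, saturating at 0 and 3."""
    if taken:
        return min(counter + 1, 3)
    return max(counter - 1, 0)
```

Because two consecutive mispredictions are needed to flip the prediction, a loop-closing branch mispredicts only once per loop execution (at exit), rather than twice as with a single history bit.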
Getting CPI < 1:
Issuing Multiple Instructions/Cycle
• Vector Processing: Explicit coding of independent loops as
operations on large vectors of numbers
– Multimedia instructions being added to many processors
• Superscalar: varying no. instructions/cycle (1 to 8),
scheduled by compiler or by HW (Tomasulo)
– IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4
• (Very) Long Instruction Words (V)LIW:
fixed number of instructions (4-16) scheduled by the
compiler; put ops into wide templates (TBD)
– Intel Architecture-64 (IA-64) 64-bit address
» Renamed: “Explicitly Parallel Instruction Computer (EPIC)”
– Will discuss in a few lectures
• Anticipated success of multiple issue led to reporting
Instructions Per Clock cycle (IPC) instead of CPI

168
Getting CPI < 1: Issuing
Multiple Instructions/Cycle
• Superscalar MIPS: 2 instructions, 1 FP & 1 integer
– Fetch 64-bits/clock cycle; Int on left, FP on right
– Can only issue 2nd instruction if 1st instruction issues
– More ports for FP registers to do FP load & FP op in a pair
Type Pipe Stages
Int. instruction IF ID EX MEM WB
FP instruction IF ID EX MEM WB
Int. instruction IF ID EX MEM WB
FP instruction IF ID EX MEM WB
Int. instruction IF ID EX MEM WB
FP instruction IF ID EX MEM WB
• 1 cycle load delay expands to 3 instructions in SS
– instruction in right half can’t use it, nor instructions in next slot

169
Multiple Issue Issues

• issue packet: group of instructions from fetch unit that


could potentially issue in 1 clock
– If instruction causes structural hazard or a data hazard either due to
earlier instruction in execution or to earlier instruction in issue packet,
then instruction does not issue
– 0 to N instruction issues per clock cycle, for N-issue
• Performing issue checks in 1 cycle could limit clock
cycle time: O(n²) comparisons
– => issue stage usually split and pipelined
– 1st stage decides how many instructions from within this packet can
issue, 2nd stage examines hazards among selected instructions and
those already been issued
– => higher branch penalties => prediction accuracy important

170
Multiple Issue Challenges
• While Integer/FP split is simple for the HW, get CPI of 0.5 only
for programs with:
– Exactly 50% FP operations AND No hazards
• If more instructions issue at same time, greater difficulty of
decode and issue:
– Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2
instructions can issue; (N-issue ~O(N²) comparisons)
– Register file: x-way issue: need 2x reads and 1x writes/cycle
– Rename logic: must be able to rename same register multiple times in one
cycle! For instance, consider 4-way issue:
add r1, r2, r3 add p11, p4, p7
sub r4, r1, r2  sub p22, p11, p4
lw r1, 4(r4) lw p23, 4(p22)
add r5, r1, r2 add p12, p23, p4
Imagine doing this transformation in a single cycle!
– Result buses: Need to complete multiple instructions/cycle
» So, need multiple buses with associated matching logic at every
reservation station.
» Or, need multiple forwarding paths

171
Dynamic Scheduling in Superscalar
The easy way
• How to issue two instructions and keep in-order
instruction issue for Tomasulo?
– Assume 1 integer + 1 floating point
– 1 Tomasulo control for integer, 1 for floating point
• Issue 2X Clock Rate, so that issue remains in order
• Only loads/stores might cause dependency between
integer and FP issue:
– Replace load reservation station with a load queue;
operands must be read in the order they are fetched
– Load checks addresses in Store Queue to avoid RAW violation
– Store checks addresses in Load Queue to avoid WAR,WAW

172
How much to speculate?

• Speculation Pro: execute instructions beyond events that


would otherwise stall the pipeline (control hazards, cache
misses)
• Speculation Con: speculate costly if exceptional event occurs
when speculation was incorrect
• Typical solution: speculation allows only low-cost exceptional
events (1st-level cache miss)
• When expensive exceptional event occurs, (2nd-level cache
miss or TLB miss) processor waits until the instruction causing
event is no longer speculative before handling the event
• Easier with single branch per cycle
– newer processors speculate across multiple branches &
– multiple branches in instruction issue

174
Limits to ILP

• Conflicting studies of amount


– Benchmarks (vectorized Fortran FP vs. integer C programs)
– Hardware sophistication
– Compiler sophistication
• How much ILP is available using existing mechanisms
with increasing HW budgets?
• Do we need to invent new HW/SW mechanisms to keep
on processor performance curve?
– Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints
– Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock
– Motorola AltiVec: 128 bit ints and FPs
– Supersparc Multimedia ops, etc.

175
Limits to ILP
Initial HW Model here; MIPS compilers.
Assumptions for ideal/perfect machine to start:
1. Register renaming – infinite virtual registers
=> all register WAW & WAR hazards are avoided
2. Branch prediction – perfect; no mispredictions
3. Jump prediction – all jumps perfectly predicted
2 & 3 => machine with perfect speculation & an
unbounded buffer of instructions available
4. Memory-address alias analysis – addresses are known
& a store can be moved before a load provided
addresses not equal
Also:
unlimited number of instructions issued/clock cycle;
perfect caches;
1 cycle latency for all instructions (FP *,/);
176
Upper Limit to ILP: Ideal Machine
(H&P-3ed Figure 3.35, page 242)

[Figure: instruction issues per cycle on the ideal machine.
    Integer (18-60):  gcc 54.8, espresso 62.6, li 17.9
    FP (75-150):      fpppp 75.2, doduc 118.7, tomcatv 150.1]

177
More Realistic HW: Branch Impact

[Figure: IPC as the branch predictor is varied — Perfect, Tournament,
BHT (512 entries), Profile-based, No prediction — after changing the
window to 2000 instructions and the maximum issue to 64 instructions
per clock cycle. FP programs stay at roughly 15-45 IPC; integer
programs fall to 6-12 once prediction is less than perfect.]

178
More Realistic HW:
Renaming Register Impact
[Figure: IPC as the number of renaming registers is varied —
Infinite, 256, 128, 64, 32, None — with a 2000-instruction window,
64-instruction issue, and an 8K two-level predictor. FP: 11-45;
integer: 5-15.]

179


More Realistic HW:
Memory Address Alias Impact
[Figure: IPC under four alias-analysis models — Perfect, Global/stack
perfect (heap references conflict), Inspection by compiler/assembler,
None — with a 2000-instruction window, 64-instruction issue, 8K
two-level prediction, and 256 renaming registers. Under perfect
analysis: gcc 10, espresso 15, li 12, fpppp 49, doduc 16, tomcatv 45.
The FP programs (Fortran, no heap) keep most of their 4-45 IPC under
global/stack-perfect analysis; the integer programs drop to 4-9 under
the realistic models.]

180
Realistic HW: Window Impact
[Figure: IPC versus window size — Infinite, 256, 128, 64, 32, 16, 8,
4 — with perfect (HW) memory disambiguation, 1K tournament
prediction, a 16-entry return stack, 64 renaming registers, and issue
of as many instructions as the window allows. Infinite-window IPC:
gcc 10, espresso 15, li 12, fpppp 52, doduc 17, tomcatv 56. At a
32-entry window, FP programs sustain 8-45 and integer programs 6-12;
all programs fall to roughly 3-4 IPC with a 4-entry window.]


181
How to Exceed ILP Limits of this study?

• WAR and WAW hazards through memory: eliminated


WAW and WAR hazards through register renaming,
but not in memory usage
• Unnecessary dependences (compiler not unrolling
loops so iteration variable dependence)
• Overcoming the data flow limit: value prediction,
predicting values and speculating on prediction
– Address value prediction and speculation predicts addresses and
speculates by reordering loads and stores; could provide better
aliasing analysis, only need predict if addresses =

182
Moore's Law & Processor Speed Over
Time

185
Timeline

• 1985-2000: 1000X performance


– Moore’s Law transistors/chip => Moore’s Law for Performance/MPU
• Hennessy: industry been following a roadmap of ideas
known in 1985 to exploit Instruction Level Parallelism and
(real) Moore’s Law to get 1.55X/year
– Caches, Pipelining, Superscalar, Branch Prediction, Out-of-order execution,

• ILP limits: To make performance progress in future need to
have explicit parallelism from programmer vs. implicit
parallelism of ILP exploited by compiler, HW?
– Otherwise drop to old rate of 1.35X per year?
– Less than 1.35X because of processor-memory performance gap?
• Impact on you: if you care about performance,
better think about explicitly parallel algorithms
vs. rely on ILP?
186
Example: IA-32 P6 Microarchitecture
Intel Pentium Pro, II, III
• CISC instruction set (the distinction gets blurry)
• IA-32 instructions translated to internal native micro-
operations (uops)
• out-of-order with register renaming
– 20 reservation stations, 40 entry reorder buffer (early P6)
– 128 virtual registers (Pentium 4)
• branch prediction
• example pipeline (early P6)
– 8 fetch, decode, dispatch
– 1-32 cycles execute, out-of-order
– 3 cycles instruction commit

187
P6 Pipeline: PentiumPro, M, Core Dual
• Note
– translation (renaming) CISC instructions to RISC uops
– out of order execution
– in order graduation for precise interrupts and
correct speculative execution

188
Pentium 4 Integer ALU
• Operates 2x clock rate
(e.g. 3.3GHz processor -> 6.6GHz integer ALU)
• Serialize data-dependent instructions, yet issue and
graduate together
DADD r1,r1,r2
DADD r1,r1,r3 #issued with instruction above

DADD r1,r1,r4
DADD r1,r1,r5 #second cycle, issued with above

189
Instructions to micro-ops

190
Functional Unit Utilization Low
• 5 FUs

• potential for 3
uops to
complete

• see zero
complete ~half
of cycles

191
Endosymbiotic theory
• Eukaryotic cells engulf / are invaded by bacteria
3.2 billion years ago, become organelles (mitochondria,
chloroplasts), forming a symbiotic union
– mitochondria are the powerhouses in our cells

• RISC processors outperform CISC processors


• CISC processors engulf / are invaded by RISC
processors, forming a symbiotic union
– RISC processors are the powerhouses in CISC processors of this
decade
• On-the-fly instruction translation
– Invented by DEC
• No one listens to RISC vs. CISC theologians anymore

192
Head to Head ILP competition
Processor         Microarchitecture                 Fetch/   Functional  Clock  Transistors,    Power
                                                    Issue/   Units       Rate   Die size
                                                    Execute              (GHz)
Intel Pentium 4   Speculative, dynamically          3/3/4    7 int,      3.8    125 M,          115 W
Extreme           scheduled; deeply pipelined; SMT           1 FP               122 mm2
AMD Athlon 64     Speculative, dynamically          3/3/4    6 int,      2.8    114 M,          104 W
FX-57             scheduled                                  3 FP               115 mm2
IBM Power5        Speculative, dynamically          8/4/8    6 int,      1.9    200 M,          80 W
(1 CPU only)      scheduled; SMT;                            2 FP               300 mm2 (est.)  (est.)
                  2 CPU cores/chip
Intel Itanium 2   Statically scheduled,             6/5/11   9 int,      1.6    592 M,          130 W
                  VLIW-style                                 2 FP               423 mm2

193
Performance on SPECint2000

194
Performance on SPECfp2000

195
Normalized Performance: Efficiency
Rank (1 = best)   Itanium 2   Pentium 4   Athlon   Power5
Int/Trans             4           2          1        3
FP/Trans              4           2          1        3
Int/area              4           2          1        3
FP/area               4           2          1        3
Int/Watt              4           3          1        2
FP/Watt               2           4          3        1

196
No Silver Bullet for ILP
• No obvious over all leader in performance
• The AMD Athlon leads on SPECInt performance
followed by the Pentium 4, Itanium 2, and Power5
• Itanium 2 and Power5, which perform similarly on
SPECFP, clearly dominate the Athlon and Pentium 4 on
SPECFP
• Itanium 2 is the most inefficient processor both for Fl.
Pt. and integer code for all but one efficiency measure
(SPECFP/Watt)
• Athlon and Pentium 4 both make good use of transistors
and area in terms of efficiency,
• IBM Power5 is the most effective user of energy on
SPECFP and essentially tied on SPECINT

197
Limits to ILP
• Doubling issue rates above today’s 3-6 instructions per
clock, say to 6 to 12 instructions, probably requires a
processor to
– Issue 3 or 4 data memory accesses per cycle,
– Resolve 2 or 3 branches per cycle,
– Rename and access more than 20 registers per cycle, and
– Fetch 12 to 24 instructions per cycle.
• Complexities of implementing these capabilities likely
means sacrifices in maximum clock rate
– E.g., the widest-issue processor is the Itanium 2, but it also has the slowest
clock rate, despite the fact that it consumes the most power!

198
Limits to ILP
• Most techniques for increasing performance increase power
consumption
• The key question is whether a technique is energy efficient: does
it increase power consumption faster than it increases
performance?
• Multiple issue processors techniques all are energy inefficient:
1. Issuing multiple instructions incurs some overhead in logic that grows
faster than the issue rate grows
2. Growing gap between peak issue rates and sustained performance
• Number of transistors switching = f(peak issue rate), and
performance = f( sustained rate),
growing gap between peak and sustained performance
 increasing energy per unit of performance

199
Recall from Pipelining

• Pipeline CPI = Ideal pipeline CPI + Structural Stalls


+ Data Hazard Stalls + Control Stalls
– Ideal pipeline CPI: measure of the maximum performance
attainable by the implementation
– Structural hazards: HW cannot support this combination of
instructions
– Data hazards: Instruction depends on result of prior instruction still
in the pipeline
– Control hazards: Caused by delay between the fetching of
instructions and decisions about changes in control flow (branches
and jumps)

200
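The pipeline CPI equation above in executable form, with made-up stall rates purely for illustration:

```python
ideal_cpi = 1.0      # best the implementation can attain
structural = 0.05    # structural-hazard stalls per instruction (assumed)
data_hazard = 0.25   # data-hazard stalls per instruction (assumed)
control = 0.20       # control stalls per instruction (assumed)

pipeline_cpi = ideal_cpi + structural + data_hazard + control
ipc = 1.0 / pipeline_cpi   # the reciprocal measure used for superscalars
```

With these numbers the pipeline sustains CPI = 1.5, i.e. IPC of about 0.67 — the stall terms are what the techniques in the table below attack.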
Ideas to Reduce Stalls
Technique Reduces
Dynamic scheduling Data hazard stalls
Dynamic branch Control stalls
prediction
Issuing multiple Ideal CPI
instructions per cycle
Dynamic
Speculation Data and control stalls
Dynamic memory Data hazard stalls involving
disambiguation memory
Loop unrolling Control hazard stalls
Basic compiler pipeline Data hazard stalls
scheduling
Static/ Compiler dependence Ideal CPI and data hazard stalls
Compiler analysis
Software pipelining and Ideal CPI and data hazard stalls
trace scheduling
Compiler speculation Ideal CPI, data and control stalls

201
Review Data Dependence and Hazards
Data Dependence Potential Hardware Hazard

True Dependence RAW read after write


b=a; c=b;
Anti-Dependence WAR write after read
b=a; a=c;
Output Dependence WAW write after write
b=a; b=c;

202
Static Branch Prediction Performance

203
Recall: Branch Impact

[Figure, repeated from slide 178: IPC as the branch predictor is
varied — Perfect, Tournament, BHT (512 entries), Profile-based, No
prediction — with a 2000-instruction window and maximum issue of 64
instructions per clock cycle. FP: 15-45; integer: 6-12.]

204
Software Techniques - Example

• This code, add a scalar to a vector:


for (i=1000; i>0; i=i–1)
x[i] = x[i] + s;
• Assume following latencies for all examples

Instruction Instruction Latency stalls between


producing result using result in cycles in cycles
FP ALU op Another FP ALU op 4 3
FP ALU op Store double 3 2
Load double FP ALU op 1 1
Load double Store double 1 0
Integer op Integer op 1 0

205
FP Loop: Where are the Hazards?
• First translate into MIPS code:
(To simplify, assume 8 is the lowest address)

Loop: L.D F0,0(R1) ;F0=vector element


ADD.D F4,F0,F2 ;add scalar from F2
S.D F4,0(R1) ;store result
DSUBUI R1,R1,8 ;decrement pointer 8B (DW)
BNEZ R1,Loop ;branch R1!=zero
NOP ;delayed branch slot

206
FP Loop Showing Stalls
1 Loop: L.D F0,0(R1) ;F0=vector element
2 stall
3 ADD.D F4,F0,F2 ;add scalar in F2
4 stall
5 stall
6 S.D F4,0(R1) ;store result
7 DSUBUI R1,R1,8 ;decrement pointer 8B (DW)
8 BNEZ R1,Loop ;branch R1!=zero
9 NOP ;delayed branch slot
Instruction Instruction Latency in
producing result using result clock cycles
FP ALU op Another FP ALU op 3
FP ALU op Store double 2
Load double FP ALU op 1

• 9 clock cycles: Rewrite code to minimize stalls?


207
Revised FP Loop Minimizing Stalls
1 Loop: L.D F0,0(R1)
2 stall
3 ADD.D F4,F0,F2
4 DSUBUI R1,R1,8
5 BNEZ R1,Loop ;delayed branch
6 S.D F4,8(R1) ;altered when move past DSUBUI

Swap BNEZ and S.D by changing address of S.D


Instruction Instruction Latency in
producing result using result clock cycles
FP ALU op Another FP ALU op 3
FP ALU op Store double 2
Load double FP ALU op 1

6 clock cycles, but just 3 for execution, 3 for loop overhead; How make faster?

208
Unroll Loop Four Times
(straightforward way)
1 Loop: L.D    F0,0(R1)
          (1 cycle stall)
3       ADD.D  F4,F0,F2
          (2 cycles stall)
6       S.D    F4,0(R1)     ;drop DSUBUI & BNEZ
7       L.D    F6,-8(R1)
9       ADD.D  F8,F6,F2
12      S.D    F8,-8(R1)    ;drop DSUBUI & BNEZ
13      L.D    F10,-16(R1)
15      ADD.D  F12,F10,F2
18      S.D    F12,-16(R1)  ;drop DSUBUI & BNEZ
19      L.D    F14,-24(R1)
21      ADD.D  F16,F14,F2
24      S.D    F16,-24(R1)
25      DSUBUI R1,R1,#32    ;alter to 4*8
26      BNEZ   R1,LOOP
27      NOP

Rewrite loop to minimize stalls?

  27 clock cycles, or 6.8 per iteration


Assumes R1 is multiple of 4
209
Unrolled Loop Detail
• Do not usually know upper bound of loop
• Suppose it is n, and we would like to unroll the loop to
make k copies of the body
• Instead of a single unrolled loop, we generate a pair of
consecutive loops:
– 1st executes (n mod k) times and has a body that is the original loop
– 2nd is the unrolled body surrounded by an outer loop that iterates (n/k)
times
– For large values of n, most of the execution time will be spent in the
unrolled loop

210
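The two-loop decomposition described above can be sketched on a simple sum (a stand-in for the vector loop; the function is illustrative, not compiler output):

```python
def unrolled_sum(x, k=4):
    """Sum a list using an (n mod k) cleanup loop followed by a
    k-way unrolled loop executed n // k times."""
    n = len(x)
    total = 0.0
    i = 0
    # 1st loop: executes (n mod k) times, body = the original loop
    for _ in range(n % k):
        total += x[i]
        i += 1
    # 2nd loop: unrolled body, iterates n // k times
    for _ in range(n // k):
        total += x[i] + x[i + 1] + x[i + 2] + x[i + 3]
        i += k
    return total
```

For large n almost all time is spent in the second, unrolled loop, so the cleanup loop's overhead is negligible.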
Unrolled Loop That Minimizes Stalls
1 Loop:L.D F0,0(R1)
2 L.D F6,-8(R1) • What assumptions made
3 L.D F10,-16(R1) when moved code?
4 L.D F14,-24(R1) – OK to move store past
5 ADD.D F4,F0,F2 DSUBUI even though the
6 ADD.D F8,F6,F2 store changes register
7 ADD.D F12,F10,F2 – OK to move loads before
8 ADD.D F16,F14,F2 stores: get right data?
9 S.D F4,0(R1) – When is it safe for compiler
10 S.D F8,-8(R1) to do such changes?
11 S.D F12,-16(R1)
12 DSUBUI R1,R1,#32
13 BNEZ R1,LOOP
14 S.D 8(R1),F16 ; 8-32 = -24

 14 clock cycles, or 3.5 per iteration

211
Compiler Perspectives on Code Movement
• Compiler concerned about dependencies in program
• Existence of a Hardware hazard depends on pipeline
• Try to schedule to avoid hazards that cause performance losses
• (True) Data dependencies (RAW)
– Instruction i produces a result used by instruction j, or
– Instruction j is data dependent on instruction k, and instruction k is data dependent
on instruction i.
• If dependent, can’t execute in parallel
• Easy to determine for registers (fixed names)
• Hard for memory (“memory disambiguation” problem):
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?

212
Where are the name dependencies?

1 Loop:L.D F0,0(R1)
3 ADD.D F4,F0,F2
6 S.D 0(R1),F4
7 L.D F0,-8(R1)
9 ADD.D F4,F0,F2
12 S.D -8(R1),F4
13 L.D F0,-16(R1)
15 ADD.D F4,F0,F2
18 S.D -16(R1),F4
19 L.D F0,-24(R1)
21 ADD.D F4,F0,F2
24 S.D -24(R1),F4
25 DSUBUI R1,R1,#32
26 BNEZ R1,LOOP
27 NOP

213
Where are the name dependencies?

1 Loop:L.D F0,0(R1)
3 ADD.D F4,F0,F2
6 S.D 0(R1),F4
7 L.D F0,-8(R1)
9 ADD.D F4,F0,F2
12 S.D -8(R1),F4
13 L.D F0,-16(R1)
15 ADD.D F4,F0,F2
18 S.D -16(R1),F4
19 L.D F0,-24(R1)
21 ADD.D F4,F0,F2
24 S.D -24(R1),F4
25 DSUBUI R1,R1,#32
26 BNEZ R1,LOOP
27 NOP

How can we remove them?


214
Where are the name dependencies?

1 Loop:L.D F0,0(R1)
3 ADD.D F4,F0,F2
6 S.D 0(R1),F4
7 L.D F6,-8(R1)
9 ADD.D F8,F6,F2
12 S.D -8(R1),F8
13 L.D F10,-16(R1)
15 ADD.D F12,F10,F2
18 S.D -16(R1),F12
19 L.D F14,-24(R1)
21 ADD.D F16,F14,F2
24 S.D -24(R1),F16
25 DSUBUI R1,R1,#32
26 BNEZ R1,LOOP
27 NOP

The original "register renaming"


215
What if We Could Change the Instruction
Set?
• Superscalar processors decide on the fly how many
instructions to issue
– HW complexity of deciding the number of instructions to issue is O(n²)
• Why not allow compiler to schedule instruction level
parallelism explicitly?
• Format the instructions in a potential issue packet so that
HW need not check explicitly for dependences

216
VLIW: Very Large Instruction Word
• Each “instruction” has explicit coding for multiple
operations
– In IA-64, grouping called a “bundle”
– In Transmeta, grouping called a “molecule” (with “atoms” as ops)
• Tradeoff instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long instruction
word are independent => execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
» 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
– Need compiling technique that schedules across several branches

217
Classic VLIW
• Compiler responsible for instruction scheduling, w.r.t.
instruction latencies
• Insert NOPs where there will be stalls
large code size
• One operation per functional unit
• Any pipeline tweak requires new compilation
• Unroll loops to get ILP

218
Example of a VLIW Architecture: IA-64.
Suggested Reading for Interest

Intel IA-64 Architecture Software


Developer’s Manual, Chapters 8, 9

219
IA-64 Instruction Group

An instruction group is a set of instructions that


have no read after write (RAW) or write after write (WAW)
register dependencies.
Consecutive instruction groups are separated by stops
(represented by a double semicolon ';;' in the assembly code).

ld8 r1=[r5] // First group


sub r6=r8, r9 // First group
add r3=r2,r4 ;; // First group
st8 [r6]=r12 // Second group

220
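A group's legality can be checked mechanically. The sketch below models each instruction as a (dest, sources) pair and flags RAW or WAW register dependences within the group (WAR is permitted); the representation is an assumption for illustration:

```python
def valid_group(instrs):
    """True if no instruction in the group reads or writes a register
    written by an earlier instruction in the same group."""
    written = set()
    for dest, srcs in instrs:
        if any(s in written for s in srcs):
            return False            # RAW inside the group
        if dest in written:
            return False            # WAW inside the group
        written.add(dest)
    return True
```

For the example above, `ld8 r1=[r5]`, `sub r6=r8,r9`, and `add r3=r2,r4` form a legal group, but `st8 [r6]=r12` reads the r6 just written by the sub — hence the stop before it.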
IA64 Instructions & Registers
• 128 registers – 7 bit register address
– 128 × 65-bit integer (64 data bits + 1 NaT bit)
– 128 × 82-bit FP
• 3 operand instructions
• 14 bit opcode
• 6 bit predicate

• 3*7+14+6 = 41 bit instructions

221
Instruction Bundles

Instructions are organized in bundles of three instructions,


with the following format:
  127                87 86                46 45                 5 4        0
 +--------------------+--------------------+--------------------+----------+
 | instruction slot 2 | instruction slot 1 | instruction slot 0 | template |
 +--------------------+--------------------+--------------------+----------+
          41                   41                   41                5

Instruction Type   Description        Execution Unit
A                  Integer ALU        I-unit or M-unit
I                  Non-ALU integer    I-unit
M                  Memory             M-unit
F                  Floating-Point     F-unit
B                  Branch             B-unit
L+X                Extended           I-unit/B-unit

222
Bundles

In assembly, each 128-bit bundle is enclosed in


curly braces and contains a template specification

{ .mii
ld4 r28=[r8] // Load a 4-byte value
add r9=2,r1 // 2+r1 and put in r9
add r30=1,r1 // 1+r1 and put in r30
}

An instruction group can extend over an arbitrary


number of bundles.

223
Templates

There are restrictions on the type of instructions that


can be bundled together. The IA-64 has five slot types
(M, I, F, B, and L), six instruction types (M, I, A, F, B, L),
and twelve basic template types (MII, MI_I, MLX, MMI,
M_MI, MFI, MMF, MIB, MBB, BBB, MMB, and MFB).

The underscore in the bundle acronym indicates


a stop.

Every basic bundle type has two versions: one


with a stop at the end of the bundle and one
without.
224
Control Dependency Preventing Code
Motion

In the code below the ld4 is control dependent on the


branch, and thus cannot be safely moved up in
conventional processor architectures.

add r7=r6,1              // cycle 0
add r13=r25, r27
cmp.eq p1, p2=r12, r23
(p1) br.cond some_label ;;

ld4 r2=[r3] ;;           // cycle 1
sub r4=r2, r11           // cycle 3

[Figure: control-flow graph — block A ends at the branch; the ld4
lives in block B, below the branch.]

225
Control Speculation

In the following code, suppose a load latency of two cycles

(p1) br.cond.dptk L1 // cycle 0


ld8 r3=[r5] ;; // cycle 1
shr r7=r3,r87 // cycle 3

However, if we execute the load before we know if


we actually have to do it (control speculation), we get:
ld8.s r3=[r5] // earlier cycle
// other, unrelated instructions
(p1) br.cond.dptk L1 ;; // cycle 0
chk.s r3, recovery // cycle 1
shr r7=r3,r87 // cycle 1

226
Control Speculation

The ld8.s instruction is a speculative load, and the


chk.s instruction is a check instruction that verifies
if the value loaded is still good.

ld8.s r3=[r5] // earlier cycle


// other, unrelated instructions
(p1) br.cond.dptk L1 ;; // cycle 0
chk.s r3, recovery // cycle 1
shr r7=r3,r87 // cycle 1

227
Ambiguous Memory Dependencies

An ambiguous memory dependency is a dependence


between a load and a store, or between two stores
where it cannot be determined if the instructions
involved access overlapping memory locations.

Two or more memory references are independent


if it is known that they access non-overlapping
memory locations.

228
Data Speculation

An advanced load allows a load to be moved


above a store even if it is not known whether
the load and the store may reference overlapping
memory locations.
st8 [r55]=r45 // cycle 0
ld8 r3=[r5] ;; // cycle 0
shr r7=r3,r87 // cycle 2

ld8.a r3=[r5] ;; // Advanced Load


// other, unrelated instructions
st8 [r55]=r45 // cycle 0
ld8.c r3=[r5] ;; // cycle 0 - check
shr r7=r3,r87 // cycle 0

229
Moving Up Loads + Uses: Recovery Code

st8 [r4] = r12 // cycle 0: ambiguous store


ld8 r6 = [r8] ;; // cycle 0: load to advance
Original Code add r5 = r6,r7 // cycle 2
st8 [r18] = r5 // cycle 3

ld8.a r6 = [r8] ;; // cycle -3


Speculative // other, unrelated instructions
Code add r5 = r6,r7 // cycle -1; add that uses r6
// other, unrelated instructions
st8 [r4]=r12 // cycle 0
chk.a r6, recover // cycle 0: check
back: // Return point from jump to recover
st8 [r18] = r5 // cycle 0

recover:
ld8 r6 = [r8] ;; // Reload r6 from [r8]
add r5 = r6,r7 // Re-execute the add
br back // Jump back to main code

230
ld.c, chk.a and the ALAT

The execution of an advanced load, ld.a, creates an


entry in a hardware structure, the Advanced Load
Address Table (ALAT). This table is indexed by the
register number. Each entry records the load
address, the load type, and the size of the load.

When a check is executed, the entry for the register


is checked to verify that a valid entry with the type
specified is there.

231
ld.c, chk.a and the ALAT

Entries are removed from the ALAT when:

(1) A store overlaps with the memory locations


specified in the ALAT entry;
(2) Another advanced load to the same register
is executed;
(3) There is a context switch caused by the
operating system (or hardware);
(4) Capacity limitation of the ALAT implementation
requires reuse of the entry.

232
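The ALAT rules above — indexing by destination register, removal on an overlapping store — can be sketched as follows; the entry format and sizes are illustrative assumptions, not the hardware's layout:

```python
class ALAT:
    def __init__(self):
        self.entries = {}                 # reg -> (addr, size)

    def advanced_load(self, reg, addr, size=8):
        # ld.a allocates (or replaces) the entry for this register
        self.entries[reg] = (addr, size)

    def store(self, addr, size=8):
        # Removal rule (1): a store overlapping an entry evicts it
        self.entries = {r: (a, s) for r, (a, s) in self.entries.items()
                        if a + s <= addr or addr + size <= a}

    def check(self, reg):
        # chk.a / ld.c: a surviving valid entry means the speculation held
        return reg in self.entries
```

Rules (2)-(4) — replacement by another advanced load, context switches, and capacity eviction — would also remove entries; a removed entry simply forces the recovery path.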
Not a Thing (NaT)

The IA-64 has 128 general purpose registers, each


with 64+1 bits, and 128 floating point registers, each
with 82 bits.
The extra bit in the GPRs is the NaT bit that is used to
indicate that the content of the register is not valid.
NaT=1 indicates that an instruction that generated an
exception wrote to the register. It is a way to defer
exceptions caused by speculative loads.
Any operation that uses NaT as an operand
results in NaT.

233
If-conversion

If-conversion uses predicates to transform a


conditional code into a single control stream code.

Original:                    If-converted:
if(r4) {                     cmp.ne p1, p0=r4, 0 ;;  // set predicate reg
    add r1=r2, r3            (p1) add r1=r2, r3
    ld8 r6=[r5]              (p1) ld8 r6=[r5]
}

if(r1)                       cmp.ne p1, p2=r1, 0 ;;  // set predicate reg
    r2 = r3 + r4             (p1) add r2=r3, r4
else                         (p2) sub r7=r6, r5
    r7 = r6 - r5

234
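Predication in miniature: the if/else example above executed as one straight-line stream, where a false-predicated operation simply has no architectural effect. The register-file-as-dict model is purely illustrative:

```python
def run_predicated(r):
    """Execute the if-converted if/else sequence on register file r."""
    # cmp.ne p1, p2 = r1, 0 ;;  -- set complementary predicates
    p1 = (r['r1'] != 0)
    p2 = not p1
    # Both instructions are always fetched; only the one whose
    # predicate is true updates architectural state.
    if p1:
        r['r2'] = r['r3'] + r['r4']   # (p1) add r2 = r3, r4
    if p2:
        r['r7'] = r['r6'] - r['r5']   # (p2) sub r7 = r6, r5
    return r
```

Either way, control flow is a single stream — the branch (and its potential misprediction) is gone.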
Trace Scheduling
• Two steps:
– Trace Selection
» Find likely sequence of basic blocks (trace)
of (statically predicted or profile predicted)
long sequence of straight-line code
– Trace Compaction
» Squeeze trace into few VLIW instructions
» Need bookkeeping code in case prediction is wrong
• This is a form of compiler-generated speculation
– Compiler must generate recovery code to handle cases in which execution does
not go according to speculation.
– Needs extra registers: undo bad guesses by discarding unused results
• Subtle compiler bugs may result in wrong answer:
no hardware speculation

235
Superscalar v. VLIW
Superscalar:
• Smaller code size
• Binary compatibility across generations of hardware

VLIW:
• Simplified hardware for decoding, issuing instructions
• No interlock hardware (compiler checks?)
• More registers, but simplified hardware for register ports
  (multiple independent register files?)

236
Problems with First Generation VLIW

• Increase in code size


– generating enough operations in a straight-line code fragment
requires ambitiously unrolling loops
– whenever VLIW instructions are not full, unused functional units
translate to wasted bits in instruction encoding
• Operated in lock-step; no hazard detection HW
– a stall in any functional unit pipeline caused entire processor to stall,
since all functional units must be kept synchronized
– Compiler can predict functional unit latencies, but cache misses are hard to predict
• Binary code compatibility
– Pure VLIW => different numbers of functional units and unit
latencies require different versions of the code

237
Intel/HP IA-64 “Explicitly Parallel
Instruction Computer (EPIC)”
• IA-64: instruction set architecture; EPIC is type
– EPIC = 2nd generation VLIW?
• Itanium™ is name of first implementation (2001)
– Highly parallel and deeply pipelined hardware at 800Mhz
– 6-wide, 10-stage pipeline at 800Mhz on 0.18 µ process
• 128 64-bit integer registers + 128 82-bit floating point registers
– Not separate register files per functional unit as in old VLIW
• Hardware checks dependencies
(interlocks => binary compatibility over time)
• Predicated execution (select 1 out of 64 1-bit flags)
=> 40% fewer mispredictions?

238
3rd Generation Itanium
ISSCC abstract: 14.4 A 1.5GHz Third Generation Itanium® Processor.
J. Stinson, S. Rusu (Intel, Santa Clara, CA)
A third-generation 1.5GHz Itanium® processor implements the Explicitly Parallel
Instruction Computing (EPIC) architecture and features an on-die 6MB, 24-way
set associative L3 cache. The 374mm2 die contains 410M transistors and is
implemented in a dual-VT 0.13µm technology using 6-level Cu interconnects
with FSG dielectric and dissipates 130W.

• 1.5 GHz
• 410 million transistors
• 6MB 24-way set associative L3 cache
• 6-level copper interconnect, 0.13 micron
• 130W (i.e. lasts 17s on an AA NiCd)

239
Comments on Itanium
• Remarkably, the Itanium has many of the features
more commonly associated with the dynamically-
scheduled pipelines
– strong emphasis on branch prediction, register renaming,
scoreboarding, a deep pipeline with many stages before
execution (to handle instruction alignment, renaming, etc.), and
several stages following execution to handle exception detection
• Surprising that an approach whose goal is to rely on
compiler technology and simpler HW seems to be at
least as complex as dynamically scheduled
processors!

240
Performance of IA-64 Itanium
(Source: Microprocessor Report Jan 2002)
• ITANIUM (800 MHz):
• SPECint2000(base): 358
• SPECfp2000(base): 703
• POWER4 (1.3 GHz):
• SPECint2000(base): 790
• SPECfp2000(base): 1,098
• SUN UltraSPARC III (1.05 GHz)
• SPECint2000(base): 537
• SPECfp2000(base): 701

241
Summary#1: Hardware versus Software
Speculation Mechanisms

• To speculate extensively, must be able to disambiguate memory references
– Much easier in HW than in SW for code with pointers
• HW-based speculation works better when control flow
is unpredictable, and when HW-based branch
prediction is superior to SW-based branch prediction
done at compile time
– Mispredictions mean wasted speculation
• HW-based speculation maintains precise exception
model even for speculated instructions
• HW-based speculation does not require compensation
or bookkeeping code
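The pointer problem can be made concrete with a tiny C sketch (illustrative): the compiler cannot hoist the load above the store unless it proves p and q never alias, while hardware simply compares the run-time addresses.

```c
/* Whether the load of *q may be reordered above the store to *p
 * depends on aliasing that is invisible at compile time when p and q
 * arrive via pointer arithmetic or separate compilation units. */
int store_then_load(int *p, int *q) {
    *p = 10;       /* store through p */
    return *q;     /* load through q: result depends on whether p == q */
}
```

Calling `store_then_load(&x, &x)` yields 10, while with distinct variables the load sees the old value, so no static reordering is safe without an aliasing proof.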

242
Summary#2: Hardware versus Software
Speculation Mechanisms cont’d
• Compiler-based approaches may benefit from the
ability to see further in the code sequence, resulting in
better code scheduling
• HW-based speculation with dynamic scheduling does
not require different code sequences to achieve good
performance for different implementations of an
architecture
– may be the most important in the long run?

243
Summary #3: Software Scheduling

• Instruction Level Parallelism (ILP) found either by compiler or hardware.
• Loop level parallelism is easiest to see
– SW dependencies/compiler sophistication determine if compiler can unroll
loops
– Memory dependencies hardest to determine => Memory disambiguation
– Very sophisticated transformations available
• Trace Scheduling to Parallelize If statements
• Superscalar and VLIW: CPI < 1 (IPC > 1)
– Dynamic issue vs. Static issue
– More instructions issue at same time => larger hazard penalty
– Limitation is often number of instructions that you can successfully fetch and
decode per cycle

244
Recall Forms of Parallelism
• Pipelining (Appendix A)
• ILP – Dynamically Scheduled
• ILP – Statically Scheduled (involving the compiler)
• Throughput parallelism (involving the application programmer or user)
– multiple independent processes
• Thread-level parallelism
– usually specified by programmer, sometimes smart compilers

245
ILP
• Limits to ILP (power efficiency, compilers, dependencies …) seem to limit practical designs to 3 to 6 issue
• Explicitly parallel (Data level parallelism or Thread level parallelism) is next step to performance
• Coarse grained vs. Fine grained multithreading
– Only on big stall vs. every clock cycle
• Simultaneous Multithreading: fine grained multithreading based on OOO superscalar microarchitecture
– Instead of replicating registers, reuse rename registers
• Balance of ILP and TLP decided in marketplace
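The coarse- vs. fine-grained distinction can be sketched with a toy cycle-by-cycle model in C (all structures and names here are invented for illustration):

```c
/* Toy model: a fine-grained multithreaded core switches thread
 * contexts every cycle, so while one thread is stalled (e.g. on a
 * cache miss) the other threads' instructions keep the pipeline busy. */
typedef struct {
    int remaining;    /* instructions left to issue */
    int stall_until;  /* cycle before which this thread cannot issue */
} thread_ctx;

/* Returns how many of the simulated cycles actually issued work. */
int busy_cycles(thread_ctx *t, int nthreads, int total_cycles) {
    int busy = 0;
    for (int cycle = 0; cycle < total_cycles; cycle++) {
        thread_ctx *c = &t[cycle % nthreads];  /* round-robin each cycle */
        if (c->remaining > 0 && cycle >= c->stall_until) {
            c->remaining--;
            busy++;
        }
    }
    return busy;
}
```

With two ready threads every cycle issues; if one thread is stalled for the whole window, the other still fills half the cycles, which is the utilization argument for multithreading.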

246
Commentary
• Itanium architecture does not represent a significant
breakthrough in scaling ILP or in avoiding the problems of
complexity and power consumption
• Instead of pursuing more ILP, architects are increasingly
focusing on TLP implemented with single-chip multiprocessors
• In 2000, IBM announced the 1st commercial single-chip,
general-purpose multiprocessor, the Power4, which contains 2
Power3 processors and an integrated L2 cache
– Since then, Sun Microsystems, AMD, and Intel have switched to a focus on
single-chip multiprocessors rather than more aggressive uniprocessors.
• Right balance of ILP and TLP is unclear today
– Perhaps right choice for server market, which can exploit more TLP, may
differ from desktop, where single-thread performance may continue to be a
primary requirement

247
And in conclusion …
• Limits to ILP (power efficiency, compilers, dependencies
…) seem to limit to 3 to 6 issue for practical options
• Explicitly parallel (Data level parallelism or Thread level
parallelism) is next step to performance
• Coarse grain vs. Fine grained multithreading
– Only on big stall vs. every clock cycle
• Simultaneous Multithreading: fine grained multithreading based on
OOO superscalar microarchitecture
– Instead of replicating registers, reuse rename registers
• Itanium/EPIC/VLIW is not a breakthrough in ILP
• Balance of ILP and TLP unclear in marketplace

248
Knowing When to Give Up
The more instruction level parallelism you use, the more
you pay for each incremental gain in performance

Si cost and power increase with the square of parallelism

249
Beyond ILP
Thread Level Parallelism
• Diminishing returns for finding ILP in code that was
designed to be sequential

• Use functional units for more than one process/program – what do we need?

– multiple program counters


– multiple processors, or virtual processors in the same CPU
– multiple register files or register renaming that is aware of multiple
threads & keeps hazards independent
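A minimal sketch of thread-level parallelism using POSIX threads (function and variable names are illustrative): each thread carries its own program counter and stack, which is exactly the per-thread state a multithreaded core must replicate or rename.

```c
#include <pthread.h>
#include <stddef.h>

/* Each thread independently sums half of the range [0, 1000); the two
 * instruction streams share no data hazards, so an SMT or multicore
 * CPU can keep its functional units busy with both at once. */
#define N 1000

static long partial[2];

static void *sum_half(void *arg) {
    long id = (long)(size_t)arg;
    long s = 0;
    for (long i = id * (N / 2); i < (id + 1) * (N / 2); i++)
        s += i;
    partial[id] = s;
    return NULL;
}

long parallel_sum(void) {
    pthread_t t[2];
    for (long id = 0; id < 2; id++)
        pthread_create(&t[id], NULL, sum_half, (void *)(size_t)id);
    for (int id = 0; id < 2; id++)
        pthread_join(t[id], NULL);
    return partial[0] + partial[1];   /* 0 + 1 + ... + 999 */
}
```

The hardware requirements in the bullets above map directly onto this: two program counters (one per thread), and register state that the core must keep separate per thread.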

250
TLP Architectures
• Tera Computer
– Unique instruction set
– Compute engine (supercomputer)
– Many threads ~1000
– No data cache, just get on with another thread when one thread stalls for a
memory access
• Intel Xeon, Pentium 4 Hyperthreading (HT)
– IA-32 instruction set
– Two threads, ~40% utilization increase
Xeon:
– Server applications
– Support for cache coherence to build parallel processors systems
– L1 uop trace cache 8KB, L2 512KB, L3 2MB on chip (max)
• Eleven Engineering XiNc (8 hardware thread comm processor)

251
