Advanced Computer Architectures
Lecture 6: Pipeline Hazards and Their Resolution Mechanisms

Mr. Bhabani Shankar Prasad Mishra, SCHOOL OF TECHNOLOGY
KIIT UNIVERSITY
BHUBANESWAR

Module Objectives
Hazards, their causes, and resolution
Branch prediction
Exploiting loop-level parallelism
Dynamic instruction scheduling: scoreboarding and Tomasulo's algorithm
Compiler techniques for exposing ILP
Superscalar and VLIW processing
Survey of some modern processors

Introduction
What is ILP (Instruction-Level Parallelism)?
Parallel execution of different instructions belonging to the same thread.
A thread usually consists of several basic blocks, as well as several branches and loops.
Basic block:
A sequence of instructions not containing a branch instruction.

Introduction (cont.)
Instruction pipelines can effectively exploit parallelism in a basic block:
An n-stage pipeline can improve performance up to n times.
Does not require much investment in hardware.
Transparent to the programmers.
Pipelining can be viewed to:
Decrease average CPI, and/or
Decrease clock cycle time for instructions.

Drags on Pipeline Performance
Factors that can degrade pipeline performance:
Unbalanced stages
Pipeline overheads
Clock skew
Hazards
Hazards cause the worst drag on the performance of a pipeline.

Pipeline Hazards
What is a pipeline hazard?
A situation that prevents an instruction from executing during its designated clock cycle.
There are 3 classes of hazards:
Structural Hazards
Data Hazards
Control Hazards

Structural Hazards
Arise from resource conflicts among instructions executing concurrently:
The same resource is required by two (or more) concurrently executing instructions at the same time.
Easy way to avoid structural hazards:
Duplicate resources (sometimes not practical).
Examples of Resolution of Structural Hazard:
An ALU to perform arithmetic operations and a separate adder to increment the PC.
Separate data cache and instruction cache accessed simultaneously in the same cycle.
Memory interleaving (lower and higher order).

Structural Hazard: Example
IF   ID   EXE  MEM  WB
     IF   ID   EXE  MEM  WB
          IF   ID   EXE  MEM  WB
               IF   ID   EXE  MEM  WB

An Example of a Structural Hazard
[Figure: pipeline timing diagram over time for Load followed by Instructions 1 to 4, each passing through the Mem, Reg, ALU, DM, Reg stages.]
Would there be a hazard here?

How is it Resolved?
[Figure: the same timing diagram for Load, Instruction 1, Instruction 2, a Stall (a row of bubbles), and Instruction 3.]
A pipeline can be stalled by inserting a bubble or NOP.

Performance with Stalls
Stalls degrade performance of a pipeline:
Result in deviation from 1 instruction executing per clock cycle.
Let's examine by how much stalls can impact CPI.

Stalls and Performance
$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall cycles per instruction} = 1 + \text{Pipeline stall cycles per instruction}$$
Ignoring overhead and assuming stages are balanced:
$$\text{Speedup} = \frac{\text{CPI}_{\text{unpipelined}}}{1 + \text{Pipeline stall cycles per instruction}}$$

Speedup Due to Pipelining
$$\text{Speedup} = \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \text{Pipeline depth}$$

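Alternate Speedup Expression
$$\text{Clock cycle}_{\text{pipelined}} = \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Pipeline depth}}$$
$$\text{Pipeline depth} = \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}$$
$$\text{Speedup from pipelining} = \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}$$
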
An Example of Performance Impact of Structural Hazard
Assume:
Pipelined processor.
Data references constitute 40% of an instruction mix.
Ideal CPI of the pipelined machine is 1.
Consider two cases:
Unified data and instruction cache vs. separate data and instruction cache.
What is the impact on performance?

An Example (cont.)
Avg. Instr. Time = CPI x Clock Cycle Time
(i) For separate cache:
Avg. Instr. Time = 1 x 1 = 1
(ii) For unified cache, where each data reference costs one stall cycle:
Avg. Instr. Time = (1 + 0.4 x 1) x Clock cycle time_ideal = 1.4 x Clock cycle time_ideal = 1.4
Speedup = 1/1.4 = 0.7 (approximately)
About 30% degradation in performance.

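The unified-versus-separate cache comparison can be checked with a few lines of arithmetic. This is a minimal sketch; the function name and its parameters are illustrative, and it assumes, as the example does, an ideal CPI of 1 and one stall cycle per data reference with a unified cache.

```python
def average_instruction_time(ideal_cpi, data_ref_fraction,
                             stall_cycles_per_data_ref, clock_cycle_time=1.0):
    """Average instruction time = CPI x clock cycle time (hypothetical helper)."""
    cpi = ideal_cpi + data_ref_fraction * stall_cycles_per_data_ref
    return cpi * clock_cycle_time

# Separate caches: instruction fetch and data access never conflict.
separate = average_instruction_time(1.0, 0.4, 0)
# Unified cache: each data reference (40% of instructions) stalls one cycle.
unified = average_instruction_time(1.0, 0.4, 1)

relative_speed = separate / unified  # about 0.71
print(separate, unified, round(relative_speed, 2))
```

The ratio is closer to 0.71 than 0.7, so the degradation is slightly under 30%.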
Data Hazards
Occur when an instruction under execution depends on:
Data from an instruction ahead in the pipeline.
Example:
A=B+C;   IF   ID   EXE  MEM  WB
D=A+E;        IF   ID   EXE  MEM  WB
Dependent instruction uses old data:
Results in wrong computations.

Types of Data Hazards
Data hazards are of three types:
Read After Write (RAW)
Write After Read (WAR)
Write After Write (WAW)
Assume instruction i is issued before j.
With an in-order execution machine:
WAW and WAR hazards cannot occur.

Read after Write (RAW) Hazards
A hazard between two instructions i and j may occur when j attempts to read some data object that is modified by i.
Instruction j tries to read its operand before instruction i writes it.
j would incorrectly receive an old or incorrect value.
Example:
i: ADD R1, R2, R3 (i is a write instruction issued before j)
j: SUB R4, R1, R6 (j is a read instruction issued after i)

Read after Write (RAW) Hazards
[Figure: Instn. I writes and Instn. J reads. D(.) denotes the set of objects an instruction reads and R(.) the set it writes.]
$R(I) \cap D(J) \neq \emptyset$ for RAW

RAW Dependency: More Examples
Example program (a):
i1: load r1, addr;
i2: add r2, r1, r1;
Example program (b):
i1: mul r1, r4, r5;
i2: add r2, r1, r1;
In both cases, i2 does not get its operand until i1 has completed writing the result.
In (a) this is due to load-use dependency.
In (b) this is due to define-use dependency.

Write after Read (WAR) Hazards
A hazard may occur when j attempts to modify (write) some data object that is going to be read by i.
Instruction j tries to write its destination before instruction i reads it.
i would incorrectly receive a new or incorrect value.
Example:
i: ADD R1, R2, R3 (i is a read instruction issued before j)
j: SUB R2, R4, R6 (j is a write instruction issued after i)

Write after Read (WAR) Hazards
[Figure: Instn. I reads and Instn. J writes. D(.) denotes the set of objects an instruction reads and R(.) the set it writes.]
$D(I) \cap R(J) \neq \emptyset$ for WAR

Write After Write (WAW) Hazards
WAW hazard:
Both i and j want to modify the same data object.
Instruction j tries to write an operand before instruction i writes it.
Writes are performed in the wrong order.
Example:
i: DIV F1, F2, F3 (i is a write instruction issued before j)
j: SUB F1, F4, F6 (j is a write instruction issued after i)
(How can this happen???)

Write After Write (WAW) Hazards
[Figure: Instn. I writes and Instn. J writes. R(.) denotes the set of objects an instruction writes.]
$R(I) \cap R(J) \neq \emptyset$ for WAW

WAR and WAW Dependency: More Examples
Example program (a):
i1: mul r1, r2, r3;
i2: add r2, r4, r5;
Example program (b):
i1: mul r1, r2, r3;
i2: add r1, r4, r5;
Both cases have a dependence between i1 and i2:
In (a) r2 must be read before it is written into.
In (b) r1 must be written by i2 after it has been written into by i1.

Inter-Instruction Dependences
Data dependence, Read-after-Write (RAW):
r3 ← r1 op r2
r5 ← r3 op r4
Anti-dependence, Write-after-Read (WAR), a false dependency:
r3 ← r1 op r2
r1 ← r4 op r5
Output dependence, Write-after-Write (WAW), a false dependency:
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7
Control dependence

Data Dependencies: Summary
Data dependencies in straight-line code:
RAW: Read After Write dependency (flow dependency)
True dependency. Cannot be eliminated.
Two forms: load-use dependency and define-use dependency.
WAR: Write After Read dependency (anti dependency)
False dependency. Can be eliminated by register renaming.
WAW: Write After Write dependency (output dependency)
False dependency. Can be eliminated by register renaming.

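The three dependence types can be detected mechanically by comparing the destination and source registers of an earlier instruction i and a later instruction j. The sketch below is illustrative only: the `Instr` record and `classify` function are hypothetical names, and each instruction is assumed to have one destination register and a tuple of source registers.

```python
from typing import NamedTuple, Set, Tuple

class Instr(NamedTuple):
    """Hypothetical instruction record: one destination, several sources."""
    dest: str
    srcs: Tuple[str, ...]

def classify(i: Instr, j: Instr) -> Set[str]:
    """Return the dependences of later instruction j on earlier instruction i."""
    kinds = set()
    if i.dest in j.srcs:   # j reads what i writes
        kinds.add("RAW")
    if j.dest in i.srcs:   # j writes what i reads
        kinds.add("WAR")
    if i.dest == j.dest:   # both write the same register
        kinds.add("WAW")
    return kinds

# The register examples used on the RAW, WAR and WAW slides.
raw = classify(Instr("R1", ("R2", "R3")), Instr("R4", ("R1", "R6")))
war = classify(Instr("R1", ("R2", "R3")), Instr("R2", ("R4", "R6")))
waw = classify(Instr("F1", ("F2", "F3")), Instr("F1", ("F4", "F6")))
print(raw, war, waw)
```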
Solutions to Data Hazard
Operand forwarding
By S/W (NOP)
Reordering the instructions

Recollect Data Hazards
What causes them?
Pipelining changes the order of read/write accesses to operands.
The order differs from that of an unpipelined machine.
Example:
ADD R1, R2, R3
SUB R4, R1, R5
For MIPS, ADD writes the register in WB but SUB needs it in ID.
This is a data hazard.

Illustration of a Data Hazard
[Figure: pipeline timing diagram over time (Mem, Reg, ALU, DM, Reg stages) for the sequence below.]
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
The ADD instruction causes a hazard in the next 3 instructions because the register is not written until after those 3 read it.

Forwarding
Simplest solution to a data hazard: forwarding.
The result of the ADD instruction is not really needed until after ADD actually produces it.
Can we move the result from the EX/MEM register to the beginning of the ALU (where SUB needs it)?
Yes!

Forwarding (cont.)
Generally speaking:
Forwarding occurs when a result is passed directly to the functional unit that requires it.
The result goes from the output of one pipeline stage to the input of another.

Forwarding Technique
[Figure: forwarding path between the latch of the EXECUTE (ALU) stage and the latch of the WRITE RESULT stage.]

When Can We Forward?
[Figure: pipeline timing diagram over time (Mem, Reg, ALU, DM, Reg stages) for the sequence below, with forwarding paths drawn.]
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
SUB gets info. from the EX/MEM pipe register.
AND gets info. from the MEM/WB pipe register.
OR gets info. by forwarding from the register file.
If the line goes forward, you can do forwarding. If it is drawn backward, it is physically impossible.

Handling data hazard by S/W
The compiler introduces NOPs in between two dependent instructions.
NOP = an instruction that does nothing and only keeps a gap between two instructions.
Detection of the dependency is left entirely to the S/W.
Advantage: it leads to an easy technique called instruction reordering.

Instruction Reordering
Before:
ADD R1, R2, R3
SUB R4, R1, R5
XOR R8, R6, R7
AND R9, R10, R11
After:
ADD R1, R2, R3
XOR R8, R6, R7
AND R9, R10, R11
SUB R4, R1, R5

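A reordering is only safe if every pair of dependent instructions keeps its original relative order. The following is a rough sketch of such a check; the tuple encoding `(opcode, dest, sources)` and both function names are assumptions made for illustration, and the instructions are assumed to be distinct.

```python
def depends(earlier, later):
    """True if `later` has a RAW, WAR or WAW dependence on `earlier`."""
    _, e_dest, e_srcs = earlier
    _, l_dest, l_srcs = later
    return e_dest in l_srcs or l_dest in e_srcs or e_dest == l_dest

def is_legal_reordering(before, after):
    """Check that `after` is a permutation of `before` preserving dependences."""
    if sorted(before) != sorted(after):
        return False
    position = {ins: k for k, ins in enumerate(after)}
    for a in range(len(before)):
        for b in range(a + 1, len(before)):
            first, second = before[a], before[b]
            if depends(first, second) and position[first] > position[second]:
                return False
    return True

ADD = ("ADD", "R1", ("R2", "R3"))
SUB = ("SUB", "R4", ("R1", "R5"))
XOR = ("XOR", "R8", ("R6", "R7"))
AND = ("AND", "R9", ("R10", "R11"))

before = [ADD, SUB, XOR, AND]
after = [ADD, XOR, AND, SUB]
legal = is_legal_reordering(before, after)
# Distance between producer (ADD) and consumer (SUB) grows from 1 to 3.
gap_before = before.index(SUB) - before.index(ADD)
gap_after = after.index(SUB) - after.index(ADD)
print(legal, gap_before, gap_after)
```

Moving the independent XOR and AND between ADD and SUB fills the slots that NOPs would otherwise occupy.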
Control Hazards
Result from branches and other instructions that change the flow of a program (i.e. change the PC).
Example:
1: if(cond){
2:   s1}
3: s2
The statement in line 2 is control dependent on the statement at line 1.
Until condition evaluation completes:
It is not known whether s1 or s2 will execute next.

Can You Identify Control Dependencies?
1: if(cond1){
2:   s1;
3:   if(cond2){
4:     s2;}
5: }

Solutions to Branch Hazards
Three simplest methods of dealing with branches:
Flush Pipeline:
Redo the instructions following a branch, once an instruction is detected to be a branch during the ID stage.
Branch Not Taken:
A slightly higher performance scheme is to assume every branch to be not taken.
Another scheme is delayed branch.

An Example of Impact of Branch Penalty
Assume for a MIPS pipeline:
16% of all instructions are branches:
4% unconditional branches: 3 cycle penalty
12% conditional: 50% taken: 3 cycle penalty

Impact of Branch Penalty
For a sequence of N instructions:
N cycles to initiate each
3 * 0.04 * N delays due to unconditional branches
0.5 * 3 * 0.12 * N delays due to conditional taken
Total = N + 0.12*N + 0.18*N = 1.3*N cycles
Overall CPI = 1.3 cycles/instruction
30% Performance Hit!!!
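The branch penalty arithmetic above can be reproduced directly. This is a small sketch; the function and parameter names are illustrative, and it assumes an ideal CPI of 1 with the same penalty for unconditional and taken conditional branches.

```python
def cpi_with_branches(uncond_fraction, cond_fraction, taken_fraction,
                      penalty_cycles, ideal_cpi=1.0):
    """CPI including branch stall cycles (hypothetical helper)."""
    uncond_stalls = uncond_fraction * penalty_cycles
    cond_stalls = cond_fraction * taken_fraction * penalty_cycles
    return ideal_cpi + uncond_stalls + cond_stalls

# 4% unconditional, 12% conditional of which 50% are taken, 3 cycle penalty.
overall_cpi = cpi_with_branches(0.04, 0.12, 0.5, 3)
print(round(overall_cpi, 2))
```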

Reducing Branch Penalty
Two approaches:
1) Move condition comparator to ID stage:
Decide branch outcome and target address in the ID stage itself:
Reduces branch delay to 2 cycles.
2) Branch prediction

Four Simple Branch Hazard Solutions
#1: Stall
Stall until the branch direction is clear, flushing the pipe.
#2: Predict Branch Not Taken
Execute successor instructions in sequence as if there is no branch.
Undo instructions in the pipeline if the branch is actually taken.
47% of branches are not taken on average.

Four Simple Branch Hazard Solutions (cont.)
#3: Predict Branch Taken
53% of branches are taken on average.
But the branch target address is not available after IF in MIPS:
MIPS still incurs a 1 cycle branch penalty even with predict taken.
Other machines: branch target known before branch outcome is computed, so significant benefits can accrue.

Four Simple Branch Hazard Solutions (cont.)
#4: Delayed Branch
Insert unrelated successors in the branch delay slots (branch delay of length n):
branch instruction
sequential successor 1
sequential successor 2
........
sequential successor n
branch target if taken
1 slot delay required in a 5 stage pipeline.

Delayed Branch
Simple idea: Put an instruction that would be executed anyway right after a branch.
Branch                         IF  ID  EX  MEM WB
Delayed slot instruction           IF  ID  EX  MEM WB   (delay slot)
Branch target OR successor             IF  ID  EX  MEM WB
Question: What instruction do we put in the delay slot?
Answer: one that can safely be executed no matter what the branch does.
The compiler decides this.

Delayed Branch
One possibility: An instruction from before
Example:
Before:
DADD R1, R2, R3
if R2 == 0 then
  delay slot
. . .
After:
if R2 == 0 then
  DADD R1, R2, R3
. . .
The DADD instruction is executed no matter what happens in the branch:
Because it is executed before the branch!
Therefore, it can be moved.

Delayed Branch
We get to execute the DADD instruction for free.
branch                     IF  ID  EX  MEM WB
add instruction                IF  ID  EX  MEM WB
branch target/successor            IF  ID  EX  MEM WB
By the time the branch target/successor is fetched, we know whether or not to take the branch.

Delayed Branch
Another possibility: An instruction from much before
Example:
Before:
DSUB R4, R5, R6
...
DADD R1, R2, R3
if R1 == 0 then
  delay slot
After:
DSUB R4, R5, R6
...
DADD R1, R2, R3
if R1 == 0 then
  DSUB R4, R5, R6
The DSUB instruction can be replicated into the delay slot, and the branch target can be changed.

Delayed Branch
Third possibility: An instruction from inside the taken path
Example:
Before:
DADD R1, R2, R3
if R1 == 0 then
  delay slot
OR R7, R8, R9
DSUB R4, R5, R6
After:
DADD R1, R2, R3
if R1 == 0 then
  OR R7, R8, R9
DSUB R4, R5, R6
The OR instruction can be moved into the delay slot ONLY IF its execution doesn't disrupt the program execution (e.g., R7 is overwritten later).

Delayed Branch Example
B1:
LD R1,0(R2)
DSUBU R1,R1,R3
BEQZ R1,L
B2 (fall through, R1 != 0):
OR R4,R5,R6
DADDU R10,R4,R3
B3 (branch target, R1 == 0):
L: DADDU R7,R8,R9
1.) BEQZ is dependent on DSUBU, and DSUBU on LD.
2.) If we knew that the branch was taken with a high probability, then the DADDU in B3 could be moved into block B1, since it doesn't have any dependencies with block B2.
3.) Conversely, knowing the branch was not taken, OR could be moved into block B1, since it doesn't affect anything in B3.

Delayed Branch
Where to get instructions to fill branch delay slots?
Before the branch instruction.
From the target address: Useful only if branch taken.
From fall through: Useful only if branch not taken.

Delayed Branch (cont.)
Compiler effectiveness for a single branch delay slot:
Fills about 60% of branch delay slots.
About 80% of instructions executed in branch delay slots are useful in computation.
About (60% x 80%) = 48%, i.e. roughly 50%, of slots are usefully filled.
Delayed Branch downside: what if multiple instructions are issued per clock cycle (superscalar)?

Performance of branch with Stalls
Stalls degrade performance of a pipeline:
Result in deviation from 1 instruction executing per clock cycle.
Let's examine by how much stalls can impact CPI.

Stalls and Performance with branch
$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall cycles per instruction} = 1 + \text{Pipeline stall cycles per instruction}$$

Performance of branch instn
$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles from branches}}$$
$$\text{Pipeline stall cycles from branches} = \text{Branch frequency} \times \text{Branch penalty}$$
$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$

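To get a feel for the speedup formula, it can be evaluated for a few operating points. The sketch below uses a hypothetical function name and made-up inputs (depth 5, branch frequency 0.2, penalty 1 cycle); none of these numbers come from the slides.

```python
def pipeline_speedup(depth, branch_frequency, branch_penalty):
    """Pipeline speedup = depth / (1 + branch frequency x branch penalty)."""
    return depth / (1 + branch_frequency * branch_penalty)

# Hypothetical operating points for a 5-stage pipeline.
no_branches = pipeline_speedup(5, 0.0, 1)    # ideal case
with_branches = pipeline_speedup(5, 0.2, 1)  # 20% branches, 1 cycle penalty
print(no_branches, round(with_branches, 3))
```

With no branch stalls the speedup equals the pipeline depth, and any nonzero branch penalty pulls it below that bound.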
Program Dependences Can Cause Hazards!
Hazards can be caused by dependences within a program.
There are three main types of dependences in a program:
Data dependence
Name dependence
Control dependence

Data Dependences
An instruction j is data dependent on instruction i, if either of:
Direct: Instruction i produces a result that is used by instruction j.
r3 ← r1 op r2
r5 ← r3 op r4
Transitive: Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
r3 ← r1 op r2
r4 ← r3 op r2
r5 ← r6 op r4

Detecting Data Dependences
A data value may flow between instructions:
(i) through registers
(ii) through memory locations.
When data flow is through a register:
Detection is rather straightforward.
When data flow is through a memory location:
Detection is difficult.
Two addresses may refer to the same memory location but look different, e.g. 100(R4) and 20(R6).

Types of Data Dependences
Two types of data dependences:
True data dependence.
Name dependence:
Two instructions use the same register or memory location (called a name).
There is no true flow of data between the two instructions.
Example: A=B+C; A=P+Q;
Two types of name dependences:
Anti-dependence
Output dependence

Anti-Dependence or (WAR)
Anti-dependence occurs between two instructions i and j, iff:
j writes to a register or memory location that i reads.
Original ordering must be preserved to ensure that i reads the correct value.
Example:
ADD F0,F6,F8
SUB F8,F4,F5

Output Dependence or (WAW)
Output dependence occurs between two instructions i and j, iff:
The two instructions write to the same register or memory location.
Ordering of the instructions must be preserved to ensure:
The finally written value corresponds to j.
Example:
ADD F6,F0,F8
MUL F6,F10,F8

Exercise
Identify all the dependences in the following C code:
1. a=b+c;
2. b=c+d;
3. a=a+c;
4. c=b+a;

Hazard Resolution
Name dependences:
Once identified, can be easily eliminated through simple compiler renaming techniques.
Memory-related dependences are difficult to identify:
Hardware techniques (scoreboarding and dynamic instruction scheduling) are being used.
True data dependences:
More difficult to handle.
Cannot be eliminated; can only be overcome!
Many techniques have evolved over the years.

A Solution to WAR and WAW Hazards
Rename Registers:
i1: mul r1, r2, r3;
i2: add r6, r4, r5;
Register renaming can get rid of most false dependencies:
The compiler can do register renaming in the register allocation process (i.e., the process that assigns registers to variables).

Dependences and Hazards
Dependences          Hazards
True data            RAW
Name (output)        WAW
Name (anti)          WAR
Control              Control
------               Structural

Out-of-order Pipelining
[Figure: pipeline with stages IF, ID, RD, EX, WB, where EX splits into parallel units INT, Fadd1-Fadd2, Fmult1-Fmult2-Fmult3 and LD/ST.]
Program order:
Ia: F1 ← F2 x F3
.....
Ib: F1 ← F4 + F5
Out-of-order WB:
Ib: F1 ← F4 + F5
......
Ia: F1 ← F2 x F3

Use of Compiler Techniques to Tackle Data Hazards
A compiler can help eliminate some stalls caused by data hazards:
Example: an instruction that uses the result of a LOAD's destination register should not immediately follow the LOAD instruction.
The technique is called:
Compiler-based pipeline instruction scheduling.

Hardware Techniques to Deal with Hazards
Simple solution:
Stall pipeline.
Pipeline stall:
Lets some instruction(s) in the pipeline proceed; others are made to wait for data, resource, etc.

How to Implement Hazard Control Logic?
In a pipeline,
All data hazards can be checked during the ID phase of the pipeline.
If a data hazard is detected, the next instruction should be stalled.
Whether forwarding is needed can also be determined at this stage, and control signals set.
If a hazard is detected,
The control unit of the pipeline must stall the pipeline and prevent instructions in IF, ID from advancing.

Modern Computer Architectures
Lecture 8: Branch Prediction

Branch Prediction
KEY IDEA: Hope that the branch assumption is correct.
If yes, then we've gained a performance improvement.
Otherwise, discard instructions:
The program is still correct; all we've done is waste a clock cycle.
Two approaches:
Direction Based Prediction
Profile Based Prediction

Direction Based Prediction
Simple to implement.
However, often branch behaviour is variable (dynamic).
Can't capture such behaviour at compile time with simple direction based prediction!
Need history (aka profile)-based prediction.

History-based Branch Prediction
An important example is state-based branch prediction.
Needs 2 parts:
Predictor: to guess if an instruction will branch (and to where).
Recovery mechanism: i.e. a way to fix mistakes.

History-based Branch Prediction (cont.)
One bit predictor:
Use the result from the last time this instruction executed.
Problem:
Even if a branch is almost always taken, we will be wrong twice, rather than once, each time it is not taken.
If a branch alternates between taken and not taken, we get 0% accuracy.

1-bit Predictor
Set bit to 1 or 0:
Depending on whether the branch was Taken (T) or Not-taken (NT).
The pipeline checks the bit value and predicts.
If incorrect, then need to discard the speculatively executed instruction.
The actual outcome is used to set the bit value.

Example
Let initial value = T; the actual outcome of branches is: NT, NT, NT, T, T, T
Predictions are: T, NT, NT, NT, T, T
2 wrong (the first and the fourth), 4 correct = 66% accuracy
2-bit predictors can do even better.
In general, can have k-bit predictors.

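The 1-bit scheme is easy to simulate: predict whatever the branch did last time, then overwrite the bit with the actual outcome. A minimal sketch follows; the function name is illustrative and outcomes are encoded as the strings "T" and "NT".

```python
def one_bit_predictions(outcomes, initial):
    """Return the prediction made before each branch outcome is known."""
    state = initial
    predictions = []
    for actual in outcomes:
        predictions.append(state)  # predict the last observed outcome
        state = actual             # the actual outcome sets the bit
    return predictions

outcomes = ["NT", "NT", "NT", "T", "T", "T"]
predictions = one_bit_predictions(outcomes, "T")
correct = sum(p == a for p, a in zip(predictions, outcomes))
print(predictions, correct, len(outcomes))

# An alternating branch defeats the predictor completely.
alternating = ["T", "NT", "T", "NT"]
alt_correct = sum(p == a for p, a in
                  zip(one_bit_predictions(alternating, "NT"), alternating))
```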
2-bit Dynamic Branch Prediction Scheme
Change prediction only if twice mispredicted:
[Figure: state diagram with four states. States 11 and 10 predict taken; states 01 and 00 predict not taken. Transitions are labeled T (taken) and NT (not taken).]
Adds hysteresis to the decision making process.
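One common way to realise a 2-bit predictor is a saturating counter. The sketch below assumes that variant (count up on taken, down on not taken, predict taken when the counter is 2 or 3); the exact transitions of the state diagram on the slide are not recoverable here and may differ for the two middle states.

```python
def two_bit_predictions(outcomes, state=3):
    """Simulate a 2-bit saturating counter; outcomes are booleans (True = taken)."""
    predictions = []
    for taken in outcomes:
        predictions.append(state >= 2)  # states 2 and 3 predict taken
        if taken:
            state = min(3, state + 1)
        else:
            state = max(0, state - 1)
    return predictions

# A loop branch: taken three times, then not taken at loop exit, run twice.
outcomes = [True, True, True, False] * 2
predictions = two_bit_predictions(outcomes)
mispredictions = sum(p != a for p, a in zip(predictions, outcomes))
print(mispredictions)
```

Only the loop exits are mispredicted; the single not-taken outcome does not flip the prediction for the next loop entry, which is the hysteresis effect.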
An Example of Computing Performance
Program assumptions:
23% loads, and in 1/2 of cases the next instruction uses the load value
13% stores
19% conditional branches
2% unconditional branches
43% other

Example (cont.)
Machine assumptions:
5 stage pipe.
Penalty of 1 cycle on use of load value immediately after a load.
Jumps are resolved in ID stage for a 1 cycle branch penalty.
75% branch prediction accuracy.
1 cycle delay on misprediction.

Example (cont.)
CPI penalty calculation:
Loads:
50% of the 23% of loads have a 1 cycle penalty: 0.5*0.23 = 0.115
Jumps:
All of the 2% of jumps have a 1 cycle penalty: 0.02*1 = 0.02
Conditional branches:
25% of the 19% are mispredicted and have a 1 cycle penalty: 0.25*0.19*1 = 0.0475
Total penalty: 0.115 + 0.02 + 0.0475 = 0.1825
Average CPI: 1 + 0.1825 = 1.1825

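The CPI penalty calculation can be written out as a short script. Variable names are illustrative; the fractions and the 1 cycle penalties are the ones given in the program and machine assumptions of this example.

```python
# Instruction mix and machine assumptions from the example.
load_fraction = 0.23
load_use_fraction = 0.5        # half of loads are followed by a use
jump_fraction = 0.02
cond_branch_fraction = 0.19
misprediction_rate = 0.25      # 75% prediction accuracy
penalty_cycles = 1             # every hazard here costs one cycle

load_penalty = load_use_fraction * load_fraction * penalty_cycles
jump_penalty = jump_fraction * penalty_cycles
branch_penalty = misprediction_rate * cond_branch_fraction * penalty_cycles

total_penalty = load_penalty + jump_penalty + branch_penalty
average_cpi = 1 + total_penalty
print(round(total_penalty, 4), round(average_cpi, 4))
```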
Exploiting Loop-level Parallelism: Motivation
An instruction pipeline essentially exploits ILP within a basic block:
On average the size of a basic block is 7 instructions.
After every 7 instructions, a branch instruction is encountered.
To obtain substantial performance benefits:
ILP across multiple basic blocks needs to be exploited.

Software-based Scheduling vs. Hardware-based Scheduling
Disadvantage with compilers:
In many cases, much information cannot be extracted from the code.
Examples:
Pointers to the same memory location.
Value of the induction variable of a loop.
It is still possible to assist hardware by exposing more ILP:
Rearrange instructions for increased performance.

Loop-level Parallelism
It may be possible to execute different iterations of a loop in parallel.
Example:
for(i=0;i<1000;i++){
  a[i]=a[i]+b[i];
  b[i]=b[i]*2;
}

Problems in Exploiting Loop-level Parallelism
Loop Carried Dependences:
A dependence across different iterations of a loop.
Loop Independent Dependences:
A dependence within the body of the loop itself (i.e. within one iteration).

Loop-level Dependence
Example:
for(i=0;i<1000;i++){
  a[i+1]=b[i]+c[i];
  b[i+1]=a[i+1]+d[i];
}
Loop-carried dependence of one iteration on the preceding iteration: b[i] read here was written as b[i+1] in the previous iteration.
Also, loop-independent dependence on account of a[i+1].

Eliminating Loop-level Dependences Through Code Transformations
We shall examine 3 techniques:
Static loop unrolling
Basic block transformations
Software pipelining

Static Loop Unrolling
- A high proportion of loop instructions are loop management instructions.
- Eliminating this overhead can significantly increase the performance of the loop.
- for(i=1000;i>0;i--){
    a[i]=a[i]+c;
  }


Loop:   L.D     F0,0(R1)     ; F0 = array elem.
        ADD.D   F4,F0,F2     ; add scalar in F2
        S.D     F4,0(R1)     ; store result
        DADDUI  R1,R1,#-8    ; decrement ptr
        BNE     R1,R2,Loop   ; branch if R1 != R2


Static Loop Unrolling (cont.)
Loop:   L.D     F0,0(R1)
        ADD.D   F4,F0,F2
        S.D     F4,0(R1)
        L.D     F6,-8(R1)
        ADD.D   F8,F6,F2
        S.D     F8,-8(R1)
        L.D     F10,-16(R1)
        ADD.D   F12,F10,F2
        S.D     F12,-16(R1)
        L.D     F14,-24(R1)
        ADD.D   F16,F14,F2
        S.D     F16,-24(R1)
        DADDUI  R1,R1,#-32
        BNE     R1,R2,Loop


Static Loop Unrolling (cont.)
Notes on the unrolled loop (n loop bodies for n = 4):
Note the renamed registers (F0/F4, F6/F8, F10/F12, F14/F16). This eliminates dependencies between each of the n loop bodies of different iterations.
Note the adjustments for store and load offsets (0, -8, -16, -24).
Adjusted loop overhead instructions:
        DADDUI  R1,R1,#-32
        BNE     R1,R2,Loop

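The effect of unrolling by 4 can be mimicked in a high-level language to confirm that the unrolled loop computes the same result with a quarter of the loop-control steps. This is a sketch only: the function names are hypothetical, arrays are 0-indexed Python lists rather than memory addressed through R1, and the length is assumed to be a multiple of 4.

```python
def add_scalar_rolled(a, c):
    """One element per iteration, walking downwards like the original loop."""
    out = list(a)
    for i in range(len(out) - 1, -1, -1):
        out[i] = out[i] + c
    return out

def add_scalar_unrolled4(a, c):
    """Four loop bodies per iteration; assumes len(a) is a multiple of 4."""
    out = list(a)
    assert len(out) % 4 == 0
    i = len(out) - 1
    while i >= 0:
        out[i] = out[i] + c          # offset 0
        out[i - 1] = out[i - 1] + c  # offset -8 in the assembly version
        out[i - 2] = out[i - 2] + c  # offset -16
        out[i - 3] = out[i - 3] + c  # offset -24
        i -= 4                       # single adjusted pointer decrement
    return out

data = list(range(1000))
same = add_scalar_rolled(data, 7) == add_scalar_unrolled4(data, 7)
print(same)
```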
Transformation of A Basic Block
It is possible to rewrite a loop to eliminate loop-carried dependences:
Only if there are no cyclic dependences.
With dependence:
for(i=1;i<1000;i++){
  a[i]=a[i]+b[i];
  b[i+1]=c[i]+d[i];
}
Without dependence:
a[1]=a[1]+b[1];
for(i=1;i<999;i++){
  b[i+1]=c[i]+d[i];
  a[i+1]=a[i+1]+b[i+1];
}
b[1000]=c[999]+d[999];

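That the two loop forms compute the same arrays can be checked by running both on the same data. The sketch below is a direct transcription of the two versions into Python; the function names and the test data are made up for illustration, and the arrays are sized 1001 so that index 1000 is valid.

```python
def with_dependence(a, b, c, d):
    """Original loop: a[i] uses b[i] written by the previous iteration."""
    a, b = list(a), list(b)
    for i in range(1, 1000):
        a[i] = a[i] + b[i]
        b[i + 1] = c[i] + d[i]
    return a, b

def without_dependence(a, b, c, d):
    """Transformed loop: the dependence is now within one iteration."""
    a, b = list(a), list(b)
    a[1] = a[1] + b[1]
    for i in range(1, 999):
        b[i + 1] = c[i] + d[i]
        a[i + 1] = a[i + 1] + b[i + 1]
    b[1000] = c[999] + d[999]
    return a, b

n = 1001
a0 = [i for i in range(n)]
b0 = [2 * i + 1 for i in range(n)]
c0 = [3 * i for i in range(n)]
d0 = [(i * i) % 7 for i in range(n)]
equivalent = with_dependence(a0, b0, c0, d0) == without_dependence(a0, b0, c0, d0)
print(equivalent)
```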
Lecture 11: Software Pipelining and Predicated Instructions

Software Pipelining
Eliminates loop-independent dependence through code restructuring.
Reduces stalls.
Helps achieve better performance in pipelined execution.
As compared to simple loop unrolling:
Consumes less code space.

Software Pipelining (cont.)
Central idea: reorganize loops.
Each iteration is made from instructions chosen from different iterations of the original loop.
[Figure: original iterations i0 to i5 drawn as overlapping rows, with one software pipeline iteration cutting across them.]

Software Pipelining (cont.)
Exactly as it happens in a hardware pipeline:
In each iteration of software pipelined code, some instruction of some iteration of the original loop is executed.

Software Pipelining (cont.)
How is this done?
1. Unroll the loop body with an unroll factor of n (we have taken n = 3 for our example).
2. Select the order of instructions from different iterations to pipeline.
3. Paste instructions from different iterations into the new pipelined loop body.

Static Loop Unrolling Example
Loop:   L.D     F0,0(R1)     ; F0 = array elem.
        ADD.D   F4,F0,F2     ; add scalar in F2
        S.D     F4,0(R1)     ; store result
        DADDUI  R1,R1,#-8    ; decrement ptr
        BNE     R1,R2,Loop   ; branch if R1 != R2

Software Pipelining: Step 1
Iteration i:
        L.D     F0,0(R1)
        ADD.D   F4,F0,F2
        S.D     F4,0(R1)
Iteration i + 1:
        L.D     F0,0(R1)
        ADD.D   F4,F0,F2
        S.D     F4,0(R1)
Iteration i + 2:
        L.D     F0,0(R1)
        ADD.D   F4,F0,F2
        S.D     F4,0(R1)
Note:
1.) We are unrolling the loop, hence no loop overhead instructions are needed!
2.) A single loop body of the restructured loop would contain instructions from different iterations of the original loop body.


Software Pipelining: Step 2
Notes:
1.) We'll select the following order in our pipelined loop:
    1. S.D F4,0(R1) from iteration i
    2. ADD.D F4,F0,F2 from iteration i + 1
    3. L.D F0,0(R1) from iteration i + 2
2.) Each instruction (L.D, ADD.D, S.D) must be selected at least once to make sure that we don't leave out any instructions of the original loop in the pipelined loop.


Software Pipelining: Step 3
The Pipelined Loop:
Loop:   S.D     F4,16(R1)
        ADD.D   F4,F0,F2
        L.D     F0,0(R1)
        DADDUI  R1,R1,#-8
        BNE     R1,R2,Loop


Preheader: instructions to fill the software pipeline
Pipelined loop body:
Loop:   S.D     F4,16(R1)    ; M[ i ]
        ADD.D   F4,F0,F2     ; M[ i - 1 ]
        L.D     F0,0(R1)     ; M[ i - 2 ]
        DADDUI  R1,R1,#-8
        BNE     R1,R2,Loop
Postheader: instructions to drain the software pipeline

Software Pipelined Code
Loop:   S.D     F4,16(R1)    ; M[ i ]
        ADD.D   F4,F0,F2     ; M[ i - 1 ]
        L.D     F0,0(R1)     ; M[ i - 2 ]
        DADDUI  R1,R1,#-8
        BNE     R1,R2,Loop

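The fill, steady-state, and drain structure of a software pipelined loop can be illustrated in a high-level language. The following is a sketch under simplifying assumptions: the function name is hypothetical, the array is walked upwards instead of downwards, the variables `loaded` and `summed` stand in for registers F0 and F4, and the array is assumed to hold at least two elements.

```python
def add_scalar_software_pipelined(a, c):
    """Add c to every element using a store/add/load pipelined loop body."""
    n = len(a)
    assert n >= 2
    out = list(a)

    # Preheader: fill the pipeline.
    loaded = out[0]        # load for iteration 0
    summed = loaded + c    # add for iteration 0
    loaded = out[1]        # load for iteration 1

    # Pipelined loop body: each pass mixes three original iterations.
    for i in range(n - 2):
        out[i] = summed        # store for iteration i
        summed = loaded + c    # add for iteration i + 1
        loaded = out[i + 2]    # load for iteration i + 2

    # Postheader: drain the pipeline.
    out[n - 2] = summed
    summed = loaded + c
    out[n - 1] = summed
    return out

result = add_scalar_software_pipelined(list(range(10)), 5)
print(result)
```

Within the loop body the store, add, and load belong to different original iterations, so none of them waits on a result produced in the same pass.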
Software Pipelining Issues
Register management can be tricky.
In more complex examples, we may need to increase the iterations between when data is read and when the results are used.
Optimal software pipelining has been shown to be an NP-complete problem:
Present solutions are based on heuristics.
Software Pipelining versus Loop Unrolling
Software pipelining takes less code space.
Software pipelining and loop unrolling reduce different types of inefficiencies:
Loop unrolling reduces loop management overheads.
Software pipelining allows a pipeline to run at full efficiency by eliminating loop-independent dependencies.

Hardware Support for ILP: Predicated Instructions
Consider:
if (A == 0) {S = T;}
Following MIPS code would be generated:
        BNEZ    R1,L
        ADDU    R2,R3,R0
L:
With predicated instructions:
        CMOVZ   R2,R3,R1     ; if (R1 == 0) move R3 to R2

Advantages of Dynamic Scheduling
Can handle dependences unknown at compile time:
E.g. dependences involving memory references.
Simplifies the compiler.
Allows code compiled for one pipeline to run efficiently on a different pipeline.
Hardware speculation can be used:
Can lead to further performance advantages; builds on dynamic scheduling.

Overview of Dynamic Instruction Scheduling
We shall discuss two schemes for implementing dynamic scheduling:
Scoreboarding: First used in the 1964 CDC 6600 computer.
Tomasulo's Algorithm: Implemented for the FP unit of the IBM 360/91 in 1966.
Since scoreboarding is a little closer to in-order execution, we'll look at it first.

A Point to Note About Dynamic Scheduling
WAR and WAW hazards that did not exist in an in-order pipeline:
Can arise in a dynamically scheduled processor.

Scoreboarding
Scoreboarding allows instructions to execute out of order:
When there are sufficient resources.
Named after the scoreboard:
Originally developed for the CDC 6600.

Scoreboarding: The 5 Stage MIPS Pipeline
Split the ID pipe stage of the simple 5-stage pipeline into 2 stages:
Issue: Decode instructions, check for structural hazards.
Read operands: Wait until no data hazards, then read operands.

Scoreboarding (cont.)
Instructions pass through the issue stage in order.
Instructions can bypass each other in the read operands stage:
Then enter execution out of order.

Scoreboarding Concepts
We had observed that WAR and WAW hazards can occur in out-of-order execution:
Instructions involved in a dependence are stalled.
But instructions having no dependence are allowed to continue.
Different units are kept as busy as possible.

Scoreboarding Concepts
Essence of scoreboarding:
Execute instructions as early as possible.
When an instruction is stalled,
Later instructions are issued and executed if they do not depend on any active or stalled instruction.

A Few More Basic Scoreboarding Concepts
Every instruction goes through the scoreboard:
Scoreboard constructs the data dependences of the instruction.
Scoreboard decides when an instruction can execute.
Scoreboard also controls when an instruction can write its results into the destination register.

Scoreboarding
Out-of-order execution requires multiple instructions to be in the EX stage simultaneously:
Achieved with multiple functional units, along with pipelined functional units.
All instructions go through the scoreboard:
Centralized control of issue, operand reading, execution and writeback.
All hazard resolution is centralized in the scoreboard as well.

A Scoreboard for MIPS
[Figure: Registers connected by data buses (a source of structural hazards) to the functional units FP Mult, FP Mult, FP Divide, FP Add and Integer Unit; the Scoreboard exchanges control/status signals with the registers and the functional units.]

4 Steps of Execution with Scoreboarding
1. Issue: when a f.u. for an instruction is free and no other active instruction has the same destination register:
Avoids structural and WAW hazards.
2. Read operands: when all source operands are available:
Note: forwarding not used.
A source operand is available if no earlier issued active instruction is going to write it.
Thus resolves RAW hazards dynamically.

Steps in Execution with Scoreboarding

3. Execution: begins when the f.u. receives its operands; scoreboard notified when execution completes. 4. Write Result: after WAR hazards have been resolved. Example:

ADD.D cannot proceed to read operands until DIV.D completes; SUB.D can execute but not write back until ADD.D has read F8.
123
DIV.D F0, F2, F4 ADD.D F10, F0, F8 SUB.D F8, F8, F14
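The dependences that the scoreboard has to respect in this example can be listed mechanically. The following is a minimal sketch, not the scoreboard logic itself: the function name find_hazards and the tuple layout are assumptions made for illustration, and the check is purely pairwise (it ignores intervening writes to the same register).

```python
from typing import List, Tuple

# Each instruction: (name, destination register, source registers).
Instr = Tuple[str, str, List[str]]


def find_hazards(program: List[Instr]) -> List[Tuple[str, str, str, str]]:
    """Return (kind, earlier, later, register) for each dependent pair.

    Hypothetical helper for illustration: compares every earlier
    instruction with every later one, register by register.
    """
    hazards = []
    for i, (name_i, dst_i, srcs_i) in enumerate(program):
        for name_j, dst_j, srcs_j in program[i + 1:]:
            if dst_i in srcs_j:      # later instruction reads what earlier writes
                hazards.append(("RAW", name_i, name_j, dst_i))
            if dst_j in srcs_i:      # later instruction overwrites an earlier source
                hazards.append(("WAR", name_i, name_j, dst_j))
            if dst_i == dst_j:       # both write the same register
                hazards.append(("WAW", name_i, name_j, dst_i))
    return hazards


program = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F10", ["F0", "F8"]),
    ("SUB.D", "F8", ["F8", "F14"]),
]
hazards = find_hazards(program)
for hazard in hazards:
    print(hazard)
```

The RAW entry on F0 is what holds ADD.D at the read-operands step, and the WAR entry on F8 is what holds SUB.D at the write-result step.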
An Assessment of Scoreboarding
Pro: Factor of 1.7 improvement for FORTRAN and 2.5 for hand-coded assembly on CDC 6600!
Before semiconductor main memory or caches.
Scoreboard on the CDC 6600:
Required about as much logic as a functional unit -- quite low.
Cons:
Large number of buses needed:
However, if we wish to issue multiple instructions per clock, more wires are needed in any case.
Centralized hardware for hazard resolution.
An Assessment of Scoreboarding (cont.)
Pro: A scoreboard effectively handles true data dependencies:
Minimizes the number of stalls due to true data dependencies.
Con: Anti-dependences and output dependences (WAR and WAW hazards) are also handled using stalls:
Could have been better handled.
Lecture 13: Tomasulo's Algorithm

Mr. B.S.P.Mishra.
KIIT UNIVERSITY
BHUBANESWAR
126
A More Sophisticated Approach: Tomasulo's Algorithm
Developed for IBM 360/91:
Goal: To keep the floating point pipeline as busy as possible.
This led Tomasulo to try to figure out how to achieve renaming in hardware!
The descendants of this have flourished!
Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, Pentium 4
Key Innovations in Dynamic Instruction Scheduling
Reservation stations:
Single entry buffer at the head of each functional unit has been replaced by a multiple entry buffer.
Common Data Bus (CDB):
Connects the output of the functional units to the reservation stations as well as registers.
Register Tags:
Tag corresponds to the reservation station entry number for the instruction producing the result.
Reservation Stations
The basic idea:
An instruction waits in the reservation station until its operands become available.
Helps overcome RAW hazards.
A reservation station fetches and buffers an operand as soon as it is available:
Eliminates the need to get operands from registers.
Tomasulo's Algorithm
Control & buffers distributed with Function Units (FU):
In the form of reservation stations associated with every function unit. Store operands for issued but pending instructions.
Registers in instructions replaced by values or by pointers to reservation stations (RS):
Achieves register renaming. Avoids WAR, WAW hazards without stalling.
Tomasulo's Algorithm (cont.)
Results passed to FUs from RSs:
Not through registers, therefore similar to forwarding. Over Common Data Bus (CDB) that broadcasts results to all FUs.
Loads and Stores:
Treated as FUs with RSs as well.
Integer instructions can go past branches:
Allows FP ops beyond basic block in FP queue.
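Tomasulo's Scheme
[Figure: the instruction unit feeds an instruction queue; the registers, the address unit, the load buffers and store buffers, and the reservation stations in front of the adder and the multiplier are connected to the memory unit and to each other through the Common Data Bus (CDB).]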
Three Stages of Tomasulo's Algorithm
1. Issue: Get instruction from the instruction queue.
Issue instruction only if a matching reservation station is free (no structural hazard). Send operand values from the registers, or the names of the functional units that will produce them (achieves renaming).
2. Execute: Operate on operands (EX).
When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result: Finish execution (WB).
Write on CDB to all awaiting units; mark reservation station available.
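The renaming done at the issue stage can be sketched in a few lines. This is an illustrative model under a loud assumption: no instruction completes while the six instructions of the example that follows are being issued, so every pending source stays a tag. The function name issue_with_renaming, the station pool layout, and the ("V", ...) / ("Q", ...) operand encoding are assumptions, not part of the original algorithm description.

```python
from typing import Dict, List, Tuple


def issue_with_renaming(program, station_pool):
    """Issue instructions in order and rename sources to station tags.

    program: list of (op, station kind, destination, FP source registers).
    station_pool: station kind -> list of free station names.
    Assumption: nothing completes during issue, so tags are never resolved.
    """
    reg_status: Dict[str, str] = {}          # register -> producing station
    free = {kind: list(names) for kind, names in station_pool.items()}
    entries: List[Tuple[str, str, list]] = []
    for op, kind, dst, srcs in program:
        tag = free[kind].pop(0)              # IndexError = structural hazard
        operands = [
            ("Q", reg_status[s]) if s in reg_status else ("V", f"R({s})")
            for s in srcs
        ]
        entries.append((tag, op, operands))
        reg_status[dst] = tag                # later readers see the new tag
    return entries, reg_status


pool = {
    "load": ["Load1", "Load2", "Load3"],
    "add": ["Add1", "Add2", "Add3"],
    "mult": ["Mult1", "Mult2"],
}
program = [
    ("LD", "load", "F6", []),
    ("LD", "load", "F2", []),
    ("MULTD", "mult", "F0", ["F2", "F4"]),
    ("SUBD", "add", "F8", ["F6", "F2"]),
    ("DIVD", "mult", "F10", ["F0", "F6"]),
    ("ADDD", "add", "F6", ["F8", "F2"]),
]
entries, reg_status = issue_with_renaming(program, pool)
for entry in entries:
    print(entry)
print(reg_status)
```

DIVD captures the tag Load1 for F6 before ADDD redirects F6 to Add2, which is how the WAR hazard on F6 disappears without a stall.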
Tomasulo Example

Instruction status (instruction stream):
Instruction   j    k    Issue  Exec Comp  Write Result
LD    F6      34+  R2
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2

Load buffers (3):
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations (3 FP adder R.S., 2 FP multiplier R.S.):
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No
Time is the FU countdown. Vj, Vk hold the values of source operands S1, S2; Qj, Qk name the reservation stations (RS) that will produce them.

Register result status (clock cycle counter = 0): no register among F0, F2, ..., F30 is waiting on a functional unit.
Tomasulo Example Cycle 1

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD    F6      34+  R2   1
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2

Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  No
Load3  No

Reservation stations: Add1, Add2, Add3, Mult1, Mult2 not busy.

Register result status (clock 1): F6 = Load1.

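Tomasulo Example Cycle 2

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD    F6      34+  R2   1
LD    F2      45+  R3   2
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2

Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No

Reservation stations: Add1, Add2, Add3, Mult1, Mult2 not busy.

Register result status (clock 2): F2 = Load2, F6 = Load1.

Note: Can have multiple loads outstanding.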

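Tomasulo Example Cycle 3

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD    F6      34+  R2   1      3
LD    F2      45+  R3   2
MULTD F0      F2   F4   3
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2

Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F4)  Load2
      Mult2  No

Register result status (clock 3): F0 = Mult1, F2 = Load2, F6 = Load1.

Note: register names are removed (renamed) in reservation stations; MULTD issued.
Load1 completing; what is waiting for Load1?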

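Tomasulo Example Cycle 4

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD    F6      34+  R2   1      3          4
LD    F2      45+  R3   2      4
MULTD F0      F2   F4   3
SUBD  F8      F6   F2   4
DIVD  F10     F0   F6
ADDD  F6      F8   F2

Load buffers:
Name   Busy  Address
Load1  No
Load2  Yes   45+R3
Load3  No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   Yes   SUBD   M(A1)                Load2
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F4)  Load2
      Mult2  No

Register result status (clock 4): F0 = Mult1, F2 = Load2, F6 = M(A1), F8 = Add1.

Load2 completing; what is waiting for Load2?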

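Tomasulo Example Cycle 5

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD    F6      34+  R2   1      3          4
LD    F2      45+  R3   2      4          5
MULTD F0      F2   F4   3
SUBD  F8      F6   F2   4
DIVD  F10     F0   F6   5
ADDD  F6      F8   F2

Load buffers: Load1, Load2, Load3 not busy.

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
2     Add1   Yes   SUBD   M(A1)  M(A2)
      Add2   No
      Add3   No
10    Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status (clock 5): F0 = Mult1, F2 = M(A2), F6 = M(A1), F8 = Add1, F10 = Mult2.

Timer starts counting down for Add1, Mult1.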

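Tomasulo Example Cycle 6

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD    F6      34+  R2   1      3          4
LD    F2      45+  R3   2      4          5
MULTD F0      F2   F4   3
SUBD  F8      F6   F2   4
DIVD  F10     F0   F6   5
ADDD  F6      F8   F2   6

Load buffers: Load1, Load2, Load3 not busy.

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
1     Add1   Yes   SUBD   M(A1)  M(A2)
      Add2   Yes   ADDD          M(A2)  Add1
      Add3   No
9     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status (clock 6): F0 = Mult1, F2 = M(A2), F6 = Add2, F8 = Add1, F10 = Mult2.

Issue ADDD here despite name dependency on F6?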

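Tomasulo Example Cycle 7

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD    F6      34+  R2   1      3          4
LD    F2      45+  R3   2      4          5
MULTD F0      F2   F4   3
SUBD  F8      F6   F2   4      7
DIVD  F10     F0   F6   5
ADDD  F6      F8   F2   6

Load buffers: Load1, Load2, Load3 not busy.

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
0     Add1   Yes   SUBD   M(A1)  M(A2)
      Add2   Yes   ADDD          M(A2)  Add1
      Add3   No
8     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status (clock 7): F0 = Mult1, F2 = M(A2), F6 = Add2, F8 = Add1, F10 = Mult2.

Add1 (SUBD) completing; what is waiting for it?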

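Tomasulo Example Cycle 8

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD    F6      34+  R2   1      3          4
LD    F2      45+  R3   2      4          5
MULTD F0      F2   F4   3
SUBD  F8      F6   F2   4      7          8
DIVD  F10     F0   F6   5
ADDD  F6      F8   F2   6

Load buffers: Load1, Load2, Load3 not busy.

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
2     Add2   Yes   ADDD   (M-M)  M(A2)
      Add3   No
7     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status (clock 8): F0 = Mult1, F2 = M(A2), F6 = Add2, F8 = (M-M), F10 = Mult2.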

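Tomasulo Example Cycle 9

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD    F6      34+  R2   1      3          4
LD    F2      45+  R3   2      4          5
MULTD F0      F2   F4   3
SUBD  F8      F6   F2   4      7          8
DIVD  F10     F0   F6   5
ADDD  F6      F8   F2   6

Load buffers: Load1, Load2, Load3 not busy.

Reservation stations: unchanged from cycle 8 except for the timers (Add2 = 1, Mult1 = 6).

Register result status (clock 9): F0 = Mult1, F2 = M(A2), F6 = Add2, F8 = (M-M), F10 = Mult2.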

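Tomasulo Example Cycle 10

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD    F6      34+  R2   1      3          4
LD    F2      45+  R3   2      4          5
MULTD F0      F2   F4   3
SUBD  F8      F6   F2   4      7          8
DIVD  F10     F0   F6   5
ADDD  F6      F8   F2   6      10

Load buffers: Load1, Load2, Load3 not busy.

Reservation stations: unchanged from cycle 8 except for the timers (Add2 = 0, Mult1 = 5).

Register result status (clock 10): F0 = Mult1, F2 = M(A2), F6 = Add2, F8 = (M-M), F10 = Mult2.

Add2 (ADDD) completing; what is waiting for it?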

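Tomasulo Example Cycle 11

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD    F6      34+  R2   1      3          4
LD    F2      45+  R3   2      4          5
MULTD F0      F2   F4   3
SUBD  F8      F6   F2   4      7          8
DIVD  F10     F0   F6   5
ADDD  F6      F8   F2   6      10         11

Load buffers: Load1, Load2, Load3 not busy.

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
4     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status (clock 11): F0 = Mult1, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Mult2.

Write result of ADDD here?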

Tomasulo Example Cycles 12 to 14

Instruction status (identical in cycles 12, 13 and 14):
Instruction   j    k    Issue  Exec Comp  Write Result
LD    F6      34+  R2   1      3          4
LD    F2      45+  R3   2      4          5
MULTD F0      F2   F4   3
SUBD  F8      F6   F2   4      7          8
DIVD  F10     F0   F6   5
ADDD  F6      F8   F2   6      10         11

Load buffers: Load1, Load2, Load3 not busy.

Reservation stations: Mult1 (MULTD) keeps executing and its timer counts down (3, 2, 1 in cycles 12, 13, 14); Mult2 (DIVD) still waits for Mult1.

Register result status (clocks 12 to 14): F0 = Mult1, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Mult2.

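Tomasulo Example Cycle 15

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD    F6      34+  R2   1      3          4
LD    F2      45+  R3   2      4          5
MULTD F0      F2   F4   3      15
SUBD  F8      F6   F2   4      7          8
DIVD  F10     F0   F6   5
ADDD  F6      F8   F2   6      10         11

Load buffers: Load1, Load2, Load3 not busy.

Reservation stations: Mult1 (MULTD) timer reaches 0; Mult2 (DIVD) still waits for Mult1.

Register result status (clock 15): F0 = Mult1, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Mult2.

Mult1 (MULTD) completing; what is waiting for it?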

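Tomasulo Example Cycle 16

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD    F6      34+  R2   1      3          4
LD    F2      45+  R3   2      4          5
MULTD F0      F2   F4   3      15         16
SUBD  F8      F6   F2   4      7          8
DIVD  F10     F0   F6   5
ADDD  F6      F8   F2   6      10         11

Load buffers: Load1, Load2, Load3 not busy.

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
40    Mult2  Yes   DIVD   M*F4   M(A1)

Register result status (clock 16): F0 = M*F4, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Mult2.

Just waiting for Mult2 (DIVD) to complete.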
(skip a couple of cycles)

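Tomasulo Example Cycle 55

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD    F6      34+  R2   1      3          4
LD    F2      45+  R3   2      4          5
MULTD F0      F2   F4   3      15         16
SUBD  F8      F6   F2   4      7          8
DIVD  F10     F0   F6   5
ADDD  F6      F8   F2   6      10         11

Load buffers: Load1, Load2, Load3 not busy.

Reservation stations: only Mult2 (DIVD, Vj = M*F4, Vk = M(A1)) is busy; its timer is down to 1.

Register result status (clock 55): F0 = M*F4, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Mult2.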

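Tomasulo Example Cycle 56

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD    F6      34+  R2   1      3          4
LD    F2      45+  R3   2      4          5
MULTD F0      F2   F4   3      15         16
SUBD  F8      F6   F2   4      7          8
DIVD  F10     F0   F6   5      56
ADDD  F6      F8   F2   6      10         11

Load buffers: Load1, Load2, Load3 not busy.

Reservation stations: only Mult2 (DIVD, Vj = M*F4, Vk = M(A1)) is busy; its timer reaches 0.

Register result status (clock 56): F0 = M*F4, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Mult2.

Mult2 (DIVD) is completing; what is waiting for it?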

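Tomasulo Example Cycle 57

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD    F6      34+  R2   1      3          4
LD    F2      45+  R3   2      4          5
MULTD F0      F2   F4   3      15         16
SUBD  F8      F6   F2   4      7          8
DIVD  F10     F0   F6   5      56         57
ADDD  F6      F8   F2   6      10         11

Load buffers: Load1, Load2, Load3 not busy.

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  Yes   DIVD   M*F4   M(A1)

Register result status (clock 57): F0 = M*F4, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Result.

Once again: In-order issue, out-of-order execution and out-of-order completion.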
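The Issue / Exec Comp / Write Result columns of the example can be reproduced with a very small timing model. This is a sketch under stated assumptions, not a full simulator: one instruction issues per cycle, reservation stations and the CDB are never contended, execution starts the cycle after both issue and the last operand broadcast, and the latencies (load 2, add/subtract 2, multiply 10, divide 40) are read off the timers and cycle numbers of the trace above. The names LATENCY and schedule are illustrative.

```python
# Latencies inferred from the trace (assumption, see the lead-in).
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}


def schedule(program):
    """Return (op, dst, issue, exec_comp, write_result) per instruction."""
    ready_at = {}   # register -> cycle in which its pending result is written
    rows = []
    for issue, (op, dst, srcs) in enumerate(program, start=1):
        # Sources are looked up at issue time, before dst is renamed.
        operands_ready = max([ready_at.get(s, 0) for s in srcs], default=0)
        start = max(issue, operands_ready) + 1
        exec_comp = start + LATENCY[op] - 1
        write = exec_comp + 1
        rows.append((op, dst, issue, exec_comp, write))
        ready_at[dst] = write
    return rows


program = [
    ("LD", "F6", []),
    ("LD", "F2", []),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]
rows = schedule(program)
for row in rows:
    print(row)
```

ADDD writes F6 in cycle 11 while DIVD, which was issued earlier and reads the old F6, does not finish until cycle 57: completion is out of order, yet DIVD still uses the correct operand because it captured it at issue.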
Tomasulo's Scheme: Drawbacks
Performance is limited by CDB:
CDB connects to multiple functional units: high capacitance, high wiring density.
Number of functional units that can complete per cycle limited to one!
Multiple CDBs: more FU logic for parallel stores.
Imprecise exceptions!
Effective handling is a major performance bottleneck.
Interrupts/Exceptions
Interrupts: external, I/O devices, OS.
Exceptions: internal, errors:
Illegal op code, divide by 0, overflow/underflow, page faults.
OS needs to intervene to handle exceptions.
Imprecise Exceptions
An exception is called imprecise when:
The processor state when the exception is raised does not look exactly the same as it would if the instructions had been executed in order.
Imprecise Exceptions
In an out-of-order execution model, an imprecise exception is said to occur if, when an exception is raised by an instruction:
some instructions before it may not be complete
some instructions after it are already complete
For example:
A floating point instruction exception could be detected after an integer instruction that is much later in the program order is complete.
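The two conditions above can be checked mechanically once completion times are known. The sketch below is illustrative only: classify_exception is a hypothetical helper, the completion cycles are the Write Result values of the earlier Tomasulo example, and the scenario (MULTD raising an exception when it finishes executing in cycle 15) is an assumption made for the sake of the example.

```python
def classify_exception(write_cycle, faulting_index, fault_cycle):
    """Return (precise, earlier_incomplete, later_complete).

    write_cycle[i] is the cycle in which instruction i (program order)
    writes its result; the exception is raised by instruction
    faulting_index in cycle fault_cycle.
    """
    earlier_incomplete = [
        i for i in range(faulting_index) if write_cycle[i] > fault_cycle
    ]
    later_complete = [
        i for i in range(faulting_index + 1, len(write_cycle))
        if write_cycle[i] <= fault_cycle
    ]
    precise = not earlier_incomplete and not later_complete
    return precise, earlier_incomplete, later_complete


# Program order: LD, LD, MULTD, SUBD, DIVD, ADDD (write cycles from the trace).
write_cycle = [4, 5, 16, 8, 57, 11]
result = classify_exception(write_cycle, faulting_index=2, fault_cycle=15)
print(result)
```

In this hypothetical scenario SUBD and ADDD (indexes 3 and 5) have already written their results when MULTD faults, so the exception is imprecise.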
Handling Imprecise Exceptions in Dynamic Scheduling
Instructions are issued in-order:

But, may execute out-of-order. However, unless control-dependence is resolved an instruction is not executed. No instruction is allowed to initiate execution until all branches that precede the instructions are complete.

This is a performance bottleneck:

Average basic block size is about 6 instructions.
159
Lecture 14: Dynamic Instruction Scheduling: Loop Example

Mr. SUBHASIS DASH
KIIT UNIVERSITY
BHUBANESWAR
160
Tomasulo's Scheme: Loop Example

Loop: LD    F0, 0(R1)
      MULTD F4, F0, F2
      SD    F4, 0(R1)
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
Tomasulo's Scheme: Loop Example
Assume Multiply takes 4 clocks.
Assume:
1st load takes 8 clocks (L1 cache miss), 2nd load takes 1 clock (hit).
To be clear, we will show clocks for SUBI, BNEZ:
Reality: integer instructions ahead of FP instructions.
Show 2 iterations.
Loop Example

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1
1     MULTD F4      F0   F2
1     SD    F4      0    R1
2     LD    F0      0    R1
2     MULTD F4      F0   F2
2     SD    F4      0    R1

Load and store buffers (store buffers added): Load1, Load2, Load3, Store1, Store2, Store3 not busy.
Reservation stations: Add1, Add2, Add3, Mult1, Mult2 not busy.
R1 = 80

Code:
LD    F0, 0(R1)
MULTD F4, F0, F2
SD    F4, 0(R1)
SUBI  R1, R1, #8
BNEZ  R1, Loop

Register result status (clock 0): no register is waiting on a functional unit.
Loop Example Cycle 1

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1   1

Load and store buffers (busy entries): Load1 (address 80).
Reservation stations: Add1, Add2, Add3, Mult1, Mult2 not busy.
R1 = 80
Register result status (clock 1): F0 = Load1.

Loop Example Cycle 2

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1   1
1     MULTD F4      F0   F2   2

Load and store buffers (busy entries): Load1 (address 80).
Reservation stations (busy entries): Mult1: MULTD, Vk = R(F2), Qj = Load1.
R1 = 80
Register result status (clock 2): F0 = Load1, F4 = Mult1.

Loop Example Cycle 3

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1   1
1     MULTD F4      F0   F2   2
1     SD    F4      0    R1   3

Load and store buffers (busy entries): Load1 (address 80); Store1 (address 80, Fu Mult1).
Reservation stations (busy entries): Mult1: MULTD, Vk = R(F2), Qj = Load1.
Register result status (clock 3): F0 = Load1, F4 = Mult1.

Implicit renaming sets up the data flow graph.

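Loop Example Cycle 4

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1   1
1     MULTD F4      F0   F2   2
1     SD    F4      0    R1   3

Load and store buffers (busy entries): Load1 (address 80); Store1 (address 80, Fu Mult1).
Reservation stations (busy entries): Mult1: MULTD, Vk = R(F2), Qj = Load1.
Register result status (clock 4): F0 = Load1, F4 = Mult1.

Dispatching SUBI instruction (not in FP queue).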

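Loop Example Cycle 5

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1   1
1     MULTD F4      F0   F2   2
1     SD    F4      0    R1   3

Load and store buffers (busy entries): Load1 (address 80); Store1 (address 80, Fu Mult1).
Reservation stations (busy entries): Mult1: MULTD, Vk = R(F2), Qj = Load1.
Register result status (clock 5): F0 = Load1, F4 = Mult1.

BNEZ instruction (not in FP queue).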

Loop Example Cycle 6

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1   1
1     MULTD F4      F0   F2   2
1     SD    F4      0    R1   3
2     LD    F0      0    R1   6

Load and store buffers (busy entries): Load1 (address 80); Load2 (address 72); Store1 (address 80, Fu Mult1).
Reservation stations (busy entries): Mult1: MULTD, Vk = R(F2), Qj = Load1.
Register result status (clock 6): F0 = Load2, F4 = Mult1.

Notice that F0 never sees the load from location 80.

Loop Example Cycle 7

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1   1
1     MULTD F4      F0   F2   2
1     SD    F4      0    R1   3
2     LD    F0      0    R1   6
2     MULTD F4      F0   F2   7

Load and store buffers (busy entries): Load1 (address 80); Load2 (address 72); Store1 (address 80, Fu Mult1).
Reservation stations (busy entries): Mult1: MULTD, Vk = R(F2), Qj = Load1. Mult2: MULTD, Vk = R(F2), Qj = Load2.
R1 = 72
Register result status (clock 7): F0 = Load2, F4 = Mult2.

Register file completely detached from computation.
First and second iteration completely overlapped.

Loop Example Cycle 8

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1   1
1     MULTD F4      F0   F2   2
1     SD    F4      0    R1   3
2     LD    F0      0    R1   6
2     MULTD F4      F0   F2   7
2     SD    F4      0    R1   8

Load and store buffers (busy entries): Load1 (address 80); Load2 (address 72); Store1 (address 80, Fu Mult1); Store2 (address 72, Fu Mult2).
Reservation stations (busy entries): Mult1: MULTD, Vk = R(F2), Qj = Load1. Mult2: MULTD, Vk = R(F2), Qj = Load2.
R1 = 72
Register result status (clock 8): F0 = Load2, F4 = Mult2.

Loop Example Cycle 9

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1   1      9
1     MULTD F4      F0   F2   2
1     SD    F4      0    R1   3
2     LD    F0      0    R1   6
2     MULTD F4      F0   F2   7
2     SD    F4      0    R1   8

Load and store buffers (busy entries): Load1 (address 80); Load2 (address 72); Store1 (address 80, Fu Mult1); Store2 (address 72, Fu Mult2).
Reservation stations (busy entries): Mult1: MULTD, Vk = R(F2), Qj = Load1. Mult2: MULTD, Vk = R(F2), Qj = Load2.
R1 = 72
Register result status (clock 9): F0 = Load2, F4 = Mult2.

Load1 completing: who is waiting? Note: Dispatching SUBI.

Loop Example Cycle 10

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1   1      9          10
1     MULTD F4      F0   F2   2
1     SD    F4      0    R1   3
2     LD    F0      0    R1   6      10
2     MULTD F4      F0   F2   7
2     SD    F4      0    R1   8

Load and store buffers (busy entries): Load2 (address 72); Store1 (address 80, Fu Mult1); Store2 (address 72, Fu Mult2).
Reservation stations (busy entries): Mult1: MULTD, Vj = M[80], Vk = R(F2). Mult2: MULTD, Vk = R(F2), Qj = Load2.
R1 = 64
Register result status (clock 10): F0 = Load2, F4 = Mult2.

Load2 completing: who is waiting? Note: Dispatching BNEZ.

Loop Example Cycle 11

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1   1      9          10
1     MULTD F4      F0   F2   2
1     SD    F4      0    R1   3
2     LD    F0      0    R1   6      10         11
2     MULTD F4      F0   F2   7
2     SD    F4      0    R1   8

Load and store buffers (busy entries): Load3 (address 64); Store1 (address 80, Fu Mult1); Store2 (address 72, Fu Mult2).
Reservation stations (busy entries): Mult1: MULTD, Vj = M[80], Vk = R(F2), time 3. Mult2: MULTD, Vj = M[72], Vk = R(F2), time 4.
R1 = 64
Register result status (clock 11): F0 = Load3, F4 = Mult2.

Next load in sequence.

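Loop Example Cycle 12

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1   1      9          10
1     MULTD F4      F0   F2   2
1     SD    F4      0    R1   3
2     LD    F0      0    R1   6      10         11
2     MULTD F4      F0   F2   7
2     SD    F4      0    R1   8

Load and store buffers (busy entries): Load3 (address 64); Store1 (address 80, Fu Mult1); Store2 (address 72, Fu Mult2).
Reservation stations (busy entries): Mult1: time 2. Mult2: time 3.
Register result status (clock 12): F0 = Load3, F4 = Mult2.

Why not issue third multiply?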

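Loop Example Cycle 13

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1   1      9          10
1     MULTD F4      F0   F2   2
1     SD    F4      0    R1   3
2     LD    F0      0    R1   6      10         11
2     MULTD F4      F0   F2   7
2     SD    F4      0    R1   8

Load and store buffers (busy entries): Load3 (address 64); Store1 (address 80, Fu Mult1); Store2 (address 72, Fu Mult2).
Reservation stations (busy entries): Mult1: time 1. Mult2: time 2.
Register result status (clock 13): F0 = Load3, F4 = Mult2.

Why not issue third store?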

Loop Example Cycle 14

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1   1      9          10
1     MULTD F4      F0   F2   2      14
1     SD    F4      0    R1   3
2     LD    F0      0    R1   6      10         11
2     MULTD F4      F0   F2   7
2     SD    F4      0    R1   8

Load and store buffers (busy entries): Load3 (address 64); Store1 (address 80, Fu Mult1); Store2 (address 72, Fu Mult2).
Reservation stations (busy entries): Mult1: time 0. Mult2: time 1.
Register result status (clock 14): F0 = Load3, F4 = Mult2.

Mult1 completing. Who is waiting?

Loop Example Cycle 15

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1   1      9          10
1     MULTD F4      F0   F2   2      14         15
1     SD    F4      0    R1   3
2     LD    F0      0    R1   6      10         11
2     MULTD F4      F0   F2   7      15
2     SD    F4      0    R1   8

Load and store buffers (busy entries): Load3 (address 64); Store1 (address 80, value [80]*R2); Store2 (address 72, Fu Mult2).
Reservation stations (busy entries): Mult2: MULTD, Vj = M[72], Vk = R(F2). Mult1 is free.
R1 = 64
Register result status (clock 15): F0 = Load3, F4 = Mult2.

Mult2 completing. Who is waiting?

Loop Example Cycle 16

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1   1      9          10
1     MULTD F4      F0   F2   2      14         15
1     SD    F4      0    R1   3
2     LD    F0      0    R1   6      10         11
2     MULTD F4      F0   F2   7      15         16
2     SD    F4      0    R1   8

Load and store buffers (busy entries): Load3 (address 64); Store1 (address 80, value [80]*R2); Store2 (address 72, value [72]*R2).
Reservation stations (busy entries): Mult1: MULTD, Vk = R(F2), Qj = Load3. Mult2 is free.
R1 = 64
Register result status (clock 16): F0 = Load3, F4 = Mult1.

Loop Example Cycle 17

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1   1      9          10
1     MULTD F4      F0   F2   2      14         15
1     SD    F4      0    R1   3
2     LD    F0      0    R1   6      10         11
2     MULTD F4      F0   F2   7      15         16
2     SD    F4      0    R1   8

Load and store buffers (busy entries): Load3 (address 64); Store1 (address 80, value [80]*R2); Store2 (address 72, value [72]*R2); Store3 (address 64, Fu Mult1).
Reservation stations (busy entries): Mult1: MULTD, Vk = R(F2), Qj = Load3.
Register result status (clock 17): F0 = Load3, F4 = Mult1.

Loop Example Cycle 18

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1   1      9          10
1     MULTD F4      F0   F2   2      14         15
1     SD    F4      0    R1   3      18
2     LD    F0      0    R1   6      10         11
2     MULTD F4      F0   F2   7      15         16
2     SD    F4      0    R1   8

Load and store buffers (busy entries): Load3 (address 64); Store1 (address 80, value [80]*R2); Store2 (address 72, value [72]*R2); Store3 (address 64, Fu Mult1).
Reservation stations (busy entries): Mult1: MULTD, Vk = R(F2), Qj = Load3.
Register result status (clock 18): F0 = Load3, F4 = Mult1.

Loop Example Cycle 19

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1   1      9          10
1     MULTD F4      F0   F2   2      14         15
1     SD    F4      0    R1   3      18         19
2     LD    F0      0    R1   6      10         11
2     MULTD F4      F0   F2   7      15         16
2     SD    F4      0    R1   8      19

Load and store buffers (busy entries): Load3 (address 64); Store2 (address 72, value [72]*R2); Store3 (address 64, Fu Mult1).
Reservation stations (busy entries): Mult1: MULTD, Vk = R(F2), Qj = Load3.
Register result status (clock 19): F0 = Load3, F4 = Mult1.

Loop Example Cycle 20

Instruction status:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result
1     LD    F0      0    R1   1      9          10
1     MULTD F4      F0   F2   2      14         15
1     SD    F4      0    R1   3      18         19
2     LD    F0      0    R1   6      10         11
2     MULTD F4      F0   F2   7      15         16
2     SD    F4      0    R1   8      19         20

Load and store buffers (busy entries): Load1 (address 56); Load3 (address 64); Store3 (address 64, Fu Mult1).
Reservation stations (busy entries): Mult1: MULTD, Vk = R(F2), Qj = Load3.
Register result status (clock 20): F0 = Load1, F4 = Mult1.

Once again: In-order issue, out-of-order execution and out-of-order completion.
Why Can Tomasulo's Scheme Overlap Iterations of Loops?
Register renaming using reservation stations:
Avoids the WAR stall that occurred in the scoreboard. Also, multiple iterations use different physical destinations, facilitating dynamic loop unrolling.
Tomasulo's Scheme Offers Three Major Advantages
1. Distribution of hazard detection logic:
Distributed reservation stations.
If multiple instructions wait on a single result, the result can be passed to all of them simultaneously by broadcast on the CDB.
If a centralized register file were used, units would have to read their results from registers.
2. Elimination of stalls for WAW and WAR hazards.
3. Possible to have superscalar execution:
Because results directly available to FUs, rather than from registers.
Superscalar and VLIW Processors

Mr. B.S.P.Mishra, SCHOOL OF TECHNOLOGY
KIIT UNIVERSITY
BHUBANESWAR
186
A Practice Problem on Dependence Analysis

Identify all dependences in the following code.
Transform the code to eliminate the dependences.

for (i = 1; i < 1000; i++) {
    y[i] = x[i] / c;
    x[i] = x[i] + c;
    z[i] = y[i] + c;
    y[i] = c - y[i];
}
Transformed Code Without Dependence

for (i = 1; i < 1000; i++) {
    t[i] = x[i] / c;
    x[i] = x[i] + c;
    z[i] = t[i] + c;
    y[i] = c - t[i];
}
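The effect of the transformation can be checked by listing the dependences inside one iteration of each loop body. The sketch below is illustrative: the function name dependences and the (written array, read names) encoding are assumptions, every access uses the same index i (so only intra-iteration dependences are considered), and the check is pairwise.

```python
def dependences(stmts):
    """List (kind, earlier stmt, later stmt, name) within one iteration.

    stmts: list of (name written, set of names read); statements are
    numbered from 1 in program order.
    """
    found = []
    for a, (w_a, r_a) in enumerate(stmts):
        for b in range(a + 1, len(stmts)):
            w_b, r_b = stmts[b]
            if w_a in r_b:                       # later reads earlier result
                found.append(("true", a + 1, b + 1, w_a))
            if w_b in r_a:                       # later overwrites earlier source
                found.append(("anti", a + 1, b + 1, w_b))
            if w_a == w_b:                       # both write the same name
                found.append(("output", a + 1, b + 1, w_a))
    return found


original = [("y", {"x", "c"}), ("x", {"x", "c"}), ("z", {"y", "c"}), ("y", {"c", "y"})]
transformed = [("t", {"x", "c"}), ("x", {"x", "c"}), ("z", {"t", "c"}), ("y", {"c", "t"})]

print(dependences(original))
print(dependences(transformed))
```

Under this analysis the renaming of y to t removes the output dependence and the anti-dependence on y; the true dependences (now through t) and the anti-dependence on x between the first two statements are still reported.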
Global Code Scheduling
Simple code transformations work well only if:
The loop body is straight-line code.
Issues become more complex in the presence of:
Nested loops, nested branches, etc.
Instructions might have to be moved across branches:
This is called global code scheduling.
Two Paths to Higher ILP
Superscalar processors:
Multiple issue, dynamically scheduled, speculative execution, branch prediction. More hardware functionalities and complexities.
VLIW:
Let the compiler take the complexity. Simple hardware, smart compiler.
Superscalar Execution
Scheduling of instructions is determined by a number of factors:
True Data Dependency: The result of one operation is an input to the next.
Resource constraints: Two operations require the same resource.
Branch Dependency: Scheduling instructions across conditional branch statements cannot be done deterministically a priori.
Superscalar processor of degree m:
An appropriate number of instructions issued.
Very Long Instruction Word (VLIW) Processors
Hardware cost and complexity of superscalar schedulers is a major consideration in processor design.
VLIW processors rely on compile time analysis to identify and bundle together instructions that can be executed concurrently.
These instructions are packed and dispatched together, thus the name very long instruction word.
This concept is employed in the Intel IA64 processors.
VLIW Processors
The compiler has complete responsibility for selecting a set of instructions:
These can be executed concurrently.
VLIW processors have static instruction issue capability:
In comparison, superscalar processors have dynamic issue capability.
The Basic VLIW Approach

VLIW processors deploy multiple independent functional units.
Early VLIW processors operated in lock step:
There was no hazard detection in hardware at all. A stall in any functional unit causes the entire pipeline to stall.
VLIW Processors
Assume a 4-issue static superscalar processor:
During fetch stage, 1 to 4 instructions would be fetched.
The group of instructions that could be issued in a single cycle is called:
An issue packet or a bundle.
If an instruction could cause a structural or data hazard:
It is not issued.
VLIW (Very Long Instruction Word)
One single VLIW instruction:
separately targets different functional units.
MultiFlow TRACE, TI C6X, IA-64

Bundle:
add r1,r2,r3 | load r4,r5+4 | mov r6,r2 | mul r7,r8,r9
FU           | FU           | FU        | FU
Schematic Explanation for a VLIW Instruction
VLIW Processors: Some Considerations

Issue hardware is simpler. Compiler has a bigger context from which to select co-scheduled instructions. Compilers, however, do not have runtime information such as cache misses.

Scheduling is, therefore, inherently conservative. Branch and memory prediction is more difficult.
Typical VLIW processors are limited to 4-way to 8-way parallelism.

197
VLIW Summary
Each instruction is very large:
Bundles multiple operations that are independent.
Compiler detects hazards and determines scheduling.
There is no (or only partial) hardware hazard detection:
No dependence check logic for instructions issued in the same cycle.
Tradeoff instruction space for simple decoding:
The long instruction word has room for many operations. But have to fill with NOP if enough operations cannot be found.
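The compile-time bundling and NOP filling summarized above can be sketched as a greedy, in-order packer. This is a simplified illustration, not how a production VLIW compiler schedules: pack_bundles is a hypothetical function, only register RAW and WAW conflicts inside a bundle are checked, functional-unit types are ignored, and the fifth operation (sub) is invented to force a second bundle; the first four operations are the bundle shown on the earlier slide.

```python
def pack_bundles(ops, width=4):
    """Pack operations, in program order, into bundles of `width` slots.

    ops: list of (text, destination register, source registers).
    A bundle is closed when it is full or when the next operation reads
    or rewrites a register written earlier in the same bundle.
    """
    bundles = []
    current, written = [], set()
    for text, dst, srcs in ops:
        conflict = dst in written or any(s in written for s in srcs)
        if conflict or len(current) == width:
            bundles.append(current + ["nop"] * (width - len(current)))
            current, written = [], set()
        current.append(text)
        written.add(dst)
    if current:
        bundles.append(current + ["nop"] * (width - len(current)))
    return bundles


ops = [
    ("add r1,r2,r3", "r1", ["r2", "r3"]),
    ("load r4,r5+4", "r4", ["r5"]),
    ("mov r6,r2", "r6", ["r2"]),
    ("mul r7,r8,r9", "r7", ["r8", "r9"]),
    ("sub r10,r1,r4", "r10", ["r1", "r4"]),   # hypothetical extra operation
]
bundles = pack_bundles(ops)
for bundle in bundles:
    print(" | ".join(bundle))
```

The second bundle carries one useful operation and three NOPs, which is the instruction-space cost mentioned in the summary.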
VLIW vs Superscalar
VLIW: Compiler finds parallelism.
Superscalar: Hardware finds parallelism.
VLIW: Simpler hardware.
Superscalar: More complex hardware.
VLIW: Less parallelism can be exploited for a typical program.
Superscalar: Better performance.
Superscalar Processors
Commercial desktop processors now do four or more issues per clock:
Even in the embedded processor market, dual issue superscalar pipelines are becoming common.
Superscalar Execution With Dynamic Scheduling
Multiple instruction issue:
Very well accommodated with the dynamic instruction scheduling approach.
The issue stage can be:
Replicated, pipelined, or both.
Limitations of Scalar Pipelines: A Reflection
Maximum throughput bounded by one instruction per cycle.
Inefficient unification of instructions into one pipeline:
ALU, MEM stages very diverse, e.g. FP.
Rigid nature of in-order pipeline:
If a leading instruction is stalled, every subsequent instruction is stalled.

A Rigid Pipeline
[Figure: a stalled instruction in an in-order pipeline; bypassing of the stalled instruction is not allowed, so the stall propagates backward.]
Solving Problems of Scalar Pipelines: Modern Processors
Maximum throughput bounded by one instruction per cycle:
Parallel pipelines (superscalar).
Inefficient unification into a single pipeline:
Diversified pipelines.
Rigid nature of in-order pipeline:
Allow out-of-order execution or dynamic instruction scheduling.
Machine Parallelism
(a) No Parallelism (Nonpipelined) (b) Temporal Parallelism (Pipelined) (c) Spatial Parallelism (Multiple units) (d) Combined Temporal and Spatial Parallelism
205
A Parallel Pipeline
Width = 3
206
Scalar and Parallel Pipeline

(a) The five-stage i486 scalar pipeline (b) The five-stage Pentium Parallel Pipeline of width=2
207
Diversified Parallel Pipeline
208
A Dynamically Scheduled Speculative Pipeline
209
Distributed Reservation Stations
210
A Superscalar Pipeline
A degree six superscalar pipeline
211
Superscalar Pipeline Design
[Figure: instruction flow through Fetch, Instruction Buffer, Decode, Dispatch Buffer, Dispatch, Issuing Buffer and Execute; data flow through Execute, Completion Buffer, Complete, Store Buffer and Retire.]
A Superscalar MIPS Processor
Assume two instructions can be issued per clock cycle:
One of the instructions can be a load, store, or integer ALU operation. The other can be a floating point operation.
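The pairing rule above can be sketched as an in-order dual-issue packer. This is a minimal illustration under stated assumptions: dual_issue is a hypothetical function, each instruction is reduced to its class ("int" for load, store or integer ALU; "fp" for floating point), data dependences are ignored, and the two instructions of a pair may appear in either order in the program.

```python
def dual_issue(classes):
    """Group instruction indexes into issue cycles (int slot, fp slot).

    classes: list of "int" or "fp" in program order. Each cycle takes the
    next instruction and, if the following one is of the other class,
    issues it in the same cycle.
    """
    cycles = []
    i = 0
    while i < len(classes):
        int_slot = fp_slot = None
        if classes[i] == "int":
            int_slot = i
        else:
            fp_slot = i
        i += 1
        if i < len(classes):
            if classes[i] == "int" and int_slot is None:
                int_slot = i
                i += 1
            elif classes[i] == "fp" and fp_slot is None:
                fp_slot = i
                i += 1
        cycles.append((int_slot, fp_slot))
    return cycles


schedule_slots = dual_issue(["int", "fp", "fp", "int", "int", "int"])
print(schedule_slots)
```

Six instructions take four issue cycles here because the last two integer instructions cannot share a cycle; an instruction mix that alternates between the two classes is what lets this design approach two instructions per clock.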
MIPS Pipeline with Pipelined Multi-Cycle Operations
[Figure: IF and ID feed four execution paths that rejoin at MEM and WB: the integer EX stage, a seven-stage multiplier (M1 to M7), a four-stage adder (A1 to A4), and an unpipelined divider (DIV).]
Pipelined implementations, e.g. 7 outstanding MUL, 4 outstanding Add, unpipelined DIV.
In-order execution, out-of-order completion.
Tomasulo w/o ROB: out-of-order execution, out-of-order completion, in-order commit
