LD    R1, 45(R2)
DADD  R7, R1, R5
DSUB  R8, R1, R6
DADD  R9, R5, R1
BNEZ  R7, target
DADD  R10, R8, R5
DSUB  R2, R3, R4
Instruction         Type of Dependence                              Storage Location
LD   R1, 45(R2)     --                                              --
DADD R7, R1, R5     RAW (true) dependence: DADD depends on LD       R1
DSUB R8, R1, R6     RAW (true) dependence on LD                     R1
DADD R9, R5, R1     RAW (true) dependence on LD                     R1
BNEZ R7, target     RAW (true) dependence on the first DADD         R7
DADD R10, R8, R5    RAW (true) dependence on DSUB; control
                    dependence on BNEZ                              R8, R7
DSUB R2, R3, R4     Control dependence on BNEZ                      R7
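The data dependences in the table can be recovered mechanically. The following sketch classifies every RAW, WAR, and WAW pair in the sequence above (register lists are transcribed from the code; control dependences on the branch are not modeled):

```python
# Each entry: (mnemonic, destination registers, source registers),
# transcribed from the instruction sequence above.
code = [
    ("LD",   ["R1"],  ["R2"]),
    ("DADD", ["R7"],  ["R1", "R5"]),
    ("DSUB", ["R8"],  ["R1", "R6"]),
    ("DADD", ["R9"],  ["R5", "R1"]),
    ("BNEZ", [],      ["R7"]),
    ("DADD", ["R10"], ["R8", "R5"]),
    ("DSUB", ["R2"],  ["R3", "R4"]),
]
deps = []
for j, (opj, dstj, srcj) in enumerate(code):
    for i in range(j):                 # compare against every earlier instruction
        opi, dsti, srci = code[i]
        for r in dstj:
            if r in dsti:
                deps.append(("WAW", i, j, r))   # write after write
            if r in srci:
                deps.append(("WAR", i, j, r))   # write after read (anti)
        for r in srcj:
            if r in dsti:
                deps.append(("RAW", i, j, r))   # read after write (true)
print(deps)
```

Besides the five RAW dependences in the table, this also surfaces the WAR (anti) dependence of the final DSUB on LD through R2, which register renaming could remove.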
b) Using the MIPS five-stage pipeline, we get the following schedule for our instructions.
Assumption: the only forwarding path is through the register file (a register is written in the first half of WB and can be read in ID during the same cycle).
Instruction         1   2   3   4   5   6   7   8   9   10  11  12  13  14
LD   R1, 45(R2)     IF  ID  EXE MEM WB
DADD R7, R1, R5         IF  s   s   ID  EXE MEM WB
DSUB R8, R1, R6             IF  s   s   ID  EXE MEM WB
DADD R9, R5, R1                         IF  ID  EXE MEM WB
BNEZ R7, target                             IF  ID  EXE MEM WB
DADD R10, R8, R5                                s   IF  ID  EXE MEM WB
DSUB R2, R3, R4                                     IF  ID  EXE MEM WB

(s = stall cycle; the instruction after BNEZ loses one fetch cycle to the branch.)

The last instruction completes WB in cycle 14, so the sequence takes 14 clock cycles.
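The stall pattern above can be reproduced with a small timing model. This is a sketch: the only-through-the-register-file rule (a value is readable in ID during its producer's WB cycle) and the one-cycle fetch bubble after the branch are the modeling assumptions; only ID and WB cycles are tracked, since IF, EX, and MEM follow directly from them.

```python
# (mnemonic, destination registers, source registers) for each instruction.
insts = [
    ("LD",   ["R1"],  ["R2"]),
    ("DADD", ["R7"],  ["R1", "R5"]),
    ("DSUB", ["R8"],  ["R1", "R6"]),
    ("DADD", ["R9"],  ["R5", "R1"]),
    ("BNEZ", [],      ["R7"]),
    ("DADD", ["R10"], ["R8", "R5"]),
    ("DSUB", ["R2"],  ["R3", "R4"]),
]
wb_of = {}                 # register -> cycle its value reaches the register file
prev_id, earliest = 0, 2   # the first instruction can decode in cycle 2
for op, dsts, srcs in insts:
    # ID waits for the previous ID to vacate and for every source operand.
    id_c = max(prev_id + 1, earliest,
               *(wb_of.get(r, 0) for r in srcs))
    wb = id_c + 3          # EX, MEM, WB follow back to back
    for d in dsts:
        wb_of[d] = wb      # readable in ID during the WB cycle itself
    print(f"{op:5s} ID={id_c:2d} WB={wb:2d}")
    prev_id = id_c
    earliest = id_c + 2 if op == "BNEZ" else 0   # one-cycle branch bubble
print("total cycles:", wb)
```

Running this prints ID/WB cycles matching the table and a total of 14 cycles.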
Solution:
Consider a sequence where d cycles through the values 0, 1, 2, and assume every prediction entry (in particular B2's) starts at NT. Each entry has the form X/Y: the prediction used if the last branch was not taken / the prediction used if it was taken. A * marks an entry updated after a misprediction.

d                   0      1      2      0      1      2
B1 prediction       NT/NT  NT/NT  T/NT   T/NT   T/NT   T/NT
B1 action           NT     T      T      NT     T      T
New B1 prediction   NT/NT  T/NT*  T/NT   T/NT   T/NT   T/NT
B2 prediction       NT/NT  NT/NT  NT/NT  NT/T   NT/T   NT/NT
B2 action           NT     NT     T      NT     NT     T
New B2 prediction   NT/NT  NT/NT  NT/T*  NT/T   NT/NT* NT/T*

Total mispredictions: 4 (the entries marked with *).
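The table can be checked by simulating the (1,1) correlating predictor directly. This sketch assumes the standard H&P-style code pair behind the table: B1 tests d != 0, d becomes 1 when it was 0, and B2 tests that value != 1; those assumptions reproduce the action rows above.

```python
def outcomes(d):
    """Taken/not-taken outcomes of B1 and B2 for a given d (assumed code pair)."""
    b1 = d != 0              # B1 taken when d != 0
    e = 1 if d == 0 else d   # d is set to 1 when it was 0
    b2 = e != 1              # B2 taken when the updated d != 1
    return b1, b2

# Each branch keeps two 1-bit predictions:
# index 0 = used when the last branch was NT, index 1 = when it was T.
pred = {"B1": [False, False], "B2": [False, False]}
last = False                 # assume the previous outcome starts as NT
mispred = 0
for d in [0, 1, 2, 0, 1, 2]:
    for name, actual in zip(("B1", "B2"), outcomes(d)):
        guess = pred[name][last]
        if guess != actual:
            mispred += 1
            pred[name][last] = actual   # 1-bit entry: flip on a miss
        last = actual
print("mispredictions:", mispred)
```

The simulation mispredicts exactly four times, at the same four points marked with * in the table.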
Solution:
a) Let us consider two branches B1 and B2, executed alternately (B1, B2, B1, B2, ...), each with outcomes that alternate between TAKEN and NOT TAKEN: B1 follows T, NT, T, NT, ... and B2 follows NT, T, NT, T, ... The table below shows the predictions and mispredictions for a single shared 1-bit predictor P, initialized to NT. Note that a private 1-bit predictor on an alternating branch is wrong every time (0% accuracy); because a single predictor is shared here, prediction accuracy improves from 0% to about 50%, since after warm-up B1 is always predicted correctly while B2 never is.

Branch   Action   P (prediction)   Correct Prediction?   New P
B1       T        NT               No                    T
B2       NT       T                No                    NT
B1       NT       NT               Yes                   NT
B2       T        NT               No                    T
B1       T        T                Yes                   T
B2       NT       T                No                    NT
B1       NT       NT               Yes                   NT
B2       T        NT               No                    T

The pattern then repeats: B1 is predicted correctly every time and B2 incorrectly every time, for an overall accuracy approaching 50%.
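A short simulation confirms this interference effect. The sketch assumes, as in the table, that B1's outcomes alternate starting taken, B2's alternate starting not taken, and the shared prediction bit starts at NT:

```python
shared = False                      # the single shared 1-bit prediction (NT)
results = {"B1": [], "B2": []}      # True = correct prediction
b1, b2 = True, False                # first outcomes of B1 and B2
for _ in range(4):                  # four rounds of B1-then-B2
    for name, actual in (("B1", b1), ("B2", b2)):
        results[name].append(shared == actual)
        shared = actual             # 1-bit predictor: remember the last outcome
    b1, b2 = not b1, not b2         # both branches alternate each round
print(results)
```

After the first round B1 is always correct and B2 always wrong: B2's outcome flips the shared bit to exactly the value B1 needs next, and vice versa, so the two branches pin each other into a 50% overall accuracy.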
L.D    F2, 0(R1)
MUL.D  F4, F2, F0
L.D    F6, 0(R2)
ADD.D  F6, F4, F6
S.D    F6, 0(R2)
ADDI   R1, R1, #8
ADDI   R2, R2, #8
ADDI   R3, R3, #-8
BNEZ   R3, bar
a) Assume a single-issue pipeline. Show how the loop would look both
unscheduled by the compiler and after compiler scheduling for both
floating-point operation and branch delays, including any stall or
idle clock cycles. What is the execution time per iteration of
the result, unscheduled and scheduled? How much faster must the
clock be for processor hardware alone to match the performance
improvement achieved by the scheduling compiler? (Neglect any
possible increase in the number of cycles needed for memory system
access, i.e., the effects of a higher processor clock speed on
memory system performance.)
Unscheduled:

Instruction             Clock Cycle Number
L.D    F2, 0(R1)        1
stall                   2
MUL.D  F4, F2, F0       3
L.D    F6, 0(R2)        4
stall                   5
stall                   6
ADD.D  F6, F4, F6       7
stall                   8
stall                   9
S.D    F6, 0(R2)        10
ADDI   R1, R1, #8       11
ADDI   R2, R2, #8       12
ADDI   R3, R3, #-8      13
stall                   14
BNEZ   R3, bar          15
stall (delay slot)      16

Execution time: 16 clock cycles per iteration.
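The stall count follows directly from the classic latencies assumed in this solution: FP load to FP ALU op, 1 stall; FP ALU op to FP ALU op, 3 stalls; FP ALU op to store, 2 stalls; integer ALU to branch, 1 stall; plus one branch delay slot. A minimal sketch under those assumptions:

```python
# (producer kind, consumer kind) -> stall cycles between them.
latency = {("load", "fpalu"): 1, ("fpalu", "fpalu"): 3,
           ("fpalu", "store"): 2, ("int", "branch"): 1}
# (kind, destination, FP/branch sources that matter for stalls).
loop = [
    ("load",   "F2", []),            # L.D   F2, 0(R1)
    ("fpalu",  "F4", ["F2"]),        # MUL.D F4, F2, F0
    ("load",   "F6", []),            # L.D   F6, 0(R2)
    ("fpalu",  "F6", ["F4", "F6"]),  # ADD.D F6, F4, F6
    ("store",  None, ["F6"]),        # S.D   F6, 0(R2)
    ("int",    "R1", []),            # ADDI  R1, R1, #8
    ("int",    "R2", []),            # ADDI  R2, R2, #8
    ("int",    "R3", []),            # ADDI  R3, R3, #-8
    ("branch", None, ["R3"]),        # BNEZ  R3, bar
]
issue = {}                           # register -> (issue cycle, producer kind)
cycle = 0
for kind, dst, srcs in loop:
    cycle += 1                       # one issue slot per cycle
    for s in srcs:
        if s in issue:               # delay until the producer's result is ready
            when, pkind = issue[s]
            cycle = max(cycle, when + 1 + latency.get((pkind, kind), 0))
    if dst:
        issue[dst] = (cycle, kind)
cycle += 1                           # empty branch delay slot
print("cycles per iteration:", cycle)
```

This reproduces the 16 cycles per iteration of the unscheduled table.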
Scheduled:

Instruction             Clock Cycle Number
L.D    F2, 0(R1)        1
L.D    F6, 0(R2)        2
MUL.D  F4, F2, F0       3
ADDI   R1, R1, #8       4
ADDI   R2, R2, #8       5
ADDI   R3, R3, #-8      6
ADD.D  F6, F4, F6       7
stall                   8
BNEZ   R3, bar          9
S.D    F6, -8(R2)       10   (in the branch delay slot; R2 was already incremented)

Total execution cycles: 10.

How much faster must the clock be to get this improvement using hardware alone? Scheduling reduces the iteration from 16 cycles to 10, so the clock would have to be 16/10 = 1.6 times faster for the unscheduled code to match the performance of the scheduled code on the original hardware.

Unrolled twice and scheduled (the remaining loop overhead instructions are marked with *):

Instruction             Clock Cycle Number
L.D    F2, 0(R1)        1
L.D    F8, 8(R1)        2
MUL.D  F4, F2, F0       3
MUL.D  F10, F8, F0      4
L.D    F6, 0(R2)        5
L.D    F12, 8(R2)       6
ADDI   R1, R1, #16      7    *
ADD.D  F6, F4, F6       8
ADD.D  F12, F10, F12    9
ADDI   R3, R3, #-16     10   *
S.D    F6, 0(R2)        11
ADDI   R2, R2, #16      12   *
BNEZ   R3, bar          13   *
S.D    F12, -8(R2)      14   (in the branch delay slot; R2 was already incremented)
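The two scheduled orderings can be checked with the same latency model as before (a sketch; the latencies are the classic assumed values: load to FP ALU 1 stall, FP ALU to FP ALU 3, FP ALU to store 2, integer to branch 1, and the final store occupies the branch delay slot):

```python
latency = {("load", "fpalu"): 1, ("fpalu", "fpalu"): 3,
           ("fpalu", "store"): 2, ("int", "branch"): 1}

def total_cycles(seq):
    """Issue one instruction per cycle, delaying for producer latencies."""
    done, c = {}, 0                  # register -> (issue cycle, producer kind)
    for kind, dst, srcs in seq:
        c += 1
        for s in srcs:
            if s in done:
                when, pk = done[s]
                c = max(c, when + 1 + latency.get((pk, kind), 0))
        if dst:
            done[dst] = (c, kind)
    return c                         # last instruction fills the delay slot

scheduled = [
    ("load",   "F2", []),            # L.D   F2, 0(R1)
    ("load",   "F6", []),            # L.D   F6, 0(R2)
    ("fpalu",  "F4", ["F2"]),        # MUL.D F4, F2, F0
    ("int",    "R1", []),            # ADDI  R1, R1, #8
    ("int",    "R2", []),            # ADDI  R2, R2, #8
    ("int",    "R3", []),            # ADDI  R3, R3, #-8
    ("fpalu",  "F6", ["F4", "F6"]),  # ADD.D F6, F4, F6
    ("branch", None, ["R3"]),        # BNEZ  R3, bar
    ("store",  None, ["F6"]),        # S.D   F6, -8(R2)   (delay slot)
]
unrolled = [
    ("load",   "F2",  []),             ("load",  "F8",  []),
    ("fpalu",  "F4",  ["F2"]),         ("fpalu", "F10", ["F8"]),
    ("load",   "F6",  []),             ("load",  "F12", []),
    ("int",    "R1",  []),
    ("fpalu",  "F6",  ["F4", "F6"]),   ("fpalu", "F12", ["F10", "F12"]),
    ("int",    "R3",  []),
    ("store",  None,  ["F6"]),         # S.D F6, 0(R2)
    ("int",    "R2",  []),
    ("branch", None,  ["R3"]),
    ("store",  None,  ["F12"]),        # S.D F12, -8(R2)  (delay slot)
]
print(total_cycles(scheduled), total_cycles(unrolled))
```

The model confirms 10 cycles for the scheduled single iteration and 14 cycles for the unrolled pair, with no stalls in the unrolled body.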
For this problem, we can unroll the loop at least 2 times and schedule it to avoid stalls. It could be unrolled more than 2 times for still better performance, but the initial goal is to find the first schedule, using unrolling, that has no stalls.
In this exercise we have produced 2 results in 14 cycles, which comes to 7 clock cycles per element. The major advantage of the unrolled case is that we have eliminated 4 instructions compared to not unrolling: precisely, one copy of each loop overhead instruction (the instructions marked with * in the table above), i.e., the overhead of one of the two iterations. Additionally, with unrolling the loop body is more amenable to scheduling, which allows the stall cycle present in the scheduled original loop to be eliminated.
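The arithmetic behind these claims, using the cycle counts from the tables above, is:

```python
# Cycle counts taken from the tables in this solution.
unscheduled, scheduled = 16, 10        # cycles per iteration, single issue
unrolled_cycles, unroll_factor = 14, 2 # cycles per unrolled body, iterations in it

print(unscheduled / scheduled)         # required clock speedup: 1.6
print(unrolled_cycles / unroll_factor) # 7.0 cycles per element when unrolled
```

So scheduling alone is worth a 1.6x faster clock, and unrolling by 2 further cuts the cost from 10 to 7 cycles per element.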