
1) Consider the following MIPS assembly code.

    LD    R1, 45(R2)
    DADD  R7, R1, R5
    DSUB  R8, R1, R6
    DADD  R9, R5, R1
    BNEZ  R7, target
    DADD  R10, R8, R5
    DSUB  R2, R3, R4

a) Identify each dependence by type; list the two instructions involved;
identify which instruction is dependent; and, if there is one, name the
storage location involved.
b) Use information about the MIPS five-stage pipeline from Appendix A
and assume a register file that writes in the first half of the clock
cycle and reads in the second half, with forwarding only through the
register file. Which of the dependences that you found in part (a)
become hazards and which do not? Why?
Solution:
a) Dependences:

    Instruction         Type of Dependence                      Storage Location
    DADD R7, R1, R5     RAW (true) dependence: DADD             R1
                        depends on LD
    DSUB R8, R1, R6     RAW (true) dependence: DSUB             R1
                        depends on LD
    DADD R9, R5, R1     RAW (true) dependence: DADD             R1
                        depends on LD
    BNEZ R7, target     RAW (true) dependence: BNEZ             R7
                        depends on DADD
    DADD R10, R8, R5    RAW (true) dependence and control       R8, R7
                        dependence: DADD depends on DSUB
                        and on the branch
    DSUB R2, R3, R4     Control dependence: DSUB depends        R7
                        on the branch
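The register (RAW) dependences in the table can be recovered mechanically. The sketch below is not part of the original solution; it scans the instruction list in program order and reports each instruction that reads a register written by an earlier instruction.

```python
# Sketch (not from the original solution): find RAW register dependences by
# tracking, for each register, the most recent instruction that wrote it.

instructions = [
    ("LD",   ["R1"],  ["R2"]),        # (opcode, dest regs, source regs)
    ("DADD", ["R7"],  ["R1", "R5"]),
    ("DSUB", ["R8"],  ["R1", "R6"]),
    ("DADD", ["R9"],  ["R5", "R1"]),
    ("BNEZ", [],      ["R7"]),
    ("DADD", ["R10"], ["R8", "R5"]),
    ("DSUB", ["R2"],  ["R3", "R4"]),
]

def raw_dependences(instrs):
    """Return (consumer_index, producer_index, register) triples."""
    deps = []
    last_writer = {}                  # register -> index of most recent writer
    for i, (_, dests, srcs) in enumerate(instrs):
        for reg in srcs:
            if reg in last_writer:
                deps.append((i, last_writer[reg], reg))
        for reg in dests:
            last_writer[reg] = i
    return deps

for consumer, producer, reg in raw_dependences(instructions):
    print(f"{instructions[consumer][0]} (#{consumer}) RAW on "
          f"{instructions[producer][0]} (#{producer}) via {reg}")
```

This reports the five register dependences listed above; the control dependences on the branch are not register-based and so do not appear.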

b) Using the MIPS five-stage pipeline, we get the following schedule for our
instructions.
Assumption: the only forwarding is through the register file (write in the
first half of the cycle, read in the second half).

    Instruction         1   2   3   4   5   6   7   8   9   10  11  12  13  14
    LD   R1, 45(R2)     IF  ID  EX  MEM WB
    DADD R7, R1, R5         IF  ID  --  --  EX  MEM WB
    DSUB R8, R1, R6             IF  ID  --  --  EX  MEM WB
    DADD R9, R5, R1                 IF  --  --  ID  EX  MEM WB
    BNEZ R7, target                             IF  ID  EX  MEM WB
    DADD R10, R8, R5                                --  IF  ID  EX  MEM WB
    DSUB R2, R3, R4                                         IF  ID  EX  MEM WB

("--" marks a stall/bubble cycle.)

Dependences that became hazards:
1) LD and DADD: this RAW dependence becomes a data hazard; see the two
stall cycles in the schedule. The later RAW dependences on R1 do not
become hazards because, by the time those instructions read their
registers, LD has already written R1.
2) BNEZ and DADD: this control dependence becomes a hazard because of
the need to check the branch and wait for the calculation of the address
of the next instruction to be executed.
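The schedule above can be reproduced with a small timing model. This is an assumption-laden sketch, not code from the solution: it assumes values pass between instructions only through the register file (so a consumer's register read in ID must happen no earlier than the producer's WB cycle), and that a branch costs one fetch bubble.

```python
# Sketch of an in-order, single-issue 5-stage timing model under the stated
# assumptions: no forwarding paths except write-before-read in the register
# file, and a one-cycle bubble after every branch.

instrs = [
    ("LD",   ["R1"],  ["R2"],       False),
    ("DADD", ["R7"],  ["R1", "R5"], False),
    ("DSUB", ["R8"],  ["R1", "R6"], False),
    ("DADD", ["R9"],  ["R5", "R1"], False),
    ("BNEZ", [],      ["R7"],       True),   # True = branch
    ("DADD", ["R10"], ["R8", "R5"], False),
    ("DSUB", ["R2"],  ["R3", "R4"], False),
]

def schedule(instrs):
    wb_of = {}                        # register -> WB cycle of latest writer
    prev_if, prev_id, after_branch = 0, 0, False
    times = []
    for _, dests, srcs, is_branch in instrs:
        # IF slot frees when the predecessor leaves ID; add a branch bubble.
        fetch = max(prev_if + 1, prev_id) + (1 if after_branch else 0)
        # ID (register read) must wait for every producer's WB cycle.
        decode = max(fetch + 1, prev_id + 1,
                     max((wb_of[r] for r in srcs if r in wb_of), default=0))
        wb = decode + 3               # EX, MEM, then WB
        for r in dests:
            wb_of[r] = wb
        prev_if, prev_id, after_branch = fetch, decode, is_branch
        times.append((fetch, decode, decode + 1, decode + 2, wb))
    return times

times = schedule(instrs)
print("total cycles:", times[-1][-1])   # 14 under these assumptions
```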

2) Construct a version of the table that we have in class for a (1,1)
predictor, assuming the 1-bit predictors are initialized to NT, the
correlation bit is initialized to T, and the value of d (leftmost
column of the table) follows the sequence 0, 1, 2, 0, 1, 2. Also, note
and count the number of instances of misprediction.

Solution:
The table below follows the sequence d = 0, 1, 2, 0, 1, 2, assuming a NT
value for B2 at the beginning. Each prediction entry is written X/Y: the
prediction used if the last branch was not taken / if it was taken. Entries
marked * are steps where a misprediction occurred.
    d    B1         B1       New B1     B2         B2       New B2
         Prediction Action   Prediction Prediction Action   Prediction
    0    NT/NT      NT       NT/NT      NT/NT      NT       NT/NT
    1    NT/NT      T        T/NT*      NT/NT      NT       NT/NT
    2    T/NT       T        T/NT       NT/NT      T        NT/T*
    0    T/NT       NT       T/NT       NT/T       NT       NT/T
    1    T/NT       T        T/NT       NT/T       NT       NT/NT*
    2    T/NT       T        T/NT       NT/NT      T        NT/T*

Total Mispredictions: 4
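The table can be checked with a short simulation. This sketch assumes the standard (1,1) mechanics: each branch keeps two 1-bit predictors, and the outcome of the most recently executed branch (the correlation bit) selects which of the two is consulted and updated.

```python
# Sketch (assumed (1,1)-predictor mechanics): two 1-bit predictors per
# branch, selected by the outcome of the most recent branch.

def simulate_11(actions, init_corr=1):
    """actions: list of (branch_name, taken) with taken in {0, 1}.
    Returns the misprediction count."""
    table = {}           # branch -> [pred if last NT, pred if last T]
    corr = init_corr     # correlation bit: 1 = last branch was taken
    wrong = 0
    for branch, taken in actions:
        preds = table.setdefault(branch, [0, 0])   # 1-bit predictors init NT
        if preds[corr] != taken:
            wrong += 1
        preds[corr] = taken                        # 1-bit predictor update
        corr = taken
    return wrong

# d = 0,1,2,0,1,2; B1/B2 actions taken from the table above (1 = taken)
actions = [("B1", 0), ("B2", 0),    # d = 0
           ("B1", 1), ("B2", 0),    # d = 1
           ("B1", 1), ("B2", 1),    # d = 2
           ("B1", 0), ("B2", 0),    # d = 0
           ("B1", 1), ("B2", 0),    # d = 1
           ("B1", 1), ("B2", 1)]    # d = 2
print(simulate_11(actions))         # 4 mispredictions, as counted above
```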

3) Increasing the size of a branch-prediction buffer makes it less
likely that two branches in a program will share the same predictor.
A single predictor predicting a single branch instruction is generally
more accurate than the same predictor serving more than one branch
instruction.
a) List a sequence of branch taken and not-taken actions to show a
simple example of 1-bit predictor sharing that reduces misprediction
rate.
b) List a sequence of branch taken and not-taken actions to show a
simple example of 1-bit predictor sharing that increases misprediction
rate.
c) Discuss why the sharing of branch predictors can be expected to
increase mispredictions for the long instruction execution sequences of
actual programs.

Solution:
a) Let's consider two branches B1 and B2, executed alternately
(B1, B2, B1, B2, ...), each alternating between TAKEN and NOT TAKEN.
The next table shows the predictions and the mispredictions. With a
private 1-bit predictor each branch would be mispredicted every time
(0% accuracy); because a single predictor is shared here, accuracy
rises to 50% in the steady state (3 of the 8 predictions shown are
correct, and after the initial transient every other prediction is
correct).
    Branch   Action   Prediction   Correct?
    B1       T        NT           No
    B2       NT       T            No
    B1       NT       NT           Yes
    B2       T        NT           No
    B1       T        T            Yes
    B2       NT       T            No
    B1       NT       NT           Yes
    B2       T        NT           No
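A one-line model of the shared 1-bit predictor reproduces the table. This sketch (not from the original solution) assumes the predictor simply remembers the last outcome it saw, regardless of which branch produced it.

```python
# Sketch (assumed model): one shared 1-bit predictor, initialized NT (0),
# serving the interleaved executions of B1 and B2 from part (a).

def shared_1bit(outcomes, init=0):
    pred, results = init, []
    for taken in outcomes:
        results.append(pred == taken)
        pred = taken                  # 1-bit predictor: remember last outcome
    return results

# B1, B2, B1, B2, ...; each branch alternates taken / not taken (1 = taken)
outcomes = [1, 0, 0, 1, 1, 0, 0, 1]
results = shared_1bit(outcomes)
print(results.count(True), "of", len(results), "correct")
```

This prints 3 of 8 correct, matching the table; after the two-step transient, every other prediction is correct.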

b) For this part, let's consider two branches B1 and B2, where B1 is
always TAKEN and B2 is always NOT TAKEN, and we follow the same
interleaving as in part a. If each branch had its own 1-bit predictor,
each would be predicted correctly. Because they share a predictor, the
accuracy of our predictions is 0%; see the table below.

    Branch   Action   Prediction   Correct?
    B1       T        NT           No
    B2       NT       T            No
    B1       T        NT           No
    B2       NT       T            No
    B1       T        NT           No
    B2       NT       T            No
    B1       T        NT           No
    B2       NT       T            No
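The part (b) sequence can be checked the same way. This sketch (again an assumed model, not the original solution's code) runs one shared 1-bit predictor against private per-branch 1-bit predictors, all initialized NT; the private predictors miss only the very first B1 prediction.

```python
# Sketch (assumed model): part (b) sequence, one shared 1-bit predictor vs.
# a private 1-bit predictor per branch. All predictors start NT (0).

seq = [("B1", 1), ("B2", 0)] * 4        # B1 always taken, B2 never taken

shared, shared_hits = 0, 0
private, private_hits = {"B1": 0, "B2": 0}, 0
for branch, taken in seq:
    shared_hits += (shared == taken)    # shared predictor sees every branch
    shared = taken
    private_hits += (private[branch] == taken)
    private[branch] = taken             # private predictor sees one branch

print(f"shared: {shared_hits}/8 correct, private: {private_hits}/8 correct")
```

This prints 0/8 for the shared predictor and 7/8 for the private ones (the single private miss is B1's initial NT prediction), illustrating how sharing turns two perfectly predictable branches into a worst case.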

c) In general terms, if a predictor is shared by a set of branch
instructions, then over the course of program execution the membership
of that set is very likely to change. When a new branch enters the set
or an old one leaves it, the branch-action history represented by the
state of the predictor is unlikely to predict the new set's behavior as
well as it did before. The transient intervals that follow each change
in set membership will therefore tend to reduce the long-term
prediction accuracy of shared predictors.

4) Consider the following loop.

    bar:  L.D    F2, 0(R1)
          MUL.D  F4, F2, F0
          L.D    F6, 0(R2)
          ADD.D  F6, F4, F6
          S.D    F6, 0(R2)
          ADDI   R1, R1, #8
          ADDI   R2, R2, #8
          ADDI   R3, R3, #-8
          BNEZ   R3, bar

a) Assume a single-issue pipeline. Show how the loop would look both
unscheduled by the compiler and after compiler scheduling for both
floating-point operation and branch delays, including any stall or
idle clock cycles. What is the execution time per iteration of
the result, unscheduled and scheduled? How much faster must the
clock be for processor hardware alone to match the performance
improvement achieved by the scheduling compiler? (Neglect any
possible effects of the higher processor clock speed on memory
system performance.)
Unscheduled:

    Instruction           Clock Cycle
    L.D    F2, 0(R1)      1
    (stall)               2
    MUL.D  F4, F2, F0     3
    L.D    F6, 0(R2)      4
    (stall)               5
    (stall)               6
    ADD.D  F6, F4, F6     7
    (stall)               8
    (stall)               9
    S.D    F6, 0(R2)      10
    ADDI   R1, R1, #8     11
    ADDI   R2, R2, #8     12
    ADDI   R3, R3, #-8    13
    (stall)               14
    BNEZ   R3, bar        15
    (stall)               16

Total execution time: 16 clock cycles per iteration.
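The stall pattern can be reproduced with a simple issue-time model. This is a sketch under assumed use-latencies chosen to match the table (they are not stated in the original): "latency" here is the number of cycles from an instruction's issue until a dependent instruction may issue.

```python
# Sketch (assumed latencies, chosen to reproduce the stall pattern above):
# each instruction issues at max(previous issue + 1, cycle at which all of
# its source registers are ready).

LAT = {"L.D": 2, "MUL.D": 4, "ADD.D": 3, "ADDI": 2}

loop = [
    ("L.D",   "F2", ["R1"]),
    ("MUL.D", "F4", ["F2", "F0"]),
    ("L.D",   "F6", ["R2"]),
    ("ADD.D", "F6", ["F4", "F6"]),
    ("S.D",   None, ["F6", "R2"]),
    ("ADDI",  "R1", ["R1"]),
    ("ADDI",  "R2", ["R2"]),
    ("ADDI",  "R3", ["R3"]),
    ("BNEZ",  None, ["R3"]),
]

ready, cycle = {}, 0                    # register -> cycle its value is usable
for op, dst, srcs in loop:
    cycle = max(cycle + 1, max((ready.get(r, 0) for r in srcs), default=0))
    if dst:
        ready[dst] = cycle + LAT[op]
total = cycle + 1                       # +1 for the stall after the branch
print("cycles per iteration:", total)   # 16 under these assumptions
```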

Scheduled:

    Instruction           Clock Cycle
    L.D    F2, 0(R1)      1
    L.D    F6, 0(R2)      2
    MUL.D  F4, F2, F0     3
    ADDI   R1, R1, #8     4
    ADDI   R2, R2, #8     5
    ADDI   R3, R3, #-8    6
    ADD.D  F6, F4, F6     7
    (stall)               8
    BNEZ   R3, bar        9
    S.D    F6, -8(R2)     10   (branch-delay slot; R2 already incremented)

Total execution time: 10 clock cycles per iteration.

How much faster must the clock be to get this improvement using hardware
alone? 16 cycles / 10 cycles = 1.6 times faster than the original clock;
that is, the clock must be 60% faster to match the performance of the
scheduled code on the original hardware.

Unrolled twice and scheduled (used in part b below; the instructions
marked * are the collapsed loop overhead, kept once for the two
iterations, with offsets and increments adjusted accordingly):

    Instruction             Clock Cycle
    L.D    F2, 0(R1)        1
    L.D    F6, 0(R2)        2
    L.D    F8, 8(R1)        3
    L.D    F12, 8(R2)       4
    MUL.D  F4, F2, F0       5
    MUL.D  F10, F8, F0      6
    ADDI   R1, R1, #16      7    *
    ADDI   R3, R3, #-16     8    *
    ADD.D  F6, F4, F6       9
    ADD.D  F12, F10, F12    10
    ADDI   R2, R2, #16      11   *
    S.D    F6, -16(R2)      12
    BNEZ   R3, bar          13   *
    S.D    F12, -8(R2)      14   (branch-delay slot)

Total execution time: 14 clock cycles for two iterations.

b) Assume a single-issue pipeline. Unroll the loop as many times as
necessary to schedule it without any stalls, collapsing the loop
overhead instructions. How many times must the loop be unrolled?
Show the instruction schedule. What is the execution time per
element of the result? What is the major contribution
to the reduction in time per iteration?
Solution:

For this problem, we can unroll the loop 2 times and schedule it to avoid
stalls. It could be unrolled more than 2 times for even better performance,
but the goal here is to find the first unrolled schedule that has no stalls.

In this exercise, we have produced 2 results using 14 cycles, which results
in 7 clock cycles per element. The major advantage of the unrolled case is
that we have eliminated 4 instructions compared to not unrolling; to be
precise, the loop overhead instructions of one of the iterations (the
instructions marked * in the table above). Additionally, with unrolling,
the loop body is better suited to scheduling, which allows the stall cycle
present in the scheduled original loop to be eliminated.
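The per-iteration arithmetic for parts (a) and (b) can be summarized in a few lines, assuming the cycle counts worked out above (16 unscheduled, 10 scheduled, 14 for the twice-unrolled loop):

```python
# Sketch: speedup and cycles-per-element arithmetic from parts (a) and (b),
# using the cycle counts derived above (assumptions, not measured data).

unscheduled, scheduled = 16, 10
unrolled_cycles, unroll_factor = 14, 2

speedup = unscheduled / scheduled              # 1.6: clock must be 60% faster
per_element = unrolled_cycles / unroll_factor  # 7.0 cycles per element
print(speedup, per_element)
```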
