
Question 1:

Give a high-level view of the pipelined processor datapath and explain its working;
compare the performance of the pipelined datapath with that of the multi-cycle datapath.

Solution:

Instruction pipelining is a technique that implements a form of parallelism called
instruction-level parallelism within a single processor. It therefore allows faster
CPU throughput (the number of instructions that can be executed in a unit of time)
than would otherwise be possible at a given clock rate. The basic instruction cycle
is broken up into a series of stages called a pipeline. Rather than processing each
instruction sequentially (finishing one instruction before starting the next), each
instruction is split into a sequence of steps so that different steps can be executed
in parallel and instructions can be processed concurrently (starting one instruction
before finishing the previous one).
Pipelining increases instruction throughput by performing multiple operations at the
same time, but does not reduce instruction latency, which is the time to complete a
single instruction from start to finish, as it still must go through all steps. Indeed, it may
increase latency due to additional overhead from breaking the computation into
separate steps and worse, the pipeline may stall (or even need to be flushed), further
increasing the latency. Thus, pipelining increases throughput at the cost of latency, and
is frequently used in CPUs but avoided in real-time systems, in which latency is a hard
constraint.
Each instruction is split into a sequence of dependent steps. The first step is always to
fetch the instruction from memory; the final step is usually writing the results of the
instruction to processor registers or to memory. Pipelining seeks to let the processor
work on as many instructions as there are dependent steps, just as an assembly line
builds many vehicles at once, rather than waiting until one vehicle has passed through
the line before admitting the next one. Just as the goal of the assembly line is to
keep each assembler productive at all times, pipelining seeks to keep every portion of
the processor busy with some instruction. Pipelining lets the computer's cycle time be
the time of the slowest step, and ideally lets one instruction complete in every cycle.
The term pipeline is an analogy to the fact that there is fluid in each link of a
pipeline: each part of the processor is occupied with work.
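
The throughput-versus-latency trade-off above can be made concrete with a small
back-of-the-envelope calculation. The Python sketch below is not part of the original
solution; it assumes a classic five-stage pipeline and treats the non-pipelined case
as one instruction at a time (a simplification of a real multi-cycle datapath, which
uses a variable 3-5 cycles per instruction):

```python
# Back-of-the-envelope cycle counts, assuming a classic 5-stage pipeline
# (IF, ID, EXE, MEM, WB) where every stage takes one clock cycle.

STAGES = 5

def sequential_cycles(n_instructions):
    # Non-pipelined: each instruction finishes all stages before the next starts.
    # (Simplification: a real multi-cycle datapath varies cycles per instruction.)
    return n_instructions * STAGES

def pipelined_cycles(n_instructions, stalls=0):
    # The first instruction fills the pipeline (STAGES cycles); every later
    # instruction completes one cycle after the previous one, plus any stalls.
    return STAGES + (n_instructions - 1) + stalls

print(sequential_cycles(7))     # 35: seven instructions, one at a time
print(pipelined_cycles(7))      # 11: ideal pipeline
print(pipelined_cycles(7, 2))   # 13: pipeline with two stall cycles
```

Note how pipelining improves throughput (35 cycles down to 11) even though each
individual instruction still spends all five stages in flight.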

Question 2:
The following lines of code are written in a high-level language:
a = c + d;
b = c + e;

The corresponding instructions for MIPS are:

LW R1, 0(R0)
LW R2, 4(R0)
ADD R3, R1, R2
SW R3, 12(R0)
LW R4, 8(R0)
ADD R5, R1, R4
SW R5, 16(R0)

These instructions are to be executed on a pipelined processor with forwarding.

a. Identify hazards by showing the execution of these instructions on a per-cycle
basis.
b. Reorder these instructions to avoid any pipeline stalls.
c. How many cycles are saved after executing the reordered instructions?

Solution:
a. Identify hazards by showing the execution of these instructions on a
per-cycle basis.
SR. NO.   CODE              ASSEMBLY LINE CODE (RTL)
1         LW R1, 0(R0)      Reg[R1] <- Mem[0 + Reg[R0]]
2         LW R2, 4(R0)      Reg[R2] <- Mem[4 + Reg[R0]]
3         ADD R3, R1, R2    Reg[R3] <- Reg[R1] + Reg[R2]
4         SW R3, 12(R0)     Mem[12 + Reg[R0]] <- Reg[R3]
5         LW R4, 8(R0)      Reg[R4] <- Mem[8 + Reg[R0]]
6         ADD R5, R1, R4    Reg[R5] <- Reg[R1] + Reg[R4]
7         SW R5, 16(R0)     Mem[16 + Reg[R0]] <- Reg[R5]

If there were no hazards, the seven instructions would flow through the pipeline
without interruption and need only 11 cycles:

Instruction       1     2     3     4     5     6     7     8     9     10    11
LW R1, 0(R0)      IF    ID    EXE   MEM   WB
LW R2, 4(R0)            IF    ID    EXE   MEM   WB
ADD R3, R1, R2                IF    ID    EXE   MEM   WB
SW R3, 12(R0)                       IF    ID    EXE   MEM   WB
LW R4, 8(R0)                              IF    ID    EXE   MEM   WB
ADD R5, R1, R4                                  IF    ID    EXE   MEM   WB
SW R5, 16(R0)                                         IF    ID    EXE   MEM   WB

With forwarding, two load-use (RAW) hazards remain: ADD R3 needs R2 in the cycle
right after LW R2 loads it, and ADD R5 likewise needs R4 right after LW R4. Each
hazard forces a one-cycle stall, so execution takes 13 cycles:

Instruction       1     2     3     4     5     6     7     8     9     10    11    12    13
LW R1, 0(R0)      IF    ID    EXE   MEM   WB
LW R2, 4(R0)            IF    ID    EXE   MEM   WB
ADD R3, R1, R2                IF    ID    stall EXE   MEM   WB
SW R3, 12(R0)                       IF    stall ID    EXE   MEM   WB
LW R4, 8(R0)                                    IF    ID    EXE   MEM   WB
ADD R5, R1, R4                                        IF    ID    stall EXE   MEM   WB
SW R5, 16(R0)                                               IF    stall ID    EXE   MEM   WB
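
The hazard identification above can be automated for this small example. The Python
sketch below is illustrative only: the (op, dest, sources) encoding of each
instruction is an assumption, and full forwarding is assumed, so only a load
immediately followed by a consumer of its result causes a stall:

```python
# Hypothetical encoding of the program: (opcode, destination, source registers).
program = [
    ("LW",  "R1", ["R0"]),        # LW R1, 0(R0)
    ("LW",  "R2", ["R0"]),        # LW R2, 4(R0)
    ("ADD", "R3", ["R1", "R2"]),  # ADD R3, R1, R2
    ("SW",  None, ["R3", "R0"]),  # SW R3, 12(R0)
    ("LW",  "R4", ["R0"]),        # LW R4, 8(R0)
    ("ADD", "R5", ["R1", "R4"]),  # ADD R5, R1, R4
    ("SW",  None, ["R5", "R0"]),  # SW R5, 16(R0)
]

def load_use_stalls(prog):
    # With full forwarding, the only remaining stall is the load-use case:
    # a LW whose destination is read by the very next instruction.
    stalls = []
    for i in range(len(prog) - 1):
        op, dest, _ = prog[i]
        _, _, srcs = prog[i + 1]
        if op == "LW" and dest in srcs:
            stalls.append(i + 1)  # index of the instruction that must stall
    return stalls

print(load_use_stalls(program))  # [2, 5] -> ADD R3 and ADD R5 each stall one cycle
```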

b. Reorder these instructions to avoid any pipeline stalls.

SR. NO.   CODE              ASSEMBLY LINE CODE (RTL)
1         LW R1, 0(R0)      Reg[R1] <- Mem[0 + Reg[R0]]
2         LW R2, 4(R0)      Reg[R2] <- Mem[4 + Reg[R0]]
3         LW R4, 8(R0)      Reg[R4] <- Mem[8 + Reg[R0]]
4         ADD R3, R1, R2    Reg[R3] <- Reg[R1] + Reg[R2]
5         SW R3, 12(R0)     Mem[12 + Reg[R0]] <- Reg[R3]
6         ADD R5, R1, R4    Reg[R5] <- Reg[R1] + Reg[R4]
7         SW R5, 16(R0)     Mem[16 + Reg[R0]] <- Reg[R5]

Moving LW R4 up separates each load from its first use, so with forwarding the
pipeline never stalls:

Instruction       1     2     3     4     5     6     7     8     9     10    11
LW R1, 0(R0)      IF    ID    EXE   MEM   WB
LW R2, 4(R0)            IF    ID    EXE   MEM   WB
LW R4, 8(R0)                  IF    ID    EXE   MEM   WB
ADD R3, R1, R2                      IF    ID    EXE   MEM   WB
SW R3, 12(R0)                             IF    ID    EXE   MEM   WB
ADD R5, R1, R4                                  IF    ID    EXE   MEM   WB
SW R5, 16(R0)                                         IF    ID    EXE   MEM   WB
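
Under the same assumptions as before (full forwarding, so only load-use pairs
stall), a short Python check confirms the reordered sequence is stall-free; the
(op, dest, sources) tuples are a hypothetical encoding, not part of the original
solution:

```python
# Reordered program in the same hypothetical (op, dest, sources) encoding.
reordered = [
    ("LW",  "R1", ["R0"]),        # LW R1, 0(R0)
    ("LW",  "R2", ["R0"]),        # LW R2, 4(R0)
    ("LW",  "R4", ["R0"]),        # LW R4, 8(R0)
    ("ADD", "R3", ["R1", "R2"]),  # ADD R3, R1, R2
    ("SW",  None, ["R3", "R0"]),  # SW R3, 12(R0)
    ("ADD", "R5", ["R1", "R4"]),  # ADD R5, R1, R4
    ("SW",  None, ["R5", "R0"]),  # SW R5, 16(R0)
]

def count_stalls(prog):
    # A stall occurs only when a LW result is read by the very next instruction.
    return sum(
        1
        for (op, dest, _), (_, _, srcs) in zip(prog, prog[1:])
        if op == "LW" and dest in srcs
    )

print(count_stalls(reordered))  # 0 -> no pipeline stalls after reordering
```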

c. How many cycles are saved after executing the reordered instructions?

The code before reordering takes 13 clock cycles:

Instruction       1     2     3     4     5     6     7     8     9     10    11    12    13
LW R1, 0(R0)      IF    ID    EXE   MEM   WB
LW R2, 4(R0)            IF    ID    EXE   MEM   WB
ADD R3, R1, R2                IF    ID    stall EXE   MEM   WB
SW R3, 12(R0)                       IF    stall ID    EXE   MEM   WB
LW R4, 8(R0)                                    IF    ID    EXE   MEM   WB
ADD R5, R1, R4                                        IF    ID    stall EXE   MEM   WB
SW R5, 16(R0)                                               IF    stall ID    EXE   MEM   WB
The code after reordering takes 11 clock cycles:

Instruction       1     2     3     4     5     6     7     8     9     10    11
LW R1, 0(R0)      IF    ID    EXE   MEM   WB
LW R2, 4(R0)            IF    ID    EXE   MEM   WB
LW R4, 8(R0)                  IF    ID    EXE   MEM   WB
ADD R3, R1, R2                      IF    ID    EXE   MEM   WB
SW R3, 12(R0)                             IF    ID    EXE   MEM   WB
ADD R5, R1, R4                                  IF    ID    EXE   MEM   WB
SW R5, 16(R0)                                         IF    ID    EXE   MEM   WB
Reordering therefore saves 13 - 11 = 2 cycles.
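
As a cross-check, the total cycle count for N instructions on a k-stage pipeline is
k + (N - 1) + stalls. A minimal Python sketch (not part of the original solution),
with the stall counts read off the diagrams above:

```python
# Cycle-count formula for an in-order pipeline: stages + (instructions - 1) + stalls.
stages, instructions = 5, 7
before = stages + (instructions - 1) + 2   # two load-use stalls before reordering
after  = stages + (instructions - 1)       # no stalls after reordering
print(before, after, before - after)       # 13 11 2
```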

Question 3:
Read the research paper titled An optimizing pipeline stall reduction algorithm for
power and performance on multi-core CPUs, and answer the following questions:
a. How does the proposed Left-Right (LR) algorithm work?
b. Why does the LR algorithm give better results compared to the traditional
in-order and Tomasulo algorithms?

a. How does the proposed Left-Right (LR) algorithm work?

Solution:

Proposed algorithm (LR, Left-Right): We have proposed an algorithm which performs
stall reduction in a Left-Right (LR) manner in sequential instruction execution, as
shown in Figure 1. Our algorithm introduces a hybrid order of instruction execution
in order to reduce power dissipation. More precisely, it executes the instructions
serially, as in in-order execution, until a stall condition is encountered;
thereafter, it uses the concept of out-of-order execution to replace the stall with
an independent instruction. Thus, LR increases throughput by executing independent
instructions while lengthy instructions are still executing in other functional
units or the registers are involved in an ongoing operation. LR also prevents the
hazards that might occur during instruction execution. The instructions are
scheduled statically at compile time, as shown in Figure 2. In our proposed
approach, if the available buffer can hold a certain number of sequential
instructions, our algorithm will generate a sequence in which the instructions
should be executed so as to reduce the number of stalls while maximizing the
throughput of the processor. It is assumed that all instructions are in op-code
source destination format.

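
The LR policy described above can be sketched in a few lines. This is an
illustrative reconstruction from the description, not the paper's actual code: the
(op, dest, sources) encoding, the lr_schedule name, and the single set of pending
registers are all assumptions:

```python
# Sketch of the LR idea: issue in order until the next instruction would stall,
# then scan the buffer left-to-right for an independent instruction instead.

def lr_schedule(buffer, pending):
    """Pick the next instruction from the buffer.

    buffer  -- list of (op, dest, sources) tuples, in program order
    pending -- set of registers whose values are not ready yet
    """
    blocked = set(pending)
    for op, dest, srcs in buffer:
        if not any(r in blocked for r in srcs):
            return (op, dest, srcs)  # independent: issue it in place of a stall
        if dest:
            blocked.add(dest)        # a skipped instruction's result is also unready
    return None                      # nothing independent: the stall is unavoidable

# Example: R2 is still being loaded, so ADD R3 would stall, and SW R3 depends on
# the skipped ADD; LW R4 is the first independent instruction and is issued instead.
buffer = [("ADD", "R3", ["R1", "R2"]),
          ("SW",  None, ["R3", "R0"]),
          ("LW",  "R4", ["R0"])]
print(lr_schedule(buffer, {"R2"}))  # ('LW', 'R4', ['R0'])
```

Tracking the skipped instructions' destinations in `blocked` is what keeps the
out-of-order replacement from introducing new RAW hazards.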

b. Why does the LR algorithm give better results compared to the traditional
in-order and Tomasulo algorithms?
Solution:
Solution:
Comparison of LR vs. Tomasulo algorithm

In this section, the performance and power gain of the LR and the Tomasulo algorithms are
compared.

Simulation and power-performance evaluation
As our baseline configuration, we use an Intel Core i5 dual-core processor with a
2.40 GHz clock frequency and a 64-bit operating system. We also use the Sim-Panalyzer
simulator [25]. The LR, in-order, and Tomasulo algorithms are developed as C programs.
These C programs were compiled using arm-linux-gcc in order to obtain the object files
for each of them, on an ARM microprocessor model.

At the early stage of processor design, various levels of simulators can be used to
estimate power and performance, such as transistor-level, system-level,
instruction-level, and microarchitecture-level simulators. In transistor-level
simulators, one can estimate the voltage and current behaviour over time. This type of
simulator is used for integrated circuit design and is not suitable for large programs.
On the other hand, microarchitecture-level simulators provide power estimation across
cycles, and these are used in modern processors. Our work is similar to this kind of
simulator because our objective is to evaluate the power-performance behaviour of a
microarchitecture-level design abstraction. Although a literature survey suggests
several power estimation tools, such as CACTI and WATTCH [26], we have chosen
Sim-Panalyzer [25] since it provides accurate power modelling by taking into account
both leakage and dynamic power dissipation.
The actual instruction execution of our proposed algorithm against existing ones is
shown in Algorithms 1 and 2. In the LR algorithm, an instruction is executed serially
in order until a stall occurs, and thereafter the out-of-order execution technique
comes into play to replace the stall with an independent instruction stage. Therefore,
in most cases, our proposed algorithm takes fewer cycles of operation and less cycle
time compared to existing algorithms, as shown in Algorithm 2. The comparison of our
proposed algorithm against the Tomasulo algorithm and the in-order algorithm is shown
in Table 1. The next section focuses on the power-performance efficiency of our
proposed algorithm.
