Datapath Control

The Processor: Datapath & Control
We're ready to look at an implementation of the MIPS

Simplified to contain only:
memory-reference instructions: lw, sw
arithmetic-logical instructions: add, sub, and, or, slt
control flow instructions: beq, j
Generic Implementation:
use the program counter (PC) to supply instruction address
get the instruction from memory
read registers
use the instruction to decide exactly what to do
All instructions use the ALU after reading the registers
Why? memory-reference? arithmetic? control flow?
2004 Morgan Kaufmann Publishers 1

More Implementation Details
Abstract / Simplified View:
Two types of functional units:

elements that operate on data values (combinational)
elements that contain state (sequential)

Figure 5.2 The basic implementation of the MIPS subset
including the necessary multiplexers and control lines.

5.3 Building a Datapath

Keywords
Datapath element: A functional unit used to operate on or hold
data within a processor. In the MIPS implementation the datapath
elements include the instruction and data memories, the register
file, the arithmetic logic unit (ALU), and adders.
Program counter (PC): The register containing the address of
the instruction in the program being executed.
Register file: A state element that consists of a set of registers
that can be read and written by supplying a register number to be
accessed.
Sign-extend: To increase the size of a data item by replicating
the high-order sign bit of the original data item in the high-order
bits of the larger, destination data item.
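
The sign-extend definition can be illustrated with a minimal Python sketch. The function name sign_extend and its default widths (16 to 32 bits, as used by the MIPS immediate field) are illustrative choices, not part of the slides.

```python
def sign_extend(value: int, from_bits: int = 16, to_bits: int = 32) -> int:
    """Replicate the sign bit of a from_bits-wide value into a to_bits-wide result."""
    value &= (1 << from_bits) - 1            # keep only the low from_bits bits
    if value & (1 << (from_bits - 1)):       # sign bit set: fill the upper bits with 1s
        upper_ones = ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
        value |= upper_ones
    return value


print(hex(sign_extend(0x0004)))  # positive value: upper bits stay 0
print(hex(sign_extend(0xFFFC)))  # negative value (-4): upper bits become 1
```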

Keywords
Branch target address: The address specified in a branch,
which becomes the new program counter (PC) if the branch is
taken. In the MIPS architecture the branch target is given by the
sum of the offset field of the instruction and the address of the
instruction following the branch.
Branch taken: A branch where the branch condition is satisfied
and the program counter (PC) becomes the branch target. All
unconditional branches are taken branches.
Branch not taken: A branch where the branch condition is false
and the program counter (PC) becomes the address of the
instruction that sequentially follows the branch.
Delayed branch: A type of branch where the instruction
immediately following the branch is always executed, independent
of whether the branch condition is true or false.
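
As a rough sketch of the branch definitions above, the next PC for a beq could be computed as follows. The helper names branch_target and next_pc are hypothetical. The sketch assumes the 16-bit offset is a word offset that is sign-extended and shifted left 2 before being added to PC + 4, as the "Shift left 2" unit in the datapath suggests.

```python
MASK32 = 0xFFFFFFFF


def branch_target(pc: int, offset16: int) -> int:
    """Address of the instruction after the branch plus the scaled, signed offset."""
    offset = offset16 & 0xFFFF
    if offset & 0x8000:          # interpret the 16-bit field as a signed number
        offset -= 0x10000
    return (pc + 4 + (offset << 2)) & MASK32


def next_pc(pc: int, offset16: int, taken: bool) -> int:
    """Branch taken: PC becomes the target. Not taken: PC becomes PC + 4."""
    return branch_target(pc, offset16) if taken else (pc + 4) & MASK32


print(hex(next_pc(0x1000, 3, taken=True)))
print(hex(next_pc(0x1000, 3, taken=False)))
```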

Register File
Built using D flip-flops
[Figure: register file read ports. Read register number 1 and Read register number 2 each control a multiplexor (Mux) that selects among Register 0 ... Register n-1 to produce Read data 1 and Read data 2. The block symbol has inputs Read register number 1, Read register number 2, Write register, Write data, and Write, and outputs Read data 1 and Read data 2.]
Do you understand? What is the Mux above?

Register File
Note: we still use the real clock to determine when to write

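[Figure: register file write port. An n-to-2^n decoder on the Register number, together with the Write signal, determines which of Register 0 ... Register n-1 has its clock input (C) enabled. Register data feeds the D input of every register.]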
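
The read and write behavior of the register file can be modeled with a short Python sketch. The class and method names are illustrative. List indexing stands in for the read multiplexors and the write decoder, and the clock_edge method stands in for the clock. Treating register 0 as always zero is the MIPS convention and is an assumption not shown in the figure.

```python
class RegisterFile:
    """Behavioral model: two read ports, one clocked write port."""

    def __init__(self, n_regs: int = 32):
        self.regs = [0] * n_regs

    def read(self, read_reg1: int, read_reg2: int):
        # Each read port acts like a multiplexor selecting one register's contents.
        return self.regs[read_reg1], self.regs[read_reg2]

    def clock_edge(self, reg_write: bool, write_reg: int, write_data: int) -> None:
        # The write happens only on a clock edge and only when Write is asserted.
        # Assumption: register 0 is hardwired to zero (MIPS convention).
        if reg_write and write_reg != 0:
            self.regs[write_reg] = write_data & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(reg_write=True, write_reg=9, write_data=42)
rf.clock_edge(reg_write=False, write_reg=10, write_data=7)  # ignored: Write not asserted
print(rf.read(9, 10))
```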

Simple Implementation
Include the functional units we need for each instruction
[Figure: a. Instruction memory (input: Instruction address; output: Instruction)  b. Program counter (PC)  c. Adder (output: Sum)]
[Figure: a. Data memory unit (inputs: Address, Write data; output: Read data; controls: MemWrite, MemRead)  b. Sign-extension unit (16-bit input, 32-bit output)]
[Figure: a. Registers (inputs: 5-bit Read register 1, Read register 2, and Write register numbers, plus Write data; outputs: Read data 1, Read data 2; control: RegWrite)  b. ALU (control: 4-bit ALU operation; outputs: Zero, ALU result)]
Why do we need this stuff?

Figure 5.10 The datapath for the memory instructions and
the R-type instructions.

Building the Datapath
Use multiplexors to stitch them together
[Figure: the simple datapath. The PC supplies the read address to the instruction memory. One adder computes PC + 4 and a second adder adds the sign-extended (16 to 32 bits) offset shifted left 2. Multiplexors controlled by PCSrc, ALUSrc, and MemtoReg select the next PC, the second ALU operand, and the register write data. Other control lines: RegWrite, ALU operation (4 bits), MemWrite, MemRead.]

5.4 A Simple Implementation Scheme

Figure B.5.9 A 1-bit ALU that performs AND, OR, and
addition on a and b or on their complements (a-bar and b-bar).

FIGURE B.5.10 (Top) A 1-bit ALU that performs AND,
OR, and addition on a and b or the complement of b (b-bar).

FIGURE B.5.10 (bottom) a 1-bit ALU for the most
significant bit.

FIGURE B.5.11 A 32-bit ALU constructed from the 31 copies of the 1-bit
ALU in the top of Figure B.5.10 and one 1-bit ALU in the bottom of that figure.

FIGURE B.5.12 The final 32-bit ALU. This adds a Zero
detector to Figure B.5.11.

FIGURE B.5.14 The symbol commonly used to represent an
ALU, as shown in Figure B.5.12.
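
A behavioral sketch of the 32-bit ALU with its Zero detector, for the operations in the MIPS subset (and, or, add, sub, slt). The 4-bit operation encodings used here (0000 AND, 0001 OR, 0010 add, 0110 subtract, 0111 set on less than) follow the textbook's convention and are an assumption, since the slides show only the figure captions.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement number."""
    return x - (1 << 32) if x & 0x80000000 else x


def alu(operation: int, a: int, b: int):
    """Return (result, zero) for 32-bit inputs a and b."""
    if operation == 0b0000:
        result = a & b
    elif operation == 0b0001:
        result = a | b
    elif operation == 0b0010:
        result = (a + b) & MASK32
    elif operation == 0b0110:
        result = (a - b) & MASK32
    elif operation == 0b0111:
        result = 1 if to_signed(a) < to_signed(b) else 0
    else:
        raise ValueError(f"unsupported ALU operation {operation:04b}")
    zero = int(result == 0)      # Zero detector: asserted when every result bit is 0
    return result, zero


print(alu(0b0110, 5, 5))   # subtract equal values, as beq does to test equality
```

The Zero output is what the branch logic uses: beq subtracts its two register operands and the branch is taken when Zero is asserted.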

Figure 5.15 The datapath of Figure 5.12 with all necessary
multiplexors and all control lines identified

Control
Simple combinational logic (truth tables)

[Figure: two combinational logic blocks. ALU control block: inputs ALUOp1, ALUOp0 and the funct field F (5-0), of which F3-F0 are used; outputs Operation2, Operation1, Operation0. Main control: inputs Op5-Op0, decoded into R-format, lw, sw, and beq; outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0.]

Figure 5.17 The simple datapath with the control unit.

Figure 5.18 The setting of the control lines is completely
determined by the opcode fields of the instruction.
Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     1       0       0         1         0        0         0       1       0
lw           0       1       1         1         1        0         0       0       0
sw           X       1       X         0         0        1         0       0       0
beq          X       0       X         0         0        0         1       0       1

Figure 5.19 The datapath in operation for an R-type instruction
such as add $t1, $t2, $t3.

Figure 5.20 The datapath in operation for a load instruction.

Figure 5.21 The datapath in operation for a branch equal
instruction.

Figure 5.22 The control function for the simple single-cycle
implementation is completely specified by this truth table.
Input or output  Signal name  R-format  lw  sw  beq
Inputs           Op5          0         1   1   0
                 Op4          0         0   0   0
                 Op3          0         0   1   0
                 Op2          0         0   0   1
                 Op1          0         1   1   0
                 Op0          0         1   1   0
Outputs          RegDst       1         0   X   X
                 ALUSrc       0         1   1   0
                 MemtoReg     0         1   X   X
                 RegWrite     1         1   0   0
                 MemRead      0         1   0   0
                 MemWrite     0         0   1   0
                 Branch       0         0   0   1
                 ALUOp1       1         0   0   0
                 ALUOp0       0         0   0   1
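
The truth table of Figure 5.22 can be written directly as a lookup keyed by the 6-bit opcode (Op5 to Op0). This is a sketch only: the names CONTROL and main_control are illustrative, don't-care entries (X) are represented as None, and ALUOp1/ALUOp0 are combined into one 2-bit value.

```python
# Opcode bits Op5..Op0 from the table: R-format 000000, lw 100011, sw 101011, beq 000100.
CONTROL = {
    0b000000: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                   MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),   # R-format
    0b100011: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                   MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),   # lw
    0b101011: dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),   # sw
    0b000100: dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),   # beq
}


def main_control(instruction: int) -> dict:
    """Decode the control lines from bits 31:26 of a 32-bit instruction word."""
    opcode = (instruction >> 26) & 0x3F
    if opcode not in CONTROL:
        raise ValueError(f"opcode {opcode:06b} is outside the modeled subset")
    return CONTROL[opcode]


# 0x014B4820 encodes add $t1, $t2, $t3 (opcode 0, rs=10, rt=11, rd=9, funct 0x20).
print(main_control(0x014B4820))
```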
Figure 5.23 Instruction format for the jump instruction
(opcode = 2).
Field          000010  address
Bit positions  31:26   25:0
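
A small sketch of how the jump target could be formed from this format: the 26-bit address field is shifted left 2 bits and concatenated with the upper 4 bits of PC + 4. The function name jump_target is hypothetical.

```python
def jump_target(pc: int, instruction: int) -> int:
    """Concatenate (PC + 4)[31:28] with the 26-bit address field shifted left 2."""
    address = instruction & 0x03FFFFFF          # bits 25:0
    upper = (pc + 4) & 0xF0000000               # bits 31:28 of PC + 4
    return upper | (address << 2)


# 0x08000004 is a jump (opcode 000010) whose address field is 4.
print(hex(jump_target(0x10000000, 0x08000004)))
```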

Figure 5.24 The simple control and datapath are extended to
handle the jump instruction.

Problem: Performance of Single-Cycle Machines (p.315)
Assume that the operation times for the major functional units in this implementation
are the following:
Memory units: 200 picoseconds (ps)
ALU and adders: 100 ps
Register file (read or write): 50 ps
Assume that the multiplexors, control unit, PC accesses, sign extension unit, and
wires have no delay. Which of the following implementations would be faster, and by
how much?
1. An implementation in which every instruction operates in 1 clock cycle of a
fixed length.
2. An implementation where every instruction executes in 1 clock cycle
using a variable-length clock, which for each instruction is only as long as it
needs to be.
To compare the performance, assume the following instruction mix: 25% loads, 10%
stores, 45% ALU instructions, 15% branches, and 5% jumps.

Let's start by comparing the CPU execution times.
CPU execution time = Instruction count × CPI × Clock cycle time
Since CPI must be 1, we can simplify this to
CPU execution time = Instruction count × Clock cycle time
The critical path for the different instruction classes is as follows:
Instruction class  Functional units used by the instruction class
R-type             Instruction fetch, Register access, ALU, Register access
Load word          Instruction fetch, Register access, ALU, Memory access, Register access
Store word         Instruction fetch, Register access, ALU, Memory access
Branch             Instruction fetch, Register access, ALU
Jump               Instruction fetch

Using these critical paths, we can compute the required length for
each instruction class:
Instruction  Instruction  Register  ALU        Data    Register
class        memory       read      operation  memory  write     Total
R-type       200          50        100        0       50        400 ps
Load word    200          50        100        200     50        600 ps
Store word   200          50        100        200               550 ps
Branch       200          50        100        0                 350 ps
Jump         200                                                 200 ps
Thus, the average time per instruction with a variable clock is

CPU clock cycle = 600 × 25% + 550 × 10% + 400 × 45% + 350 × 15% + 200 × 5%
                = 447.5 ps

Since the variable clock implementation has a shorter average clock
cycle, it is clearly faster. Let's find the performance ratio:
CPU performance (variable clock) / CPU performance (single clock)
  = CPU execution time (single clock) / CPU execution time (variable clock)
  = (IC × CPU clock cycle (single clock)) / (IC × CPU clock cycle (variable clock))
  = CPU clock cycle (single clock) / CPU clock cycle (variable clock)
  = 600 / 447.5
  = 1.34
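
The arithmetic of this problem can be checked with a short script. The unit delays, critical paths, and instruction mix are taken from the problem statement. The variable names are illustrative, and the mix is kept in whole percents so the sums stay exact.

```python
# Functional unit delays in picoseconds, from the problem statement.
UNIT_PS = {"imem": 200, "reg_read": 50, "alu": 100, "dmem": 200, "reg_write": 50}

# Functional units on the critical path of each instruction class.
PATHS = {
    "R-type": ["imem", "reg_read", "alu", "reg_write"],
    "load":   ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "store":  ["imem", "reg_read", "alu", "dmem"],
    "branch": ["imem", "reg_read", "alu"],
    "jump":   ["imem"],
}

# Instruction mix in percent.
MIX_PERCENT = {"load": 25, "store": 10, "R-type": 45, "branch": 15, "jump": 5}

latency_ps = {cls: sum(UNIT_PS[u] for u in path) for cls, path in PATHS.items()}

fixed_cycle_ps = max(latency_ps.values())        # the clock must fit the slowest class
variable_cycle_ps = sum(MIX_PERCENT[c] * latency_ps[c] for c in MIX_PERCENT) / 100
speedup = fixed_cycle_ps / variable_cycle_ps

print(latency_ps)
print(fixed_cycle_ps, variable_cycle_ps, round(speedup, 2))
```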

5.5 A Multicycle Implementation

Keywords
Multicycle implementation: Also called multiple clock cycle
implementation. An implementation in which an instruction is
executed in multiple clock cycles.
Microprogramming: A symbolic representation of control in the
form of instructions, called microinstructions, that are executed on
a simple micromachine.
Finite state machine: A sequential logic function consisting of a
set of inputs and outputs, a next-state function that maps the
current state and the inputs to a new state, and an output function
that maps the current state and possibly the input to a set of
asserted outputs.
Next-state function: A combinational function that, given the
inputs and the current state, determines the next state of a finite
state machine.
Where we are headed
Single Cycle Problems:

what if we had a more complicated instruction like floating
point?
wasteful of area
One Solution:
use a smaller cycle time
have different instructions take different numbers of cycles
a multicycle datapath:

Multicycle Approach
We will be reusing functional units

ALU used to compute address and to increment PC
Memory used for instruction and data
Our control signals will not be determined directly by the instruction
e.g., what should the ALU do for a subtract instruction?
We'll use a finite state machine for control

Multicycle Approach
Break up the instructions into steps, each step takes a cycle

balance the amount of work to be done
restrict each cycle to use only one major functional unit
At the end of a cycle
store values for use in later cycles (easiest thing to do)
introduce additional internal registers

Figure 5.27 The multicycle datapath from Figure 5.26 with the
control lines shown.

Figure 5.28 The complete datapath for the multicycle
implementation together with the necessary control lines.

Figure 5.29 The action caused by the setting of each control
signal in Figure 5.28 on page 323.
Actions of the 1-bit control signals

RegDst
  Deasserted: The register file destination number for the Write register comes from the rt field.
  Asserted: The register file destination number for the Write register comes from the rd field.
RegWrite
  Deasserted: None.
  Asserted: The general-purpose register selected by the Write register number is written with the value of the Write data input.
ALUSrcA
  Deasserted: The first ALU operand is the PC.
  Asserted: The first ALU operand comes from the A register.
MemRead
  Deasserted: None.
  Asserted: Content of memory at the location specified by the address input is put on Memory data output.
MemWrite
  Deasserted: None.
  Asserted: Memory contents at the location specified by the address input are replaced by the value on the Write data input.
MemtoReg
  Deasserted: The value fed to the register file Write data input comes from ALUOut.
  Asserted: The value fed to the register file Write data input comes from the MDR.
IorD
  Deasserted: The PC is used to supply the address to the memory unit.
  Asserted: ALUOut is used to supply the address to the memory unit.
IRWrite
  Deasserted: None.
  Asserted: The output of the memory is written into the IR.
PCWrite
  Deasserted: None.
  Asserted: The PC is written; the source is controlled by PCSource.
PCWriteCond
  Deasserted: None.
  Asserted: The PC is written if the Zero output from the ALU is also active.

Figure 5.29 (continued)
Actions of the 2-bit control signals

Signal name  Value (binary)  Effect
ALUOp        00              The ALU performs an add operation.
             01              The ALU performs a subtract operation.
             10              The funct field of the instruction determines the ALU operation.
ALUSrcB      00              The second input to the ALU comes from the B register.
             01              The second input to the ALU is the constant 4.
             10              The second input to the ALU is the sign-extended, lower 16 bits of the IR.
             11              The second input to the ALU is the sign-extended, lower 16 bits of the IR shifted left 2 bits.
PCSource     00              Output of the ALU (PC+4) is sent to the PC for writing.
             01              The contents of ALUOut (the branch target address) are sent to the PC for writing.
             10              The jump target address (IR[25:0] shifted left 2 bits and concatenated with PC+4[31:28]) is sent to the PC for writing.

Instructions from ISA perspective
Consider each instruction from perspective of ISA.

Example:
The add instruction changes a register.
Register specified by bits 15:11 of instruction.
Instruction specified by the PC.
New value is the sum (op) of two registers.
Registers specified by bits 25:21 and 20:16 of the instruction.
Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]
In order to accomplish this we must break up the instruction.

(kind of like introducing variables when programming)

Breaking down an instruction
ISA definition of arithmetic:
Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]
Could break down to:

IR <= Memory[PC]
A <= Reg[IR[25:21]]
B <= Reg[IR[20:16]]
ALUOut <= A op B
Reg[IR[15:11]] <= ALUOut
We forgot an important part of the definition of arithmetic!

PC <= PC + 4
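
The broken-down register transfers can be traced in a minimal Python sketch, where the local variables ir, a, b, and alu_out play the role of the temporary registers IR, A, B, and ALUOut. The function name and the dictionary-based memory are illustrative assumptions. The destination register is taken from bits 15:11, as in the ISA definition of the arithmetic instruction.

```python
def execute_r_type(pc: int, memory: dict, reg: list, op) -> int:
    """Run one arithmetic instruction as a sequence of register transfers; return the new PC."""
    ir = memory[pc]                       # IR <= Memory[PC]
    a = reg[(ir >> 21) & 0x1F]            # A <= Reg[IR[25:21]]
    b = reg[(ir >> 16) & 0x1F]            # B <= Reg[IR[20:16]]
    alu_out = op(a, b) & 0xFFFFFFFF       # ALUOut <= A op B
    reg[(ir >> 11) & 0x1F] = alu_out      # Reg[IR[15:11]] <= ALUOut
    return pc + 4                         # PC <= PC + 4


# 0x014B4820 encodes add $t1, $t2, $t3 (rs=10, rt=11, rd=9).
memory = {0: 0x014B4820}
reg = [0] * 32
reg[10], reg[11] = 7, 5
pc = execute_r_type(0, memory, reg, lambda x, y: x + y)
print(pc, reg[9])
```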

Idea behind multicycle approach
We define each instruction from the ISA perspective (do this!)
Break it down into steps following our rule that data flows through
at most one major functional unit (e.g., balance work across steps)
Introduce new registers as needed (e.g., A, B, ALUOut, MDR, etc.)
Finally, try to pack as much work as possible into each step
(avoid unnecessary cycles)
while also trying to share steps where possible
(minimizes control, helps to simplify solution)
Result: Our book's multicycle implementation!

Five Execution Steps
Instruction Fetch
Instruction Decode and Register Fetch
Execution, Memory Address Computation, or Branch Completion
Memory Access or R-type instruction completion
Write-back step
INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!

Step 1: Instruction Fetch
Use PC to get instruction and put it in the Instruction Register.

Increment the PC by 4 and put the result back in the PC.
Can be described succinctly using RTL "Register-Transfer Language"
IR <= Memory[PC];
PC <= PC + 4;
Can we figure out the values of the control signals?
What is the advantage of updating the PC now?

Step 2: Instruction Decode and Register Fetch
Read registers rs and rt in case we need them

Compute the branch address in case the instruction is a branch
RTL:
A <= Reg[IR[25:21]];
B <= Reg[IR[20:16]];
ALUOut <= PC + (sign-extend(IR[15:0]) << 2);
We aren't setting any control lines based on the instruction type

(we are busy "decoding" it in our control logic)

Step 3 (instruction dependent)
ALU is performing one of three functions, based on instruction type
Memory Reference:
ALUOut <= A + sign-extend(IR[15:0]);
R-type:
ALUOut <= A op B;
Branch:
if (A==B) PC <= ALUOut;

Step 4 (R-type or memory-access)
Loads and stores access memory
MDR <= Memory[ALUOut];

or
Memory[ALUOut] <= B;
R-type instructions finish
Reg[IR[15:11]] <= ALUOut;
The write actually takes place on the clock edge at the end of the cycle

Write-back step
Reg[IR[20:16]] <= MDR;
Which instruction needs this?
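
The five steps can be put together in a rough behavioral sketch that executes one instruction and reports how many cycles it used. This is not the book's control implementation: the function name run_instruction, the dictionary-based state, and the restriction to lw, sw, beq, add, and sub are assumptions made for illustration. Instructions and data share one memory, as in the multicycle design.

```python
MASK32 = 0xFFFFFFFF


def sext16(x: int) -> int:
    """Signed value of the low 16 bits of x."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def run_instruction(state: dict) -> int:
    """Execute one instruction on state (keys: pc, reg, mem); return cycles used."""
    reg, mem = state["reg"], state["mem"]
    # Step 1: IR <= Memory[PC]; PC <= PC + 4
    ir = mem[state["pc"]]
    state["pc"] = (state["pc"] + 4) & MASK32
    # Step 2: A <= Reg[rs]; B <= Reg[rt]; ALUOut <= PC + (sign-extend(IR[15:0]) << 2)
    opcode = (ir >> 26) & 0x3F
    rs, rt, rd = (ir >> 21) & 0x1F, (ir >> 16) & 0x1F, (ir >> 11) & 0x1F
    a, b = reg[rs], reg[rt]
    alu_out = (state["pc"] + (sext16(ir) << 2)) & MASK32
    # Step 3 onward depends on the instruction type.
    if opcode == 0x04:                     # beq: if (A == B) PC <= ALUOut
        if a == b:
            state["pc"] = alu_out
        return 3
    if opcode == 0x00:                     # R-type (only add and sub modeled here)
        funct = ir & 0x3F
        if funct == 0x20:
            alu_out = (a + b) & MASK32     # ALUOut <= A op B
        elif funct == 0x22:
            alu_out = (a - b) & MASK32
        else:
            raise ValueError("funct not modeled in this sketch")
        reg[rd] = alu_out                  # Step 4: Reg[IR[15:11]] <= ALUOut
        return 4
    if opcode not in (0x23, 0x2B):
        raise ValueError("opcode not modeled in this sketch")
    alu_out = (a + sext16(ir)) & MASK32    # Step 3: ALUOut <= A + sign-extend(IR[15:0])
    if opcode == 0x2B:                     # sw, Step 4: Memory[ALUOut] <= B
        mem[alu_out] = b
        return 4
    mdr = mem[alu_out]                     # lw, Step 4: MDR <= Memory[ALUOut]
    reg[rt] = mdr                          # Step 5: Reg[IR[20:16]] <= MDR
    return 5


# lw $t0, 16($zero) at address 0, with the data word 99 stored at address 16.
state = {"pc": 0, "reg": [0] * 32, "mem": {0: 0x8C080010, 16: 99}}
print(run_instruction(state), state["reg"][8])
```

In this sketch only the load reaches the write-back step, which is why it is the one class that needs five cycles.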

Summary:

Problem: CPI in a multicycle CPU
Using the SPECINT2000 instruction mix shown in Figure 3.26, what is
the CPI, assuming that each state in the multicycle CPU requires 1
clock cycle?
Answer:
The mix is 25% loads (1% load byte+24% load word), 10% stores (1%
store byte+9% store word), 11% branches (6% beq, 5% bne), 2% jumps
(1% jal+1% jr), and 52% ALU (all the rest of the mix, which we assume to
be ALU instructions). From Figure 5.30 on page 329, the number of clock
cycles for each instruction class is the following:
Loads: 5; Stores: 4; ALU instructions: 4; Branches: 3; Jumps: 3
The CPI is given by the following:
CPI = CPU clock cycles / Instruction count
    = Σ_i (Instruction count_i × CPI_i) / Instruction count
    = Σ_i (Instruction count_i / Instruction count) × CPI_i
The ratio
    Instruction count_i / Instruction count
is simply the instruction frequency for the instruction class i. We
can therefore substitute to obtain
CPI = 0.25 × 5 + 0.10 × 4 + 0.52 × 4 + 0.11 × 3 + 0.02 × 3 = 4.12

This CPI is better than the worst-case CPI of 5.0 when all the
instructions take the same number of clock cycles.
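
The weighted sum can be verified with a few lines of Python. The mix and cycle counts come from the answer above. The variable names are illustrative, and the mix is kept in whole percents so the arithmetic stays exact.

```python
# Instruction mix in percent and clock cycles per instruction class.
MIX_PERCENT = {"loads": 25, "stores": 10, "alu": 52, "branches": 11, "jumps": 2}
CYCLES = {"loads": 5, "stores": 4, "alu": 4, "branches": 3, "jumps": 3}

assert sum(MIX_PERCENT.values()) == 100   # the mix must cover every instruction

# CPI is the frequency-weighted average of the per-class cycle counts.
cpi = sum(MIX_PERCENT[c] * CYCLES[c] for c in MIX_PERCENT) / 100
worst_case_cpi = max(CYCLES.values())

print(cpi, worst_case_cpi)
```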

Figure 5.39 The multicycle datapath with the addition needed
to implement exceptions.
