You are on page 1of 21

Chapter 4 (Part II)

The Processor: Datapath and Control


(Enhancing Performance with Pipelining)

陳瑞奇(J.C. Chen)
亞洲大學資訊工程學系

Adapted from class notes by


Prof. M.J. Irwin, PSU and Prof. D. Patterson, UCB

Single Cycle vs. Multiple Cycle Timing


Single Cycle Implementation:

Cycle 1 Cycle 2
Clk

lw sw Waste
multicycle clock
slower than 1/5th of
Multiple Cycle Implementation: single cycle clock
due to stage register
overhead
Clk Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9Cycle 10

lw sw R-type
IFetch Dec Exec Mem WB IFetch Dec Exec Mem IFetch

2
How Can We Make It Even Faster?
 1. Split the multiple instruction cycle into smaller and
smaller steps (stages)

 2. Start fetching and executing the next instruction


before the current one has completed
 Pipelining –modern processors are pipelined for
performance
Multiple instructions can be overlapped in
execution.
 Remember the performance equation:
CPU time = CPI * CCT * IC

 3. Fetch (and execute) more than one instruction at a


time (Parallel Processing; Superscalar)
3

Pipelining(管線化): 觀念來自於生產線輸送帶

http://www.dtdsmt.com/upload/photo/ee99e86b50b43054bfd4217649475fba.jpg 4
Pipelining #1

Pipelining #2

6
Pipelining: It’s Natural!

 Laundry Example
 Ann,Brian, Cathy, Dave A B C D
each have one load of
clothes to wash, dry, fold,
and organize
 Washer takes 30 minutes
 Dryer takes 30 minutes
 “Folder” takes 30 minutes
 “Closet” takes 30 minutes
7

Sequential laundry takes 8 hours for 4 loads!

pipelining
Only takes 3.5 hours for 4 loads!
2.3 times
40/11.5=3.5 times faster faster
Twenty loads (11.5hrs) would take about
5.75 times as long as one load (2 hrs)!
p.261(頁271) Fig. 4.25 10
MIPS instructions classically take five Stages
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

lw IFetch Dec Exec Mem WB

 Five stages, one step per stage


 IFetch: Instruction Fetch
 Dec: Instruction Decode and register file read
 Exec: Execution or address calculation
 Mem: data Memory access
 WB: Write Back (to a register)
11

A Pipelined MIPS Processor


 Start the next instruction before the current one has
completed
 improves throughput - total amount of work done in a
given time
 instruction latency (execution time, delay time,
response time - time from the start of an instruction to
its completion) is not reduced

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8

lw IFetch Dec Exec Mem WB

sw IFetch Dec Exec Mem WB

R-type IFetch Dec Exec Mem WB

- clock cycle (pipeline stage time) is limited by the slowest stage


- for some instructions, some stages are wasted cycles
12
Single Cycle, Multiple Cycle, vs. Pipeline
Single Cycle Implementation:
Cycle 1 Cycle 2
Clk

lw sw Waste

Multiple Cycle Implementation:

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9Cycle 10
Clk
lw sw R-type
IFetch Dec Exec Mem WB IFetch Dec Exec Mem IFetch

Pipeline Implementation:
lw IFetch Dec Exec Mem WB

sw IFetch Dec Exec Mem WB

R-type IFetch Dec Exec Mem WB

13

Pipeline Performance p.264(頁273) Fig. 4.27


Single-cycle (Tc= 800ps)

Pipelined (Tc= 200ps)

15
4.6 The single-cycle datapath from Fig. 4.17

p. 275
(頁285)
Fig. 4.33
•What do we need to add to actually split the datapath into stages?
17

MIPS Pipeline Datapath Modifications


 What do we need to add/modify in our MIPS datapath?
 State registers between each pipeline stage to isolate them
IF:IFetch ID:Decode EX:Execute MEM: WB:
MemAccess WriteBack

64bits 128bits 97bits 64bits


IF/ID ID/EX EX/MEM MEM/WB
Add
32 32 Add
4 Shift
left 2
32 zero
Read Addr 1
Instruction Data
IFetch/Dec

Register Read
Memory Memory
Read Addr 2Data 132
Dec/Exec

Exec/Mem

Read
PC

File Read
Mem/WB

Address ALU Address


32 Write Addr Data
Read 32 32
Data 232 Write Data
Write Data

p.277
Sign
(頁287) 16 Extend 32 32 32
Fig.4.35
System Clock
18
Graphically Representing Pipelines p.276(頁286)
Fig.4.34

Totally
used
R R R W

Not used 19

The first pipe stage (IF) of an instruction (Load)

p.279(頁288)
(Instruction fetch) System clock (Write) Fig.4.36 top
IF

R
System
clock

20
The second pipe stage (ID) of an instruction (Load)
(Instruction decode)
p.279(頁288)
System clock (Write) Dec Write
Fig.4.36 bottom

decoding decoded

System
clock

21

The third pipe stage (EX) of an instruction (Load)

(Execution)
p.280(頁289)
Dec Write EX Write
Fig.4.37

decoded

effective
address

22
The fourth pipe stage (MEM) of an instruction (Load)
(Memory)
p.281(頁290) Fig.4.38 top EX write MEM write

effective
address

23

The fifth pipe stage (WB) of an instruction


(Write back)
System clock
MEM

p.281(頁290) Fig.4.38 bottom 24


The third pipe stage (EX) of an instruction (store)

(Execution)
p.282(頁292) Fig.4.39
Dec Write EX Write

effective
address
store

25

The fourth pipe stage (MEM) of an instruction (store)


(Memory)
p.283(頁293) Fig.4.40 top EX Write MEM Write

effective
address

26
The fifth pipe stage (WB) of an instruction (store)
(Write back)
p.283(頁293) Fig.4.40 bottom
MEM Write

MEM Write

decoded

27

Corrected Datapath to Save RegWrite Addr


 Need to preserve the destination register address in
the pipeline state registers

IF/ID ID/EX EX/MEM

Add
Shift Add MEM/WB
4
left 2
Read Addr 1
Instruction Data
Register Read
Memory Memory
Read Addr 2Data 1
Read
PC

File Address Read


Address Write Addr ALU
Read Data
Data 2 Write Data
Write Data

Sign
16 Extend 32

28
鑰匙怕丟掉,怎麼辦?

http://sy.police.taipei/

放進口袋,帶著走!
29

Corrected Datapath to Save RegWrite Addr


 Need to preserve the destination register address in
the pipeline state registers

IF/ID ID/EX EX/MEM

Add
Shift Add MEM/WB
4
left 2
Read Addr 1
Instruction Data
Register Read
Memory Memory
Read Addr 2Data 1
Read
PC

File Address Read


Address Write Addr ALU
Read Data
Data 2 Write Data
Write Data

Sign
16 Extend 32
rt/rd

p.284(頁294) Fig.4.41
30
Corrected Datapath for Load

p.284(頁294) Fig.4.41
31

Why Pipeline? For Performance!


Time (clock cycles)

Once the
ALU

I Inst 0 IM Reg DM Reg pipeline is full,


n one instruction
s
is completed
ALU

t Inst 1 IM Reg DM Reg


r. every cycle, so
CPI = 1
ALU

O Inst 2 IM Reg DM Reg


r
d
ALU

e Inst 3 IM Reg DM Reg


r
ALU

Inst 4 IM Reg DM Reg

Time to fill the pipeline


ALU

IM Reg DM Reg

34
Multiple-clock-cycle pipeline diagram
Totally
used
R R R W

Not
used

p.286(頁296) Fig. 4.43


35

The single-clock-cycle pipeline diagram


5 4 3 2 1

1
2 rt/rd
3
4
5 p.287(頁297) Fig. 4.45 Clock 5 37
Includes control lines 4 control 3 control
Lines for 2 control
Lines for Lines for
EXE MEM WB

rt

p.289(頁299) rd

Fig. 4.46
39









剛 http://www.hbyty.com/images/product_images/info_images/001EYD000001-2.jpg

http://www.chara-net.com/images-item-big/ref4-7522.jpg
40
https://www.youtube.com/watch?v=Ojzp_zv5dwg 41

2 control
Lines for
WB

decoding
4 control 3 control
Lines for Lines for
EXE MEM

p.291(頁301) Fig. 4.50


42
https://farm4.staticflickr.com/3691/13534389474_87e00fdb79.jpg

http://pic.pimg.tw/cgboy26/1404897101-3238112107.jpg
43

Pipelined Control

p.292(頁302)
Fig. 4.51 45
The BIG Picture
Pipeline Summary
 All modern day processors use pipelining
 Pipeliningdoesn’t help latency of single task, it
helps throughput of entire workload
 Potential speedup: a CPI of 1 and fast a CC
 Pipeline rate limited by slowest pipeline stage
 Unbalanced pipe stages makes for
inefficiencies
 Must detect and resolve hazards
 Stalling negatively affects CPI (makes CPI
less than the ideal of 1)

56

第三次作業:第四章前半部習題 (Due in 2 weeks)


4.1考慮下列指令:
指令:AND Rd,Rs,Rt
說明:Reg[Rd] = Reg[Rs] AND Reg[Rt]
4.1.1(15%)圖4.2中的控制器為了上述指令所產生的控制訊號其值為何?
4.1.2(5%)哪些資源(區塊)會為該指令做出有用的功能?
4.1.3(5%)哪些資源(區塊)會產生並不被該指令用到的輸出?
哪些資源(區塊)並不會對該指令產生輸出?

58
4.4本習題的問題假設所需用以製作處理器資料通道的邏輯區塊具有以下延遲:
I-Mem Add Mux ALU Regs D-Mem Sign- Shift-Left-2
Extend
200ps 70ps 20ps 90ps 90ps 250ps 15ps 10ps
4.4.1(10%)設若我們在處理器中唯一需要做的事是擷取連續的指令(圖4.6),
則時脈週期時間可以為若干?
4.4.2(10%)考慮一個類似圖4.11中所示的資料通道,然其處理器只有一種類型
的指令:無條件PC-相對位址的分支。對於這個資料通道的週期時間為若干?
4.4.3(10%)當我們僅需支援有條件PC-相對位址的分支時,重複4.4.2題
本習題中剩下的三題與資料通道中的Shift-left-2有關:
4.4.4(5%)哪些指令需要用到這項資源?
4.4.5(5%)這項資源對哪些種指令(如果有的話)會位於關鍵路徑上?
4.4.6(5%)假設我們僅支援beq及add指令,討論這項資源在延遲上的變化會如
何影響處理器的週期時間。假設其他資源的延遲不變。

Add
4

Instruction
Memory

PC Read Instruction
Address

圖4.6 59

圖4.11
PCSrc

ALUSrc ALU operation


MemWrite
MemtoReg

RegWrite

MemRead

60
4.7 本習題中我們仔細檢視一道指令是如何在單週期的資料通道中執行的。
習題中的各問題請參考擷取下列指令字當時的時脈週期:
1010 1100 0110 0010 0000 0000 0001 0100
假設資料記憶體中的資料全為0且處理器的暫存器在擷取上述指令的週期開
始時含有下列值:
r0 r1 r2 r3 r4 r5 r6 r8 r12 r31
0 -1 2 -3 -4 10 6 8 2 -16
4.7.1(10%)符號延伸及跳躍的「左移2」單元(位於圖4.24中的上方)對該指
令字所產出的輸出各為何?

4.7.2(5%)對該指令而言ALU控制單元的輸入值應為何?

4.7.3(5%)該指令執行後PC的新位址應為何?標示出決定這個PC新位置所需用
到的路徑。(按:這裡應是指在圖4.24中標示計算PC新位置所需用到的路徑。)

4.7.4(5%)在該指令執行期間以及上述暫存器內容值的情形下,則每一個Mux
的輸出值各為何?

4.7.5(5%)ALU及兩個加法單元的輸入值各為何?

61

圖4.24

62

You might also like