Chapter 4 (Part II) The Processor: Datapath and Control: (Enhancing Performance With Pipelining)

J.C. Chen (陳瑞奇)
Department of Computer Science and Information Engineering, Asia University (亞洲大學)
Adapted from class notes by Prof. M.J. Irwin, PSU and Prof. D. Patterson, UCB
Single Cycle vs. Multiple Cycle Timing

Single Cycle Implementation:
Clk     Cycle 1                    Cycle 2
        lw                         sw          Waste

Multiple Cycle Implementation:
Clk     Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9  Cycle 10
lw      IFetch   Dec      Exec     Mem      WB
sw                                                   IFetch   Dec      Exec     Mem
R-type                                                                                   IFetch

The multicycle clock cycle is longer than 1/5th of the single-cycle clock cycle, due to stage register overhead.
How Can We Make It Even Faster?
• 1. Split the multiple instruction cycle into smaller and smaller steps (stages)
• 2. Start fetching and executing the next instruction before the current one has completed
  - Pipelining: modern processors are pipelined for performance. Multiple instructions can be overlapped in execution.
  - Remember the performance equation: CPU time = CPI * CCT * IC (cycles per instruction * clock cycle time * instruction count)
• 3. Fetch (and execute) more than one instruction at a time (Parallel Processing; Superscalar)
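The performance equation can be tried out in a few lines of Python. This is only a sketch: the function name `cpu_time` and the instruction count are made up for illustration, the two clock cycle times are the 800 ps and 200 ps figures from the Pipeline Performance slide in this deck, and an ideal CPI of 1 is assumed for both designs.

```python
def cpu_time(cpi: float, cct_ps: float, ic: int) -> float:
    """CPU time = CPI * CCT * IC, returned in picoseconds."""
    return cpi * cct_ps * ic


# Hypothetical workload size, chosen only for illustration.
IC = 1_000_000

single_cycle = cpu_time(cpi=1.0, cct_ps=800.0, ic=IC)  # long clock cycle
pipelined = cpu_time(cpi=1.0, cct_ps=200.0, ic=IC)     # short clock cycle, ideal CPI
speedup = single_cycle / pipelined

print(f"single-cycle: {single_cycle / 1e6:.0f} us")
print(f"pipelined:    {pipelined / 1e6:.0f} us")
print(f"speedup:      {speedup:.1f}x")
```

With CPI and IC held equal, the ratio of CPU times reduces to the ratio of clock cycle times, which is why pipelining targets a short cycle while keeping CPI close to 1.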
Pipelining (管線化): the idea comes from the assembly-line conveyor belt
(Image: http://www.dtdsmt.com/upload/photo/ee99e86b50b43054bfd4217649475fba.jpg)
Pipelining #1
Pipelining #2
Pipelining: It's Natural!
• Laundry Example
• Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, fold, and organize
• Washer takes 30 minutes
• Dryer takes 30 minutes
• "Folder" takes 30 minutes
• "Closet" takes 30 minutes
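Sequential laundry takes 8 hours for 4 loads!
Pipelined laundry takes only 3.5 hours for 4 loads: 8/3.5 = 2.3 times faster.
For twenty loads, sequential laundry takes 40 hours and pipelined laundry takes 11.5 hours: 40/11.5 = 3.5 times faster.
Twenty loads (11.5 hrs) would take about 5.75 times as long as one load (2 hrs)!
p.261 (Chinese ed. p.271), Fig. 4.25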
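The laundry arithmetic generalizes to any number of loads. The sketch below assumes four 30-minute stages, as in the laundry example; the function names are illustrative and not part of the original slides.

```python
def sequential_hours(loads: int, stages: int = 4, stage_hours: float = 0.5) -> float:
    """Each load runs through every stage before the next load starts."""
    return loads * stages * stage_hours


def pipelined_hours(loads: int, stages: int = 4, stage_hours: float = 0.5) -> float:
    """The first load needs all stages; every later load finishes one stage time after the previous one."""
    return (stages + loads - 1) * stage_hours


for loads in (1, 4, 20):
    seq = sequential_hours(loads)
    pipe = pipelined_hours(loads)
    print(f"{loads:2d} loads: sequential {seq:4.1f} h, pipelined {pipe:4.1f} h, "
          f"speedup {seq / pipe:.2f}x")
```

The speedup grows with the number of loads and approaches the number of stages (4), but a single load still takes the full 2 hours.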
MIPS instructions classically take five stages
        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
lw      IFetch   Dec      Exec     Mem      WB
• Five stages, one step per stage
  - IFetch: Instruction Fetch
  - Dec: Instruction Decode and register file read
  - Exec: Execution or address calculation
  - Mem: data Memory access
  - WB: Write Back (to a register)
A Pipelined MIPS Processor

• Start the next instruction before the current one has completed
  - improves throughput: total amount of work done in a given time
  - instruction latency (execution time, delay time, response time: the time from the start of an instruction to its completion) is not reduced

        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8
lw      IFetch   Dec      Exec     Mem      WB
sw               IFetch   Dec      Exec     Mem      WB
R-type                    IFetch   Dec      Exec     Mem      WB

- clock cycle (pipeline stage time) is limited by the slowest stage
- for some instructions, some stages are wasted cycles
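The staggered diagram for this slide can be generated mechanically. The following is a rough sketch that assumes one instruction is issued per cycle, that every instruction passes through all five stages, and that there are no hazards; `schedule` and `render` are hypothetical helper names.

```python
STAGES = ["IFetch", "Dec", "Exec", "Mem", "WB"]


def schedule(instructions):
    """Map each instruction name to {cycle: stage}, issuing one instruction per cycle."""
    table = {}
    for issue_slot, name in enumerate(instructions):
        table[name] = {issue_slot + 1 + i: stage for i, stage in enumerate(STAGES)}
    return table


def render(table):
    """Format the schedule as a fixed-width text diagram."""
    last_cycle = max(max(row) for row in table.values())
    cycles = range(1, last_cycle + 1)
    lines = [" " * 8 + "".join(f"Cycle {c}".ljust(9) for c in cycles).rstrip()]
    for name, row in table.items():
        cells = "".join(row.get(c, "").ljust(9) for c in cycles)
        lines.append((name.ljust(8) + cells).rstrip())
    return "\n".join(lines)


table = schedule(["lw", "sw", "R-type"])
print(render(table))
```

Each instruction still occupies five cycles from fetch to write back (latency is unchanged), while a new instruction completes every cycle once the first one has reached WB (throughput improves).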
Single Cycle, Multiple Cycle, vs. Pipeline

Single Cycle Implementation:
Clk     Cycle 1                    Cycle 2
        lw                         sw          Waste

Multiple Cycle Implementation:
Clk     Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9  Cycle 10
lw      IFetch   Dec      Exec     Mem      WB
sw                                                   IFetch   Dec      Exec     Mem
R-type                                                                                   IFetch

Pipeline Implementation:
lw      IFetch   Dec      Exec     Mem      WB
sw               IFetch   Dec      Exec     Mem      WB
R-type                    IFetch   Dec      Exec     Mem      WB
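Counting cycles for the three-instruction sequence in the diagrams makes the comparison concrete. This is a sketch under stated assumptions: the multicycle counts for lw (5) and sw (4) are read off the diagram, the count for R-type (4) is an assumption because the diagram only shows its first cycle, and the pipeline is taken to be five stages deep with no stalls.

```python
# Cycles per instruction class in the multicycle design.
# lw and sw are read from the diagram; the R-type value is an assumption.
MULTICYCLE_CYCLES = {"lw": 5, "sw": 4, "R-type": 4}
PIPELINE_DEPTH = 5


def single_cycle_count(program):
    """One (long) clock cycle per instruction."""
    return len(program)


def multicycle_count(program):
    """Each instruction uses only the (short) cycles it needs, one instruction at a time."""
    return sum(MULTICYCLE_CYCLES[name] for name in program)


def pipelined_count(program):
    """Fill the pipeline once, then complete one instruction per cycle."""
    return PIPELINE_DEPTH + len(program) - 1 if program else 0


program = ["lw", "sw", "R-type"]
print("single cycle:", single_cycle_count(program), "long cycles")
print("multicycle:  ", multicycle_count(program), "short cycles")
print("pipelined:   ", pipelined_count(program), "short cycles")
```

The single-cycle count is not directly comparable to the other two, because its clock cycle must be long enough for the slowest instruction.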
Pipeline Performance (p.264, Chinese ed. p.273, Fig. 4.27)

Single-cycle (Tc = 800 ps)
Pipelined (Tc = 200 ps)
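The two clock cycle times translate into total execution time as sketched below. The assumptions are a five-stage pipeline with no stalls; the function names are illustrative.

```python
SINGLE_CYCLE_TC_PS = 800  # Tc from the slide
PIPELINED_TC_PS = 200     # Tc from the slide
DEPTH = 5                 # assumed number of pipeline stages


def single_cycle_time_ps(n: int) -> int:
    """n instructions, one 800 ps cycle each."""
    return n * SINGLE_CYCLE_TC_PS


def pipelined_time_ps(n: int) -> int:
    """DEPTH cycles for the first instruction, then one 200 ps cycle per additional instruction."""
    return (DEPTH + n - 1) * PIPELINED_TC_PS if n else 0


for n in (1, 3, 1_000_000):
    s, p = single_cycle_time_ps(n), pipelined_time_ps(n)
    print(f"n={n:>9,}: single-cycle {s:>13,} ps, pipelined {p:>13,} ps, ratio {s / p:.2f}")
```

For a single instruction the pipelined design is slower (5 x 200 ps against 800 ps), which is the latency point made earlier; the ratio approaches 800/200 = 4 only for long instruction streams.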
4.6 The single-cycle datapath from Fig. 4.17 (p.275, Chinese ed. p.285, Fig. 4.33)
• What do we need to add to actually split the datapath into stages?
MIPS Pipeline Datapath Modifications

• What do we need to add/modify in our MIPS datapath?
  - State registers between each pipeline stage to isolate them

Pipeline register   Between stages                        Width
IF/ID               IF (IFetch) and ID (Decode)           64 bits
ID/EX               ID (Decode) and EX (Execute)          128 bits
EX/MEM              EX (Execute) and MEM (MemAccess)      97 bits
MEM/WB              MEM (MemAccess) and WB (WriteBack)    64 bits

[Figure: pipelined datapath clocked by the system clock, with PC, Instruction Memory, two adders (PC + 4 and branch target), Register File, Sign Extend (16 to 32 bits), Shift left 2, ALU (with zero output), Data Memory, and the IF/ID (IFetch/Dec), ID/EX (Dec/Exec), EX/MEM (Exec/Mem), and MEM/WB (Mem/WB) registers. p.277 (Chinese ed. p.287), Fig. 4.35]
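The register widths quoted on this slide (64, 128, 97, and 64 bits) can be reproduced by adding up the values each register has to hold. The field breakdown below is one reading of the datapath figure, before any control bits or the destination register number are added; treat the field names as assumptions rather than as the slide's own labels.

```python
# Assumed contents of each pipeline register (data fields only), in bits.
PIPELINE_REGISTERS = {
    "IF/ID": {"PC + 4": 32, "instruction": 32},
    "ID/EX": {
        "PC + 4": 32,
        "read data 1": 32,
        "read data 2": 32,
        "sign-extended immediate": 32,
    },
    "EX/MEM": {
        "branch target": 32,
        "ALU zero flag": 1,
        "ALU result": 32,
        "read data 2 (store data)": 32,
    },
    "MEM/WB": {"memory read data": 32, "ALU result": 32},
}


def width(register: str) -> int:
    """Total width of a pipeline register in bits."""
    return sum(PIPELINE_REGISTERS[register].values())


for name in PIPELINE_REGISTERS:
    print(f"{name:7s} {width(name):3d} bits")
```

The odd 97-bit width of EX/MEM comes from the single-bit zero output of the ALU that the branch decision needs in the MEM stage.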
Graphically Representing Pipelines (p.276, Chinese ed. p.286, Fig. 4.34)
[Figure annotations: "Totally used"; "Not used"; R R R W]
The first pipe stage (IF) of an instruction (Load): Instruction fetch
p.279 (Chinese ed. p.288), Fig. 4.36 top
[Figure annotations: IF; R; System clock (Write)]
The second pipe stage (ID) of an instruction (Load): Instruction decode
p.279 (Chinese ed. p.288), Fig. 4.36 bottom
[Figure annotations: System clock (Write); Dec Write; decoding; decoded]
The third pipe stage (EX) of an instruction (Load): Execution
p.280 (Chinese ed. p.289), Fig. 4.37
[Figure annotations: Dec Write; EX Write; decoded; effective address]
The fourth pipe stage (MEM) of an instruction (Load): Memory
p.281 (Chinese ed. p.290), Fig. 4.38 top
[Figure annotations: EX Write; MEM Write; effective address]
The fifth pipe stage (WB) of an instruction: Write back
p.281 (Chinese ed. p.290), Fig. 4.38 bottom
[Figure annotations: System clock; MEM]

The third pipe stage (EX) of an instruction (store): Execution
p.282 (Chinese ed. p.292), Fig. 4.39
[Figure annotations: Dec Write; EX Write; effective address; store]
The fourth pipe stage (MEM) of an instruction (store): Memory
p.283 (Chinese ed. p.293), Fig. 4.40 top
[Figure annotations: EX Write; MEM Write; effective address]
The fifth pipe stage (WB) of an instruction (store): Write back
p.283 (Chinese ed. p.293), Fig. 4.40 bottom
[Figure annotations: MEM Write; decoded]
Corrected Datapath to Save RegWrite Addr

• Need to preserve the destination register address in the pipeline state registers
[Figure: pipelined datapath with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]
Afraid of losing your keys? What can you do?
(Image: http://sy.police.taipei/)
Put them in your pocket and carry them with you!
Corrected Datapath to Save RegWrite Addr (p.284, Chinese ed. p.294, Fig. 4.41)

• Need to preserve the destination register address in the pipeline state registers
[Figure: the same pipelined datapath, with the rt/rd destination register number marked on the path to the register file's Write Addr input]
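Why the destination register number has to travel with the instruction can be seen in a toy simulation. This is a deliberately simplified sketch: each pipeline register holds only an instruction index and its destination register number, instructions enter one per cycle, and the uncorrected design is modeled as taking Write Addr from whatever currently sits in IF/ID. The function name `simulate` and the register numbers are hypothetical.

```python
def simulate(dest_regs, carry_dest=True):
    """Return (instruction index, Write Addr seen in WB) for each instruction.

    carry_dest=True : the destination number is passed through ID/EX, EX/MEM, MEM/WB.
    carry_dest=False: Write Addr comes from the instruction currently held in IF/ID.
    """
    if_id = id_ex = ex_mem = mem_wb = None  # each holds (index, dest) or None (bubble)
    writes = []
    stream = list(enumerate(dest_regs)) + [None] * 4  # bubbles drain the pipeline
    for fetched in stream:
        # Clock edge: every pipeline register takes the value to its left.
        mem_wb, ex_mem, id_ex, if_id = ex_mem, id_ex, if_id, fetched
        if mem_wb is not None:
            index, carried_dest = mem_wb
            if carry_dest:
                write_addr = carried_dest
            else:
                write_addr = if_id[1] if if_id is not None else None
            writes.append((index, write_addr))
    return writes


dests = [8, 9, 10, 11, 12]
print("corrected:  ", simulate(dests, carry_dest=True))
print("uncorrected:", simulate(dests, carry_dest=False))
```

In the uncorrected case, instruction 0 reaches WB while IF/ID already holds instruction 3, so the write would land in the wrong register; carrying the number along (keys in the pocket) avoids this.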
Corrected Datapath for Load (p.284, Chinese ed. p.294, Fig. 4.41)
Why Pipeline? For Performance!

Time (clock cycles) runs left to right; instruction order (Instr. Order) runs top to bottom.

Inst 0   IM   Reg  ALU  DM   Reg
Inst 1        IM   Reg  ALU  DM   Reg
Inst 2             IM   Reg  ALU  DM   Reg
Inst 3                  IM   Reg  ALU  DM   Reg
Inst 4                       IM   Reg  ALU  DM   Reg
(next)                            IM   Reg  ALU  DM   Reg

• The initial cycles are the time to fill the pipeline
• Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
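How quickly the effective CPI approaches 1 can be checked numerically. The sketch assumes a five-stage pipeline with no stalls; `total_cycles` and `effective_cpi` are illustrative names.

```python
def total_cycles(n: int, depth: int = 5) -> int:
    """Cycles to run n instructions: depth - 1 fill cycles plus one cycle per instruction."""
    return depth + n - 1 if n > 0 else 0


def effective_cpi(n: int, depth: int = 5) -> float:
    """Average cycles per instruction, including the time to fill the pipeline."""
    return total_cycles(n, depth) / n


for n in (1, 5, 100, 10_000):
    print(f"n={n:>6}: {total_cycles(n):>6} cycles, effective CPI {effective_cpi(n):.4f}")
```

The fill cost is paid once, so it matters for short sequences and becomes negligible for long ones.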
Multiple-clock-cycle pipeline diagram (p.286, Chinese ed. p.296, Fig. 4.43)
[Figure annotations: "Totally used"; "Not used"; R R R W]
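The single-clock-cycle pipeline diagram (p.287, Chinese ed. p.297, Fig. 4.45)
[Figure: snapshot of the datapath at Clock 5, with instructions 5, 4, 3, 2, 1 occupying the five stages from left to right; the rt/rd path is marked]

Includes control lines (p.289, Chinese ed. p.299, Fig. 4.46)
• 4 control lines for EXE
• 3 control lines for MEM
• 2 control lines for WB
[Figure annotations: rt; rd]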
The legendary Mazinger Z (無敵鐵金剛)
http://www.hbyty.com/images/product_images/info_images/001EYD000001-2.jpg
http://www.chara-net.com/images-item-big/ref4-7522.jpg
https://www.youtube.com/watch?v=Ojzp_zv5dwg
[Figure annotations: decoding; 4 control lines for EXE; 3 control lines for MEM; 2 control lines for WB. p.291 (Chinese ed. p.301), Fig. 4.50]
https://farm4.staticflickr.com/3691/13534389474_87e00fdb79.jpg
http://pic.pimg.tw/cgboy26/1404897101-3238112107.jpg
Pipelined Control (p.292, Chinese ed. p.302, Fig. 4.51)
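The 4/3/2 split of control lines across EXE, MEM, and WB can be written down as a small table. The signal names and values below are not given on the slides; they follow the usual textbook presentation of the MIPS pipeline as best recalled, so treat them as assumptions and check them against Fig. 4.51. "X" marks a don't-care. The helper `control_in_register` is hypothetical.

```python
EX_SIGNALS = ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc")  # 4 lines used in EXE
MEM_SIGNALS = ("Branch", "MemRead", "MemWrite")        # 3 lines used in MEM
WB_SIGNALS = ("RegWrite", "MemtoReg")                  # 2 lines used in WB

# Assumed control values per instruction class, one character per signal.
CONTROL = {
    "R-format": {"EX": "1100", "MEM": "000", "WB": "10"},
    "lw":       {"EX": "0001", "MEM": "010", "WB": "11"},
    "sw":       {"EX": "X001", "MEM": "001", "WB": "0X"},
    "beq":      {"EX": "X010", "MEM": "100", "WB": "0X"},
}

# Control generated at decode travels with the instruction; each register
# keeps only the groups that later stages still need.
GROUPS_HELD = {
    "ID/EX": ("EX", "MEM", "WB"),
    "EX/MEM": ("MEM", "WB"),
    "MEM/WB": ("WB",),
}


def control_in_register(instr: str, register: str) -> dict:
    """Control bits still carried for `instr` in the given pipeline register."""
    return {group: CONTROL[instr][group] for group in GROUPS_HELD[register]}


for reg in GROUPS_HELD:
    bits = control_in_register("lw", reg)
    print(f"{reg:7s} {bits}  ({sum(len(v) for v in bits.values())} control bits)")
```

Under these assumptions the control portion of the pipeline registers shrinks from 9 bits in ID/EX to 5 in EX/MEM and 2 in MEM/WB.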
The BIG Picture
Pipeline Summary
• All modern day processors use pipelining
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• Potential speedup: a CPI of 1 and a fast clock cycle (CC)
• Pipeline rate is limited by the slowest pipeline stage
  - Unbalanced pipe stages make for inefficiencies
• Must detect and resolve hazards
  - Stalling negatively affects CPI (makes CPI greater than the ideal of 1)
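The effect of stalls on CPI can be quantified with a one-line formula: effective CPI is the ideal CPI plus the average number of stall cycles per instruction. The numbers below are hypothetical and only illustrate the direction of the effect.

```python
def cpi_with_stalls(stall_cycles: int, instructions: int, ideal_cpi: float = 1.0) -> float:
    """Effective CPI = ideal CPI + stall cycles per instruction."""
    return ideal_cpi + stall_cycles / instructions


# Hypothetical workload: 100 instructions, 25 of which each cause one stall cycle.
cpi = cpi_with_stalls(stall_cycles=25, instructions=100)
print(f"effective CPI: {cpi:.2f}")
print(f"slowdown vs. ideal: {cpi / 1.0:.2f}x")
```

Because CPU time = CPI * CCT * IC, any increase in CPI from stalls directly lengthens execution time at a fixed clock cycle.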
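Homework 3: Exercises for the first half of Chapter 4 (Due in 2 weeks)

4.1 Consider the following instruction:
Instruction: AND Rd,Rs,Rt
Interpretation: Reg[Rd] = Reg[Rs] AND Reg[Rt]
4.1.1 (15%) What are the values of the control signals generated by the control unit in Figure 4.2 for the above instruction?
4.1.2 (5%) Which resources (blocks) perform a useful function for this instruction?
4.1.3 (5%) Which resources (blocks) produce outputs that are not used by this instruction? Which resources (blocks) produce no output for this instruction?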
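4.4 The problems in this exercise assume that the logic blocks used to implement a processor's datapath have the following latencies:

I-Mem    Add     Mux     ALU     Regs    D-Mem    Sign-Extend    Shift-Left-2
200ps    70ps    20ps    90ps    90ps    250ps    15ps           10ps

4.4.1 (10%) If the only thing we need to do in a processor is fetch consecutive instructions (Figure 4.6), what would the clock cycle time be?
4.4.2 (10%) Consider a datapath similar to the one in Figure 4.11, but for a processor that has only one type of instruction: the unconditional PC-relative branch. What would the cycle time be for this datapath?
4.4.3 (10%) Repeat 4.4.2, but this time we need to support only conditional PC-relative branches.
The remaining three problems in this exercise concern the Shift-left-2 element of the datapath:
4.4.4 (5%) Which instructions require this resource?
4.4.5 (5%) For which kinds of instructions (if any) is this resource on the critical path?
4.4.6 (5%) Assuming that we support only the beq and add instructions, discuss how changes in the latency of this resource affect the cycle time of the processor. Assume that the latencies of the other resources do not change.

[Figure 4.6: the PC drives the Read Address input of the Instruction Memory, which outputs the Instruction; an adder (Add) increments the PC by 4]
[Figure 4.11: datapath with the control signals PCSrc, ALUSrc, ALU operation, MemWrite, MemtoReg, RegWrite, and MemRead]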
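4.7 In this exercise we examine in detail how an instruction is executed in a single-cycle datapath. The problems in this exercise refer to the clock cycle in which the processor fetches the following instruction word:
1010 1100 0110 0010 0000 0000 0001 0100
Assume that data memory is all zeros and that the processor's registers have the following values at the beginning of the cycle in which the above instruction word is fetched:

r0    r1    r2    r3    r4    r5    r6    r8    r12    r31
0     -1    2     -3    -4    10    6     8     2      -16

4.7.1 (10%) What are the outputs of the sign-extend unit and of the jump "Shift left 2" unit (near the top of Figure 4.24) for this instruction word?
4.7.2 (5%) What are the values of the ALU control unit's inputs for this instruction?
4.7.3 (5%) What is the new PC address after this instruction is executed? Highlight the path through which this value is determined. (Note: this presumably means marking, in Figure 4.24, the path used to compute the new PC.)
4.7.4 (5%) For each Mux, what is its output value during the execution of this instruction with the register values given above?
4.7.5 (5%) What are the input values of the ALU and of the two add units?
[Figure 4.24]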
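A first step for exercise 4.7 is splitting the 32-bit instruction word into its fields. The sketch below assumes the standard MIPS I-format layout (6-bit opcode, 5-bit rs, 5-bit rt, 16-bit immediate); the function name `decode_i_format` is illustrative, and the sub-questions themselves are left to the reader.

```python
def decode_i_format(word: int) -> dict:
    """Split a 32-bit MIPS I-format word into opcode, rs, rt, and a sign-extended immediate."""
    imm = word & 0xFFFF
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        # Interpret the 16-bit field as a two's-complement value.
        "imm": imm - 0x10000 if imm & 0x8000 else imm,
    }


# The instruction word from exercise 4.7, written in binary with spaces removed.
word = int("1010 1100 0110 0010 0000 0000 0001 0100".replace(" ", ""), 2)
fields = decode_i_format(word)

print(f"word   = 0x{word:08X}")
print(f"opcode = {fields['opcode']:06b}")
print(f"rs     = r{fields['rs']}")
print(f"rt     = r{fields['rt']}")
print(f"imm    = {fields['imm']}")
```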
