J.C. Chen (陳瑞奇)
Department of Computer Science and Information Engineering, Asia University
[Timing diagram, single cycle clock: Clk, Cycle 1 to Cycle 2; lw, sw, Waste]
Multiple Cycle Implementation:
[Timing diagram: Clk, Cycle 1 to Cycle 10; lw: IFetch Dec Exec Mem WB; sw: IFetch Dec Exec Mem; R-type: IFetch ...]
The multicycle clock is slower than 1/5th of the single cycle clock due to stage register overhead.
How Can We Make It Even Faster?
1. Split the multiple instruction cycle into smaller and
smaller steps (stages)
Pipelining: the idea comes from the conveyor belt of a production line
http://www.dtdsmt.com/upload/photo/ee99e86b50b43054bfd4217649475fba.jpg
Pipelining #1
Pipelining #2
Pipelining: It’s Natural!
Laundry Example
Ann, Brian, Cathy, and Dave (A, B, C, D) each have one load of
clothes to wash, dry, fold, and organize
Washer takes 30 minutes
Dryer takes 30 minutes
“Folder” takes 30 minutes
“Closet” takes 30 minutes
Pipelined laundry
Only takes 3.5 hours for 4 loads! That is 8/3.5 ≈ 2.3 times faster than washing the loads one after another.
Twenty loads (11.5 hrs) would take about 5.75 times as long as one load (2 hrs)! That is 40/11.5 ≈ 3.5 times faster.
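The laundry numbers can be checked with a short calculation. This is a minimal sketch that assumes every stage takes exactly 30 minutes and that a new load enters the washer as soon as it is free; the function names are illustrative only.

```python
STAGE_MINUTES = 30   # washer, dryer, "folder", "closet": 30 minutes each
NUM_STAGES = 4

def sequential_hours(loads):
    """Each load finishes all four stages before the next load starts."""
    return loads * NUM_STAGES * STAGE_MINUTES / 60

def pipelined_hours(loads):
    """The first load needs all four stages; each later load adds one stage time."""
    return (NUM_STAGES + loads - 1) * STAGE_MINUTES / 60

for loads in (1, 4, 20):
    seq = sequential_hours(loads)
    pipe = pipelined_hours(loads)
    print(f"{loads:2d} loads: sequential {seq:4.1f} h, "
          f"pipelined {pipe:4.1f} h, speedup {seq / pipe:.2f}x")
```

The speedup approaches the number of stages (4) only as the number of loads grows, because the time to fill the pipeline is paid once.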
p.261 (Chinese ed. p.271), Fig. 4.25
MIPS instructions classically take five stages
[Timing diagram, single cycle clock: Cycle 1 to Cycle 5; lw, sw, Waste]
[Timing diagram, multicycle clock: Clk, Cycle 1 to Cycle 10; lw: IFetch Dec Exec Mem WB; sw: IFetch Dec Exec Mem; R-type: IFetch ...]
Pipeline Implementation:
[Timing diagram: lw: IFetch Dec Exec Mem WB]
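The three implementations can be compared by counting clock cycles for a short instruction sequence. The sketch below assumes the stage lists shown in the timing diagram (lw uses five stages, sw uses four), a five-stage pipeline, and no hazards; the helper names are hypothetical. Note that a single-cycle "cycle" is much longer than a multicycle or pipelined one, so the counts are not directly comparable as times.

```python
# Stages each instruction uses in the multicycle design
# (taken from the timing diagram: lw uses all five, sw skips WB).
STAGES = {
    "lw": ["IFetch", "Dec", "Exec", "Mem", "WB"],
    "sw": ["IFetch", "Dec", "Exec", "Mem"],
}
PIPELINE_DEPTH = 5  # assumption: every instruction flows through five pipe stages

def single_cycle_cycles(program):
    """One (long) clock cycle per instruction."""
    return len(program)

def multicycle_cycles(program):
    """Each instruction takes as many (short) cycles as it has stages."""
    return sum(len(STAGES[op]) for op in program)

def pipelined_cycles(program):
    """Fill the pipeline once, then finish one instruction per cycle."""
    return PIPELINE_DEPTH + len(program) - 1 if program else 0

program = ["lw", "sw"]
print("single cycle:", single_cycle_cycles(program), "long cycles")
print("multicycle:  ", multicycle_cycles(program), "short cycles")
print("pipelined:   ", pipelined_cycles(program), "short cycles")
```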
4.6 The single-cycle datapath from Fig. 4.17
p.275 (Chinese ed. p.285), Fig. 4.33
• What do we need to add to actually split the datapath into stages?
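[Figure: datapath split into stages by the stage registers Dec/Exec, Exec/Mem, and Mem/WB, all driven by the System Clock; units shown include PC, Instruction Memory, Register File, Sign Extend (16 to 32 bits), and Data Memory]
p.277 (Chinese ed. p.287), Fig. 4.35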
Graphically Representing Pipelines
p.276 (Chinese ed. p.286), Fig. 4.34
[Figure legend: Totally used / Not used; unit labels R, R, R, W]
(Instruction fetch)
p.279 (Chinese ed. p.288), Fig. 4.36 top
[Figure: IF stage; labels: System clock (Write), IF, R]
The second pipe stage (ID) of an instruction (Load)
(Instruction decode)
p.279 (Chinese ed. p.288), Fig. 4.36 bottom
[Figure: ID stage; labels: System clock (Write), Dec Write, decoding, decoded]
(Execution)
p.280 (Chinese ed. p.289), Fig. 4.37
[Figure: EX stage; labels: Dec Write, EX Write, decoded, effective address]
The fourth pipe stage (MEM) of an instruction (Load)
(Memory)
p.281 (Chinese ed. p.290), Fig. 4.38 top
[Figure: MEM stage; labels: EX write, MEM write, effective address]
(Execution)
p.282 (Chinese ed. p.292), Fig. 4.39
[Figure: EX stage of a store; labels: Dec Write, EX Write, effective address, store]
[Figure: label: effective address]
The fifth pipe stage (WB) of an instruction (store)
(Write back)
p.283 (Chinese ed. p.293), Fig. 4.40 bottom
[Figure: WB stage; labels: MEM Write, decoded]
[Figure: pipelined datapath; units shown: PC, Add (+4), Shift left 2, Add, Instruction Memory (Read Addr), Register File (Read Addr 1, Read Addr 2, Read Data 1), Sign Extend (16 to 32 bits), Data Memory, MEM/WB]
Afraid of losing your keys? What can you do?
http://sy.police.taipei/
Put them in your pocket and carry them with you!
[Figure: the same pipelined datapath with the rt/rd signal marked]
p.284 (Chinese ed. p.294), Fig. 4.41
Corrected Datapath for Load
p.284 (Chinese ed. p.294), Fig. 4.41
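The "keys in your pocket" idea behind the corrected datapath can be illustrated with a small simulation. This is a sketch under stated assumptions, not the textbook's circuit: the program is a list of loads identified only by their destination register number (rt), the pipeline registers are modeled as Python variables, and `run` is a hypothetical helper. Without carrying rt along, the write-back stage would read the register number from the IF/ID register, which by then belongs to a later instruction.

```python
def run(program, carry_along):
    """Simulate how the write register number reaches the WB stage.

    program: list of destination register numbers (rt), one per load.
    Returns {instruction index: register number used at write back}.
    """
    # Each pipeline register holds (instruction index, rt) or None.
    if_id = id_ex = ex_mem = mem_wb = None
    written = {}
    for cycle in range(len(program) + 4):
        # WB stage: uses the pipeline register contents of this cycle.
        if mem_wb is not None:
            index, carried_rt = mem_wb
            if carry_along:
                # Corrected datapath: rt travelled with the load.
                written[index] = carried_rt
            else:
                # Uncorrected datapath: rt is read from IF/ID, which now
                # holds a later instruction (or nothing at all).
                written[index] = if_id[1] if if_id is not None else None
        # Clock edge: every instruction moves one stage forward.
        mem_wb, ex_mem, id_ex = ex_mem, id_ex, if_id
        if_id = (cycle, program[cycle]) if cycle < len(program) else None
    return written

loads = [8, 9, 10, 11]          # hypothetical rt values of four lw instructions
print("carried along:", run(loads, carry_along=True))
print("not carried:  ", run(loads, carry_along=False))
```

In the uncorrected run the first load writes into the register named by the fourth instruction, which is exactly the bug the corrected datapath removes.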
Once the
[Figure: pipeline stage symbols IM, Reg, ALU, DM, Reg]
Multiple-clock-cycle pipeline diagram
[Figure legend: Totally used / Not used; unit labels R, R, R, W]
[Figure: pipeline contents at Clock 5, with labels 1 to 5 and rt/rd marked]
p.287 (Chinese ed. p.297), Fig. 4.45
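A multiple-clock-cycle pipeline diagram can be generated mechanically: instruction i occupies stage s in clock cycle i + s. The sketch below prints such a diagram as text, using the stage symbols IM, Reg, ALU, DM, Reg; the instruction names and the function `pipeline_diagram` are illustrative assumptions, and hazards are ignored.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]

def pipeline_diagram(instructions):
    """Return text rows: one instruction per row, one clock cycle per column."""
    total_cycles = len(instructions) + len(STAGES) - 1
    header = f"{'':8}" + "".join(
        f"{'CC' + str(c + 1):<5}" for c in range(total_cycles))
    rows = [header.rstrip()]
    for i, name in enumerate(instructions):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage  # instruction i is in stage s during cycle i + s
        rows.append((f"{name:<8}" + "".join(f"{c:<5}" for c in cells)).rstrip())
    return rows

for row in pipeline_diagram(["lw", "sw", "add"]):
    print(row)
```

Reading one column of the output gives the single-clock-cycle view: which instruction occupies which unit in that cycle.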
Includes control lines: 4 control lines for EXE, 3 control lines for MEM, 2 control lines for WB
[Figure labels: rt, rd]
p.289 (Chinese ed. p.299), Fig. 4.46
The legendary Mazinger Z (無敵鐵金剛)
http://www.hbyty.com/images/product_images/info_images/001EYD000001-2.jpg
http://www.chara-net.com/images-item-big/ref4-7522.jpg
https://www.youtube.com/watch?v=Ojzp_zv5dwg
[Figure: decoding produces 4 control lines for EXE, 3 control lines for MEM, and 2 control lines for WB]
http://pic.pimg.tw/cgboy26/1404897101-3238112107.jpg
Pipelined Control
p.292 (Chinese ed. p.302), Fig. 4.51
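The 4/3/2 split of control lines can be sketched as data that is produced once at decode and then shed stage by stage. The grouping (4 for EXE, 3 for MEM, 2 for WB) comes from the slides; the individual signal names and the lw/sw values below follow the usual textbook MIPS convention and are an assumption here, as is the helper `control_in_pipeline_registers`. `None` stands for a don't-care value.

```python
# Control values generated during decoding, grouped by the stage that uses them.
# Signal names and values are assumed from the standard MIPS control table.
CONTROL = {
    "lw": {
        "EX":  {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
        "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
        "WB":  {"RegWrite": 1, "MemtoReg": 1},
    },
    "sw": {
        "EX":  {"RegDst": None, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
        "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
        "WB":  {"RegWrite": 0, "MemtoReg": None},
    },
}

def control_in_pipeline_registers(op):
    """Show which control groups each pipeline register still carries."""
    signals = CONTROL[op]
    id_ex = {g: signals[g] for g in ("EX", "MEM", "WB")}  # all lines leave decode
    ex_mem = {g: id_ex[g] for g in ("MEM", "WB")}         # EX group used up in EX
    mem_wb = {g: ex_mem[g] for g in ("WB",)}              # MEM group used up in MEM
    return {"ID/EX": id_ex, "EX/MEM": ex_mem, "MEM/WB": mem_wb}

for reg, groups in control_in_pipeline_registers("lw").items():
    count = sum(len(lines) for lines in groups.values())
    print(f"{reg}: {count} control lines, groups {list(groups)}")
```

Like the register number in the load example, the control values have to travel with their instruction, which is why each pipeline register is widened to hold the groups still needed downstream.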
The BIG Picture
Pipeline Summary
All modern processors use pipelining
Pipelining doesn’t help the latency of a single task; it
helps the throughput of the entire workload
Potential speedup: a CPI of 1 and a fast CC (clock cycle)
Pipeline rate is limited by the slowest pipeline stage
Unbalanced pipe stages make for
inefficiencies
Must detect and resolve hazards
Stalling negatively affects CPI (makes CPI
greater than the ideal of 1)
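The summary points about the slowest stage and about stalls can be made concrete with a small calculation. All numbers below are hypothetical stage latencies chosen for illustration, the function names are not from the slides, and pipeline fill time is ignored.

```python
def pipeline_clock_ps(stage_latencies_ps, register_overhead_ps=0):
    """The clock period is set by the slowest stage plus stage-register overhead."""
    return max(stage_latencies_ps) + register_overhead_ps

def effective_cpi(stall_cycles_per_instruction):
    """Ideal pipelined CPI is 1; every stall cycle adds to it."""
    return 1.0 + stall_cycles_per_instruction

def speedup_over_single_cycle(stage_latencies_ps, stall_cycles_per_instruction):
    """Single-cycle period = sum of all stages; pipelined time = CPI * clock."""
    single_cycle_ps = sum(stage_latencies_ps)
    pipelined_ps = effective_cpi(stall_cycles_per_instruction) * pipeline_clock_ps(
        stage_latencies_ps)
    return single_cycle_ps / pipelined_ps

stages = [200, 100, 200, 200, 100]   # hypothetical, unbalanced stage latencies in ps
print("clock period:", pipeline_clock_ps(stages), "ps")
print("speedup, no stalls:", speedup_over_single_cycle(stages, 0.0))
print("speedup, 0.25 stall cycles per instruction:",
      speedup_over_single_cycle(stages, 0.25))
```

With these assumed numbers the five-stage pipeline gives a speedup of 4 rather than 5 because the stages are unbalanced, and stalls reduce it further.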
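4.4 The problems in this exercise assume that the logic blocks needed to implement a processor's datapath have the following latencies:
I-Mem   Add    Mux    ALU    Regs   D-Mem   Sign-Extend   Shift-Left-2
200ps   70ps   20ps   90ps   90ps   250ps   15ps          10ps
4.4.1 (10%) If the only thing we need to do in a processor is fetch consecutive instructions (Figure 4.6), what would the clock cycle time be?
4.4.2 (10%) Consider a datapath similar to the one in Figure 4.11, but for a processor that has only one type of instruction: the unconditional PC-relative branch. What would the cycle time be for this datapath?
4.4.3 (10%) Repeat 4.4.2, but this time we need to support only conditional PC-relative branches.
The remaining three problems in this exercise concern the Shift-left-2 element of the datapath:
4.4.4 (5%) Which instructions require this resource?
4.4.5 (5%) For which kinds of instructions (if any) is this resource on the critical path?
4.4.6 (5%) Assuming that we support only the beq and add instructions, discuss how changes in the latency of this resource affect the cycle time of the processor. Assume that the latencies of the other resources do not change.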
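One way to approach these cycle-time questions is to list the paths that signals must traverse within one clock cycle and take the slowest. The sketch below applies this to the fetch-only datapath of Figure 4.6 only, assuming two parallel paths from the PC (through I-Mem, and through the Add unit) and ignoring register setup and clock-to-output times; the path lists and the helper `cycle_time_ps` are assumptions for illustration.

```python
# Latencies from the exercise table, in picoseconds.
LATENCY_PS = {
    "I-Mem": 200, "Add": 70, "Mux": 20, "ALU": 90,
    "Regs": 90, "D-Mem": 250, "Sign-Extend": 15, "Shift-Left-2": 10,
}

def cycle_time_ps(paths):
    """Cycle time = delay of the slowest path; a path is a list of blocks in series."""
    return max(sum(LATENCY_PS[block] for block in path) for path in paths)

# Assumed paths for Figure 4.6: PC -> I-Mem, and PC -> Add (PC + 4) -> PC.
fetch_only_paths = [["I-Mem"], ["Add"]]
print("fetch-only cycle time:", cycle_time_ps(fetch_only_paths), "ps")
```

The same helper can be reused for the other parts by writing down the paths of the relevant figure, which is left to the reader.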
[Figure 4.6: PC, Instruction Memory (Read Address, Instruction), and an Add unit with constant 4]
[Figure 4.11: datapath with control signals PCSrc, RegWrite, MemRead]
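4.7 In this exercise we examine in detail how an instruction is executed in a single-cycle datapath.
The problems in this exercise refer to the clock cycle in which the following instruction word is fetched:
1010 1100 0110 0010 0000 0000 0001 0100
Assume that data memory is all zeros and that the processor's registers hold the following values at the beginning of the cycle in which the above instruction is fetched:
r0   r1   r2   r3   r4   r5   r6   r8   r12   r31
0    -1   2    -3   -4   10   6    8    2     -16
4.7.1 (10%) What are the outputs of the sign-extend unit and of the jump "shift left 2" unit (near the top of Figure 4.24) for this instruction word?
4.7.2 (5%) What are the values of the ALU control unit's inputs for this instruction?
4.7.3 (5%) What is the new PC address after this instruction is executed? Highlight the path through which this new PC value is determined. (Note: this presumably means marking, in Figure 4.24, the path used to compute the new PC.)
4.7.4 (5%) For each Mux, what is its output value during the execution of this instruction with the register values above?
4.7.5 (5%) What are the input values of the ALU and of the two add units?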
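A useful first step for this exercise is to split the instruction word into its MIPS fields. The sketch below does only that mechanical decoding, assuming the standard 32-bit MIPS I-format layout (6-bit opcode, 5-bit rs, 5-bit rt, 16-bit immediate) and the jump unit taking the low 26 bits; the function `decode` and its dictionary keys are illustrative names.

```python
WORD = "1010 1100 0110 0010 0000 0000 0001 0100"

def decode(word_bits):
    """Split a 32-bit MIPS instruction word (given as a bit string) into fields."""
    value = int(word_bits.replace(" ", ""), 2)
    imm16 = value & 0xFFFF
    return {
        "opcode": (value >> 26) & 0x3F,       # bits 31..26
        "rs": (value >> 21) & 0x1F,           # bits 25..21
        "rt": (value >> 16) & 0x1F,           # bits 20..16
        "imm16": imm16,                       # bits 15..0
        # Sign extension: treat bit 15 as the sign bit.
        "sign_extended": imm16 - 0x10000 if imm16 & 0x8000 else imm16,
        # Jump "shift left 2": low 26 bits shifted left by two (28-bit result).
        "jump_shift_left_2": (value & 0x03FFFFFF) << 2,
    }

fields = decode(WORD)
for name, val in fields.items():
    print(f"{name:18} = {val} (0x{val:X})")
```

Looking up the opcode value in the MIPS opcode table identifies the instruction, after which the register values given above determine the remaining answers.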
[Figure 4.24]