Lecture 3: Memory Hierarchy Design (Chapter 2, Appendix B)
Chih-Wei Liu
National Chiao Tung University
cwliu@twins.ee.nctu.edu.tw
Introduction
Since 1980, CPU speed has outpaced DRAM:
- CPU: ~60% per year (2x in 1.5 years)
- DRAM: ~9% per year (2x in 10 years)
- The processor-memory gap has grown about 50% per year
CA-Lec3 cwliu@twins.ee.nctu.edu.tw
Introduction
Programmers want unlimited amounts of memory with low latency.
Fast memory technology is more expensive per bit than slower memory.
Solution: organize the memory system into a hierarchy:
- The entire addressable memory space is available in the largest, slowest memory
- Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
Temporal and spatial locality ensure that nearly all references can be found in the smaller memories.
This gives the illusion of a large, fast memory being presented to the processor.
Memory Hierarchy Design
Memory hierarchy design becomes more crucial with recent multicore processors:
Aggregate peak bandwidth grows with the number of cores:
- An Intel Core i7 can generate two references per core per clock
- With four cores and a 3.2 GHz clock:
  25.6 billion 64-bit data references/second
  + 12.8 billion 128-bit instruction references/second
  = 409.6 GB/s!
- DRAM bandwidth is only 6% of this (25 GB/s)
Requires:
- Multiport, pipelined caches
- Two levels of cache per core
- A shared third-level cache on chip
Memory Hierarchy
Take advantage of the principle of locality to:
- Present as much memory as in the cheapest technology
- Provide access at the speed offered by the fastest technology
[Figure: the memory hierarchy. The processor (control, datapath, registers) is backed by an on-chip cache, a second-level cache (SRAM), main memory (DRAM/Flash/PCM), secondary storage (disk/Flash/PCM), and tertiary storage (tape/cloud storage). Access times grow from about 1 ns at the top, through 10s-100s ns at the caches and 100s ns at main memory, to 10,000,000s ns (10s of ms) at secondary storage and 10,000,000,000s ns (10s of seconds) at tertiary storage; capacities grow from KBs-MBs at the caches to MBs-GBs at main memory, GBs at disk, and TBs at tertiary storage.]
Multicore Architecture
[Figure: multiple processing nodes, each containing a CPU, connected by an interconnection network.]
The Principle of Locality
- Programs access a relatively small portion of the address space at any instant of time.
- Two different types of locality:
  - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  - Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
- Hardware relies on locality for speed.
Memory Hierarchy Basics
When a word is not found in the cache, a miss occurs:
- Fetch the word from the lower level in the hierarchy, requiring a higher-latency reference
  - The lower level may be another cache or main memory
- Also fetch the other words contained within the block
  - Takes advantage of spatial locality
- Place the block into the cache in any location within its set, determined by the address:
  (block address) MOD (number of sets)
Hit and Miss
- Hit: the data appears in some block in the upper level (e.g., Block X)
  - Hit rate: the fraction of memory accesses found in the upper level
  - Hit time: time to access the upper level, consisting of RAM access time + time to determine hit/miss
- Miss: the data needs to be retrieved from a block in the lower level (Block Y)
  - Miss rate = 1 - (hit rate)
  - Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor
- Hit time << miss penalty (500 instructions on the Alpha 21264!)
[Figure: on a hit, Block X moves between the upper-level memory and the processor; on a miss, Block Y is brought up from the lower-level memory.]
Cache Performance Formulas

(Average memory access time) = (Hit time) + (Miss rate) x (Miss penalty)

Important: the miss penalty is measured from the cache down to the lower levels of the hierarchy.
[Figure: CPU and cache, with the miss penalty spanning the path from the cache to the lower levels of the hierarchy.]
Four Questions for Memory Hierarchy
Consider any level in a memory hierarchy; remember that a block is the unit of data transfer between the given level and the levels below it.
The level's design is described by four behaviors:
- Block placement: where can a new block be placed in the level?
- Block identification: how is a block found if it is in the level?
- Block replacement: which existing block should be replaced if necessary?
- Write strategy: how are writes to the block handled?
Q1: Where can a block be placed in the upper level?
Block 12 placed in an 8-block cache:
- Fully associative (fully mapped): block 12 can go into any of the 8 frames
- Direct mapped: block 12 can go only into frame (12 mod 8) = 4
- 2-way set associative: block 12 can go anywhere in set (12 mod 4) = 0
Set-associative mapping: set = (block number) modulo (number of sets)
[Figure: the three placements of memory block 12 (of memory blocks 0-31) in cache frames 0-7.]
Q2: How is a block found if it is in the upper level?
The block address is divided into a tag, an index, and a block offset:
- Index: used to look up candidates; the index identifies the set in the cache
- Tag: used to identify the actual copy; if no candidates match, declare a cache miss
- Block offset (data select): the block is the minimum quantum of caching; the data select field selects data within the block (many caching applications don't have a data select field)
A larger block size has distinct hardware advantages:
- Less tag overhead
- Exploits fast burst transfers from DRAM / over wide busses
Disadvantages of a larger block size?
- Fewer blocks -> more conflicts; can waste bandwidth
Review: Direct Mapped Cache
A direct-mapped 2^N-byte cache:
- The uppermost (32 - N) bits are always the cache tag
- The lowest M bits are the byte select (block size = 2^M)
Example: 1 KB direct-mapped cache with 32 B blocks
- The index chooses the potential block (e.g., cache index 0x01, address bits 9-5)
- The tag is checked to verify the block (e.g., cache tag 0x50, address bits 31-10)
- The byte select chooses the byte within the block (e.g., byte select 0x00, address bits 4-0)
[Figure: the valid-bit, cache-tag, and cache-data arrays of the 1 KB cache; block 0 holds bytes 0-31, block 1 holds bytes 32-63, and so on up to byte 1023.]
Direct Mapped Cache Architecture
[Figure: the address is split into tag, frame number, and offset; the frame number indexes the tag and block-frame arrays, the stored tag is compared against the address tag to signal a hit, and a mux selects the data word from the block.]
Review: Set Associative Cache
N-way set associative: N entries per cache index
- N direct-mapped caches operate in parallel
Example: two-way set associative cache
- The cache index selects a set from the cache
- The two tags in the set are compared to the input in parallel
- Data is selected based on the tag result
[Figure: a two-way set associative cache. The cache index (address bits 8-5 here) selects one cache block from each way; both tags are compared in parallel, the compare results are ORed into a hit signal, and a mux (Sel1/Sel0) picks the matching cache block.]
Review: Fully Associative Cache
Fully associative: any block frame can hold any line
- The address does not include a cache index
- The cache tags of all cache entries are compared in parallel
Example: block size = 32 B
- We need N 27-bit comparators (tag = address bits 31-5)
- We still have the byte select to choose a byte within the block (e.g., 0x01)
[Figure: a 27-bit cache tag compared against every entry in parallel, one "=" comparator per entry, plus a 5-bit byte select.]
Concluding Remarks
- A direct-mapped cache = a 1-way set associative cache
- A fully associative cache: there is only 1 set
Cache Size Equation
A simple equation for the size of a cache:
  (Cache size) = (Block size) x (Number of sets) x (Set associativity)
Can relate to the size of the various address fields:
- (Block size) = 2^(# of offset bits)
- (Number of sets) = 2^(# of index bits)
- (# of tag bits) = (# of memory address bits) - (# of index bits) - (# of offset bits)
[Figure: a memory address split into tag, index, and offset fields.]
Q3: Which block should be replaced on a miss?
- Easy for a direct-mapped cache: there is only one choice
- Set associative or fullyly associative:
  - LRU (least recently used): appealing, but hard to implement for high associativity
  - Random: easy, but how well does it work?
  - First in, first out (FIFO)
Q4: What happens on a write?

Policy:                                          Write-through    Write-back
Debug                                            Easy             Hard
Do read misses produce writes?                   No               Yes
Do repeated writes make it to the lower level?   Yes              No
Write Buffers
[Figure: the processor writes into the cache and into a write buffer; the write buffer drains to the lower-level memory.]
More on Cache Performance Metrics
Can split access time into instructions & data:
  Avg. mem. acc. time = (% instruction accesses) x (inst. mem. access time)
                      + (% data accesses) x (data mem. access time)
Another formula, from Chapter 1:
  CPU time = (CPU execution clock cycles + Memory stall clock cycles) x cycle time
- Useful for exploring ISA changes
Can break stalls into reads and writes:
  Memory stall cycles = (Reads x read miss rate x read miss penalty)
                      + (Writes x write miss rate x write miss penalty)
Sources of Cache Misses
- Compulsory (cold start or process migration; first reference): the first access to a block
  - A cold fact of life: not a whole lot you can do about it
  - Note: if you are going to run billions of instructions, compulsory misses are insignificant
- Capacity: the cache cannot contain all the blocks accessed by the program
  - Solution: increase the cache size
- Conflict (collision): multiple memory locations are mapped to the same cache location
  - Solution 1: increase the cache size
  - Solution 2: increase associativity
- Coherence (invalidation): another process (e.g., I/O) updates memory
Memory Hierarchy Basics
Six basic cache optimizations:
1. Larger block size
   - Reduces compulsory misses
   - Increases capacity and conflict misses; increases miss penalty
2. Larger total cache capacity, to reduce the miss rate
   - Increases hit time; increases power consumption
3. Higher associativity
   - Reduces conflict misses
   - Increases hit time; increases power consumption
4. Higher number of cache levels
   - Reduces overall memory access time
5. Giving priority to read misses over writes
   - Reduces miss penalty
6. Avoiding address translation in cache indexing
   - Reduces hit time
1. Larger Block Sizes
- Larger block size -> fewer blocks in the cache
- Obvious advantage: reduces compulsory misses
  - The reason is spatial locality
- Obvious disadvantages:
  - Higher miss penalty: a larger block takes longer to move
  - May increase conflict misses and capacity misses if the cache is small
2. Large Caches
- Larger cache size -> lower miss rate, but higher hit time
- Helps with both conflict and capacity misses
- May need a longer hit time AND/OR higher hardware cost
- Popular in off-chip caches
3. Higher Associativity
- Reduces conflict misses
- 2:1 cache rule of thumb on miss rate:
  - A 2-way set associative cache of size N/2 misses about as often as a direct-mapped cache of size N (held for cache sizes < 128 KB)
- Greater associativity comes at the cost of increased hit time
  - Lengthens the clock cycle
4. Multi-Level Caches
Two-level cache example:
  AMAT_L1 = Hit time_L1 + Miss rate_L1 x Miss penalty_L1
  AMAT_L2 = Hit time_L1 + Miss rate_L1 x (Hit time_L2 + Miss rate_L2 x Miss penalty_L2)
Probably the best miss-penalty reduction method.
Definitions:
- Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss rate_L2)
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss rate_L1 x Miss rate_L2)
- The global miss rate is what matters
Multi-Level Caches (Cont.)
Advantages:
- Capacity misses in L1 end up with a significant penalty reduction
- Conflict misses in L1 similarly get supplied by L2
Holding the size of the 1st-level cache constant:
- Decreases the miss penalty of the 1st-level cache
- Or, increases the average global hit time a bit:
  hit time_L1 + miss rate_L1 x hit time_L2
- But decreases the global miss rate
Holding the total cache size constant:
- Global miss rate and miss penalty are about the same
- Decreases the average global hit time significantly!
- The new L1 is much smaller than the old L1
Miss Rate Example
Suppose that in 1000 memory references there are 40 misses in the first-level cache and 20 misses in the second-level cache.
- Miss rate for the first-level cache = 40/1000 (4%)
- Local miss rate for the second-level cache = 20/40 (50%)
- Global miss rate for the second-level cache = 20/1000 (2%)
Assume miss penalty_L2 is 200 CC, hit time_L2 is 10 CC, hit time_L1 is 1 CC, and there are 1.5 memory references per instruction. What are the average memory access time and the average stall cycles per instruction? Ignore the impact of writes.
- AMAT = Hit time_L1 + Miss rate_L1 x (Hit time_L2 + Miss rate_L2 x Miss penalty_L2) = 1 + 4% x (10 + 50% x 200) = 5.4 CC
- Average memory stalls per instruction = Misses per instruction_L1 x Hit time_L2 + Misses per instruction_L2 x Miss penalty_L2 = (40 x 1.5/1000) x 10 + (20 x 1.5/1000) x 200 = 6.6 CC
- Or (5.4 - 1.0) x 1.5 = 6.6 CC
5. Giving Priority to Read Misses over Writes
In a write-through cache, write buffers complicate memory access: they might hold the updated value of a location needed on a read miss (is R2 == R3 below?).
    SW R3, 512(R0)     ; cache index 0
    LW R1, 1024(R0)    ; cache index 0
    LW R2, 512(R0)     ; cache index 0
- This is a RAW conflict with main memory reads on cache misses
- If the read miss simply waits until the write buffer is empty, the read miss penalty increases
- Better: check the write buffer contents before the read; if there are no conflicts, let the memory access continue (the read gets priority over writes)
Write-back?
- A read miss may replace a dirty block
- Normal: write the dirty block to memory, and then do the read
- Instead: copy the dirty block to a write buffer, then do the read, and then do the write
- The CPU stalls less, since it restarts as soon as the read is done
6. Avoiding Address Translation during Indexing of the Cache
("$" means cache)
[Figure: three organizations. Conventional: CPU -> VA -> TLB -> PA -> cache -> PA -> memory. Overlapped: the cache is indexed with the VA while the TLB translates in parallel. Virtually addressed cache: CPU -> VA -> cache with VA tags; the TLB is consulted only on a miss, before the PA goes to an L2 cache / memory.]
Why not a Virtual Cache?
- A task switch causes the same VA to refer to different PAs
  - Hence, the cache must be flushed
  - Huge task-switch overhead
  - Also creates huge compulsory miss rates for the new process
- The synonym or alias problem: different VAs may map to the same PA
  - Two copies of the same data can exist in a virtual cache
  - An anti-aliasing HW mechanism is required (complicated)
  - SW can help
- I/O (always uses PAs)
  - Requires mapping to VAs to interact with a virtual cache
Advanced Cache Optimizations
Reducing hit time:
1. Small and simple caches
2. Way prediction
Increasing cache bandwidth:
3. Pipelined caches
4. Multibanked caches
5. Nonblocking caches
Reducing miss penalty:
6. Critical word first
7. Merging write buffers
Reducing miss rate:
8. Compiler optimizations
Reducing miss penalty or miss rate via parallelism:
9. Hardware prefetching
10. Compiler prefetching
1. Small and Simple L1 Caches
- The critical timing path in a cache: addressing the tag memory, then comparing tags, then selecting the correct set
  - Indexing the tag memory and then comparing takes time
- Direct-mapped caches can overlap the tag compare and the transmission of data
  - Since there is only one choice
- Lower associativity reduces power, because fewer cache lines are accessed
L1 Size and Associativity
[Figures: access time and energy per read as a function of L1 cache size and associativity.]
2. Fast Hit Times via Way Prediction
- How do we combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set associative cache?
- Way prediction: keep extra bits in the cache to predict the way (the block within the set) of the next cache access
  - The multiplexor is set early to select the desired block; only one tag comparison is performed that clock cycle, in parallel with reading the cache data
  - On a first-cycle miss, check the other blocks for matches in the next clock cycle
- Prediction accuracy is around 85%; a correct prediction gives the fast hit time, a misprediction costs an extra cycle
- Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles
  - Used for instruction caches rather than data caches
Way Prediction
- To improve hit time, predict the way to preset the mux
  - A misprediction gives a longer hit time
  - Prediction accuracy:
    - > 90% for two-way
    - > 80% for four-way
    - The I-cache has better accuracy than the D-cache
  - First used on the MIPS R10000 in the mid-90s
  - Used on the ARM Cortex-A8
- Extend the idea to predict the block as well ("way selection")
  - Increases the misprediction penalty
3. Increasing Cache Bandwidth by Pipelining
- Pipeline cache access to improve bandwidth
- Examples:
  - Pentium: 1 cycle
  - Pentium Pro - Pentium III: 2 cycles
  - Pentium 4 - Core i7: 4 cycles
- Makes it easier to increase associativity
- But pipelining the cache increases the access latency
  - More clock cycles between the issue of the load and the use of the data
- Also increases the branch misprediction penalty
4. Increasing Cache Bandwidth: Non-Blocking Caches
- "Hit under miss" reduces the effective miss penalty by doing useful work during a miss instead of ignoring CPU requests
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  - Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  - Requires multiple memory banks (otherwise it cannot be supported)
  - The Pentium Pro allows 4 outstanding memory misses
Non-blocking Cache Performance
- L2 must support this
- In general, processors can hide the L1 miss penalty but not the L2 miss penalty
Increasing Cache Bandwidth via Multiple Banks
- Rather than treating the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses
  - E.g., the T1 (Niagara) L2 has 4 banks
- Banking works best when the accesses naturally spread themselves across the banks; the mapping of addresses to banks affects the behavior of the memory system
- A simple mapping that works well is sequential interleaving:
  - Spread block addresses sequentially across the banks
  - E.g., with 4 banks, bank 0 has all blocks whose address modulo 4 is 0, bank 1 has all blocks whose address modulo 4 is 1, and so on
5. Increasing Cache Bandwidth via Multibanked Caches
- Organize the cache as independent banks to support simultaneous access (rather than as a single monolithic block)
  - The ARM Cortex-A8 supports 1-4 banks for L2
  - The Intel i7 supports 4 banks for L1 and 8 banks for L2
- Banking works best when the accesses naturally spread themselves across the banks
- Interleave the banks according to block address
6. Reduce Miss Penalty: Critical Word First and Early Restart
- The processor usually needs one word of the block at a time
  - Do not wait for the full block to be loaded before restarting the processor
- Critical word first: request the missed word first from memory and send it to the processor as soon as it arrives; let the processor continue execution while filling the rest of the words in the block. Also called "wrapped fetch" and "requested word first."
- Early restart: as soon as the requested word of the block arrives, send it to the processor and let the processor continue execution
- The benefits of critical word first and early restart depend on:
  - Block size: generally useful only with large blocks
  - The likelihood of another access to the portion of the block that has not yet been fetched
    - The spatial-locality problem: programs tend to want the next sequential word, so it is not clear there is a benefit
7. Merging Write Buffer to Reduce Miss Penalty
- The write buffer allows the processor to continue while waiting for the write to reach memory
- If the buffer contains modified blocks, the addresses can be checked to see if the address of the new data matches the address of a valid write-buffer entry; if so, the new data are combined with that entry
- Increases the effective block size of a write for a write-through cache of writes to sequential words/bytes, since multiword writes are more efficient to memory
Merging Write Buffer
- When storing to a block that is already pending in the write buffer, update the write buffer
- Reduces stalls due to a full write buffer
[Figure: write-buffer contents with and without merging; without merging each word occupies its own buffer entry, with merging sequential words share one entry.]
8. Reducing Misses by Compiler Optimizations
- McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks, in software
- Instructions:
  - Reorder procedures in memory so as to reduce conflict misses
  - Profiling to look at conflicts (using tools they developed)
- Data:
  - Loop interchange: swap nested loops to access data in the order it is stored in memory (in sequential order)
  - Loop fusion: combine 2 independent loops that have the same looping and some overlapping variables
  - Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
    - Instead of accessing entire rows or columns, subdivide the matrices into blocks
    - Requires more memory accesses, but improves the locality of the accesses
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
for (j = 0; j < 100; j = j+1)
for (i = 0; i < 5000; i = i+1)
x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
for (i = 0; i < 5000; i = i+1)
for (j = 0; j < 100; j = j+1)
x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding through memory every 100 words; improved spatial locality.
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{
a[i][j] = 1/b[i][j] * c[i][j];
d[i][j] = a[i][j] + c[i][j];}
The two loops perform different computations on the same data, so fuse them: the second statement can then reuse a[i][j] and c[i][j] while they are still in the cache.
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{r = 0;
for (k = 0; k < N; k = k+1){
r = r + y[i][k]*z[k][j];};
x[i][j] = r;
};
Two inner loops:
- Read all N x N elements of z[]
- Read N elements of 1 row of y[] repeatedly
- Write N elements of 1 row of x[]
Capacity misses are a function of N and the cache size:
- 2N^3 + N^2 words accessed (assuming no conflicts; otherwise worse)
Idea: compute on a B x B submatrix that fits in the cache.
Snapshot of x, y, z when N = 6, i = 1 (before blocking).
[Figure: the shading shows which elements have been touched: all of z, one row of y read repeatedly, one row of x written.]
Blocking Example
/* After */
for (jj = 0; jj < N; jj = jj+B)
for (kk = 0; kk < N; kk = kk+B)
for (i = 0; i < N; i = i+1)
for (j = jj; j < min(jj+B,N); j = j+1)
{r = 0;
for (k = kk; k < min(kk+B,N); k = k+1) {
r = r + y[i][k]*z[k][j];};
x[i][j] = x[i][j] + r;
};
- B is called the blocking factor
- Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
- Conflict misses, too?
The age of accesses to x, y, z when B = 3.
[Figure: with blocking, only a B x B portion of each array is active at a time.]
9. Reducing Miss Penalty or Miss Rate by Hardware Prefetching of Instructions & Data
- Prefetching relies on having extra memory bandwidth that can be used without penalty
- Instruction prefetching:
  - Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block
  - The requested block is placed in the instruction cache when it returns, and the prefetched block is placed into an instruction stream buffer
- Data prefetching:
  - The Pentium 4 can prefetch data into the L2 cache from up to 8 streams from 8 different 4 KB pages
  - Prefetching is invoked if there are 2 successive L2 cache misses to a page, and if the distance between those cache blocks is < 256 bytes
[Figure: performance improvement from hardware prefetching on an Intel Pentium 4, for SPECint2000 benchmarks (gap, mcf) and SPECfp2000 benchmarks (including wupwise, swim, applu, galgel, facerec, lucas, mgrid, equake); speedups range from 1.16 to 1.97.]
10. Reducing Miss Penalty or Miss Rate by Compiler-Controlled Prefetching of Data
- A prefetch instruction is inserted before the data is needed
- Data prefetch:
  - Register prefetch: load the data into a register (HP PA-RISC loads)
  - Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v.9)
  - Special prefetching instructions cannot cause faults; a form of speculative execution
- Issuing prefetch instructions takes time
  - Is the cost of the prefetch issues < the savings in reduced misses?
  - Wider superscalar issue reduces the difficulty of issue bandwidth
  - Combine with software pipelining and loop unrolling
Summary
[Table: the ten advanced optimizations and their effects on hit time, bandwidth, miss penalty, miss rate, power, and hardware complexity.]
Memory Technology
- Performance metrics:
  - Latency is the concern of the cache
  - Bandwidth is the concern of multiprocessors and I/O
  - Access time: the time between when a read is requested and when the desired word arrives
  - Cycle time: the minimum time between unrelated requests to memory
- DRAM is used for main memory; SRAM is used for caches
Memory Technology
- SRAM: static random access memory
  - Requires low power to retain its bits, since there is no refresh
  - But requires 6 transistors/bit (vs. 1 transistor/bit for DRAM)
- DRAM:
  - One transistor/bit
  - Must be rewritten after being read
  - Must also be periodically refreshed (every ~8 ms); each row can be refreshed simultaneously
  - Address lines are multiplexed:
    - Upper half of the address: row access strobe (RAS)
    - Lower half of the address: column access strobe (CAS)
DRAM Technology
- Emphasis on cost per bit and capacity
- Multiplexed address lines cut the number of address pins in half
  - Row access strobe (RAS) first, then column access strobe (CAS)
  - Memory is organized as a 2-D matrix; rows go to a buffer
  - A subsequent CAS selects a subrow
- Uses only a single transistor to store a bit
  - Reading that bit can destroy the information
  - Refresh each bit periodically (e.g., every 8 milliseconds) by writing it back
  - Keep the refreshing time to less than 5% of the total time
- DRAM capacity is 4 to 8 times that of SRAM
DRAM Logical Organization (4 Mbit)
[Figure: an 11-bit address (A0-A10) drives the row and column decoders of a 2,048 x 2,048 memory array; each storage cell sits on a word line. The array is square, so the number of bits per RAS/CAS address is the square root of the total bits.]
DRAM Technology (cont.)
- DIMM: dual inline memory module
  - DRAM chips are commonly sold on small boards called DIMMs
  - DIMMs typically contain 4 to 16 DRAMs
- Slowdown in DRAM capacity growth:
  - Four times the capacity every three years, for more than 20 years
  - New chips only double capacity every two years, since 1998
- DRAM performance is growing at a slower rate:
  - RAS (related to latency): 5% per year
  - CAS (related to bandwidth): 10%+ per year
RAS Improvement
[Figure: row access strobe (RAS) timing improvements across DRAM generations.]
Quest for DRAM Performance
1. Fast page mode
   - Add timing signals that allow repeated accesses to the row buffer without another row access time
   - Such a buffer comes naturally, as each array buffers 1024 to 2048 bits for each access
2. Synchronous DRAM (SDRAM)
   - Add a clock signal to the DRAM interface, so that the repeated transfers do not bear the overhead of synchronizing with the DRAM controller
3. Double Data Rate (DDR SDRAM)
   - Transfer data on both the rising edge and the falling edge of the DRAM clock signal, doubling the peak data rate
   - DDR2 lowers power by dropping the voltage from 2.5 V to 1.8 V, and offers higher clock rates: up to 400 MHz
   - DDR3 drops to 1.5 V, with clock rates up to 800 MHz
   - DDR4 drops to 1.2 V, with clock rates up to 1600 MHz
These improve bandwidth, not latency.
The DRAM name is based on peak chip transfers/sec; the DIMM name is based on peak DIMM MBytes/sec.

Standard | Clock rate (MHz) | M transfers/s (= 2 x clock) | DRAM name | MBytes/s/DIMM (= 8 x M transfers/s) | DIMM name
DDR      | 133 | 266  | DDR266    | 2128  | PC2100
DDR      | 150 | 300  | DDR300    | 2400  | PC2400
DDR      | 200 | 400  | DDR400    | 3200  | PC3200
DDR2     | 266 | 533  | DDR2-533  | 4264  | PC4300
DDR2     | 333 | 667  | DDR2-667  | 5336  | PC5300
DDR2     | 400 | 800  | DDR2-800  | 6400  | PC6400
DDR3     | 533 | 1066 | DDR3-1066 | 8528  | PC8500
DDR3     | 666 | 1333 | DDR3-1333 | 10664 | PC10700
DDR3     | 800 | 1600 | DDR3-1600 | 12800 | PC12800
DRAM Performance
[Figure: DRAM performance trends.]
Graphics Memory
- GDDR5 is graphics memory based on DDR3
- Graphics memory:
  - Achieves 2-5x the bandwidth per DRAM vs. DDR3
    - Wider interfaces (32 vs. 16 bits)
    - Higher clock rate
      - Possible because the chips are attached via soldering instead of socketed DIMM modules
Memory Power Consumption
[Figure: memory power consumption.]
SRAM Technology
- Caches use SRAM: static random access memory
- SRAM uses six transistors per bit to prevent the information from being disturbed when read
  - No need to refresh
  - SRAM needs only minimal power to retain the charge in standby mode: good for embedded applications
- There is no difference between access time and cycle time for SRAM
- Emphasis on speed and capacity
  - SRAM address lines are not multiplexed
- SRAM speed is 8 to 16x that of DRAM
ROM and Flash
- Embedded processor memory
- Read-only memory (ROM):
  - Programmed at the time of manufacture
  - Only a single transistor per bit to represent 1 or 0
  - Used for the embedded program and for constants
  - Nonvolatile and indestructible
- Flash memory:
  - Must be erased (in blocks) before being overwritten
  - Nonvolatile, but allows the memory to be modified
  - Reads at almost DRAM speeds, but writes are 10 to 100 times slower
  - DRAM capacity per chip and MB per dollar are about 4 to 8 times greater than flash
  - Cheaper than SDRAM, more expensive than disk
  - Slower than SRAM, faster than disk
Memory Dependability
- Memory is susceptible to cosmic rays
- Soft errors: dynamic errors
  - Detected and fixed by error-correcting codes (ECC)
- Hard errors: permanent errors
  - Use spare rows to replace defective rows
- Chipkill: a RAID-like error recovery technique
Virtual Memory?
The limits of physical addressing:
- All programs share one physical address space
- Machine-language programs must be aware of the machine organization
- There is no way to prevent a program from accessing any machine resource
Recall: many processes use only a small portion of their address space.
Virtual memory divides physical memory into blocks (called pages or segments) and allocates them to different processes.
With virtual memory, the processor produces virtual addresses that are translated by a combination of HW and SW to physical addresses (called memory mapping or address translation).
Virtual Memory: Add a Layer of Indirection
[Figure: the CPU issues virtual addresses (A0-A31, D0-D31); an address-translation layer maps them to the physical addresses (A0-A31, D0-D31) seen by memory, where the data lives.]
Virtual Memory
[Figure: virtual pages mapped to physical page frames by a page table.]
Virtual Memory (cont.)
- Permits applications to grow bigger than the main memory size
- Helps with multiple-process management:
  - Each process gets its own chunk of memory
  - Permits protection of one process's chunks from another
  - Maps multiple chunks onto shared physical memory
  - Mapping also facilitates relocation (a program can run in any memory location, and can be moved during execution)
  - The application and CPU run in virtual space (logical memory, 0 to max); the mapping onto physical space is invisible to the application
- Cache vs. virtual memory:
  - A block becomes a page or segment
  - A miss becomes a page fault or address fault
3 Advantages of VM
- Translation:
  - A program can be given a consistent view of memory, even though physical memory is scrambled
  - Makes multithreading reasonable (now used a lot!)
  - Only the most important part of the program (the working set) must be in physical memory
  - Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later
- Protection:
  - Different threads (or processes) are protected from each other
  - Different pages can be given special behavior (read-only, invisible to user programs, etc.)
  - Kernel data is protected from user programs
  - Very important for protection against malicious programs
- Sharing:
  - Can map the same physical page to multiple users (shared memory)
Protection via Virtual Memory
- Keeps processes in their own memory space
- Role of the architecture:
  - Provide user mode and supervisor mode
  - Protect certain aspects of CPU state
  - Provide mechanisms for switching between user mode and supervisor mode
  - Provide mechanisms to limit memory accesses
  - Provide a TLB to translate addresses
Virtual Memory
[Figure: a virtual address is mapped to a frame in the physical memory space; the OS manages the page table for each ASID. A machine usually supports pages of a few sizes (MIPS R4000).]
[Figure: address translation through a page table. The virtual address is split into a virtual page number and a 12-bit offset. The page table base register plus the virtual page number index into the page table, which is itself located in physical memory; the selected entry holds the access rights and the physical page number. Concatenating the physical page number with the 12-bit offset yields the physical address of a frame in the physical memory space.]
Page Table Entry (PTE)?
What is in a page table entry (PTE)?
- A pointer to the next-level page table, or to the actual page
- Permission bits: valid, read-only, read-write, write-only
Example: the Intel x86 architecture PTE:
- The address has the same format as the previous slide (10, 10, 12-bit offset)
- Intermediate page tables are called "directories"
- Bit layout (bits 11-0): Free (OS) | L | D | A | PCD | PWT | U | W | P
  - P: present (same as the valid bit in other architectures)
  - W: writeable
  - U: user accessible
  - PWT: page write transparent - external cache write-through
  - PCD: page cache disabled (the page cannot be cached)
  - A: accessed - the page has been accessed recently
  - D: dirty (PTE only) - the page has been modified recently
  - L: L = 1 means a 4 MB page (directory entry only); the bottom 22 bits of the virtual address serve as the offset
Cache vs. Virtual Memory
- Replacement:
  - A cache miss is handled by hardware
  - A page fault is usually handled by the OS
- Addresses:
  - The virtual memory space is determined by the address size of the CPU
  - The cache size is independent of the CPU address size
- Lower-level memory:
  - For caches, the main memory is not shared by something else
  - For virtual memory, most of the disk contains the file system
    - The file system is addressed differently, usually in I/O space
    - The virtual memory lower level is usually called swap space
The Same 4 Questions for Virtual Memory
- Block placement:
  - The choice: lower miss rates with complex placement, or vice versa
  - The miss penalty is huge, so choose a low miss rate: place the page anywhere
  - Similar to the fully associative cache model
- Block identification: both use an additional data structure
  - Fixed-size pages: use a page table
  - Variable-sized segments: use a segment table
- Block replacement: LRU is the best
  - However, true LRU is a bit complex, so use an approximation:
    - The page table contains a use tag, and on access the use tag is set
    - The OS checks the tags every so often, records what it sees in a data structure, and then clears them all
    - On a miss, the OS decides which page has been used the least and replaces it
- Write strategy: always write back
  - Due to the access time of the disk, write-through is silly
  - Use a dirty bit to write back only pages that have been modified
Techniques for Fast Address Translation
- The page table is kept in main memory (kernel memory)
  - Each process has a page table
- Every data/instruction access requires two memory accesses:
  - One for the page table and one for the data/instruction
  - This can be solved by a special fast lookup hardware cache called associative registers, or a translation lookaside buffer (TLB)
- If locality applies, then cache the recent translations
  - TLB = translation lookaside buffer
  - A TLB entry: virtual page number, physical page number, protection bit, use bit, dirty bit
Translation Lookaside Buffers
A translation lookaside buffer (TLB) is a cache on translations:
- It can be fully associative, set associative, or direct mapped
- TLBs cache page table entries
[Figure: translation with a TLB. The CPU issues a VA; on a TLB hit, the PA goes straight to the cache (and on to main memory on a cache miss). On a TLB miss, the full translation is performed through the page table and the result is cached in the TLB. The virtual page number is looked up in the TLB (per ASID); on a hit, the physical frame number is concatenated with the page offset to form the physical address.]
Caching Applied to Address Translation
[Figure: the CPU sends a virtual address to the TLB. If the translation is cached, the physical address goes directly to physical memory; if not, the MMU translates it and the result is cached in the TLB. Data reads and writes then proceed untranslated with the physical address.]
Virtual Machines
- Support isolation and security
- Allow sharing a computer among many unrelated users
- Enabled by the raw speed of processors, which makes the overhead more acceptable
- Allow different ISAs and operating systems to be presented to user programs
- System virtual machines:
  - The SVM software is called a virtual machine monitor or hypervisor
  - The individual virtual machines that run under the monitor are called guest VMs
Impact of VMs on Virtual Memory
- Each guest OS maintains its own set of page tables
- The VMM adds a level of memory between physical and virtual memory, called "real memory"
- The VMM maintains a shadow page table that maps guest virtual addresses to physical addresses
  - This requires the VMM to detect the guest's changes to its own page table
  - Detection occurs naturally if accessing the page table pointer is a privileged operation