Lecture 3: Memory Hierarchy Design (Chapter 2, Appendix B)
Chih-Wei Liu
National Chiao Tung University
cwliu@twins.ee.nctu.edu.tw
Introduction
Since 1980, CPU speed has outpaced DRAM:
- CPU: ~60% per year (2x in 1.5 years)
- DRAM: ~9% per year (2x in 10 years)
- The processor-memory gap has grown about 50% per year
CA-Lec3 cwliu@twins.ee.nctu.edu.tw
Introduction
Programmers want unlimited amounts of memory with low latency.
Fast memory technology is more expensive per bit than slower memory.
Solution: organize the memory system into a hierarchy:
- The entire addressable memory space is available in the largest, slowest memory
- Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
Temporal and spatial locality ensure that nearly all references can be found in the smaller memories.
This gives the illusion of a large, fast memory being presented to the processor.
Memory Hierarchy Design
Memory hierarchy design becomes more crucial with recent multicore processors:
Aggregate peak bandwidth grows with the number of cores:
- An Intel Core i7 can generate two references per core per clock
- With four cores and a 3.2 GHz clock:
  25.6 billion 64-bit data references/second
  + 12.8 billion 128-bit instruction references/second
  = 409.6 GB/s!
- DRAM bandwidth is only 6% of this (25 GB/s)
Requires:
- Multiport, pipelined caches
- Two levels of cache per core
- A shared third-level cache on chip
Memory Hierarchy
Take advantage of the principle of locality to:
- Present as much memory as in the cheapest technology
- Provide access at the speed offered by the fastest technology
[Figure: the memory hierarchy. The processor (control, datapath, registers) is backed by an on-chip cache, a second-level cache (SRAM), main memory (DRAM/Flash/PCM), secondary storage (disk/Flash/PCM), and tertiary storage (tape/cloud storage). Access times grow from about 1 ns at the top, through 10s-100s ns at the caches and 100s ns at main memory, to 10,000,000s ns (10s of ms) at secondary storage and 10,000,000,000s ns (10s of seconds) at tertiary storage; capacities grow from KBs-MBs at the caches to MBs-GBs at main memory, GBs at disk, and TBs at tertiary storage.]
Multicore Architecture
[Figure: multiple processing nodes, each containing a CPU, connected by an interconnection network.]
The Principle of Locality
- Programs access a relatively small portion of the address space at any instant of time.
- Two different types of locality:
  - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  - Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
- Hardware relies on locality for speed.
Memory Hierarchy Basics
When a word is not found in the cache, a miss occurs:
- Fetch the word from the lower level in the hierarchy, requiring a higher-latency reference
  - The lower level may be another cache or main memory
- Also fetch the other words contained within the block
  - Takes advantage of spatial locality
- Place the block into the cache in any location within its set, determined by the address:
  (block address) MOD (number of sets)
Hit and Miss
- Hit: the data appears in some block in the upper level (e.g., Block X)
  - Hit rate: the fraction of memory accesses found in the upper level
  - Hit time: time to access the upper level, consisting of RAM access time + time to determine hit/miss
- Miss: the data needs to be retrieved from a block in the lower level (Block Y)
  - Miss rate = 1 - (hit rate)
  - Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor
- Hit time << miss penalty (500 instructions on the Alpha 21264!)
[Figure: on a hit, Block X moves between the upper-level memory and the processor; on a miss, Block Y is brought up from the lower-level memory.]
Cache Performance Formulas

(Average memory access time) = (Hit time) + (Miss rate) x (Miss penalty)

Important: the miss penalty is measured from the cache down to the lower levels of the hierarchy.
[Figure: CPU and cache, with the miss penalty spanning the path from the cache to the lower levels of the hierarchy.]
Four Questions for Memory Hierarchy
Consider any level in a memory hierarchy; remember that a block is the unit of data transfer between the given level and the levels below it.
The level's design is described by four behaviors:
- Block placement: where can a new block be placed in the level?
- Block identification: how is a block found if it is in the level?
- Block replacement: which existing block should be replaced if necessary?
- Write strategy: how are writes to the block handled?
Q1: Where can a block be placed in the upper level?
Block 12 placed in an 8-block cache:
- Fully associative (fully mapped): block 12 can go into any of the 8 frames
- Direct mapped: block 12 can go only into frame (12 mod 8) = 4
- 2-way set associative: block 12 can go anywhere in set (12 mod 4) = 0
Set-associative mapping: set = (block number) modulo (number of sets)
[Figure: the three placements of memory block 12 (of memory blocks 0-31) in cache frames 0-7.]
Q2: How is a block found if it is in the upper level?
The block address is divided into a tag, an index, and a block offset:
- Index: used to look up candidates; the index identifies the set in the cache
- Tag: used to identify the actual copy; if no candidates match, declare a cache miss
- Block offset (data select): the block is the minimum quantum of caching; the data select field selects data within the block (many caching applications don't have a data select field)
A larger block size has distinct hardware advantages:
- Less tag overhead
- Exploits fast burst transfers from DRAM / over wide busses
Disadvantages of a larger block size?
- Fewer blocks -> more conflicts; can waste bandwidth
Review: Direct Mapped Cache
A direct-mapped 2^N-byte cache:
- The uppermost (32 - N) bits are always the cache tag
- The lowest M bits are the byte select (block size = 2^M)
Example: 1 KB direct-mapped cache with 32 B blocks
- The index chooses the potential block (e.g., cache index 0x01, address bits 9-5)
- The tag is checked to verify the block (e.g., cache tag 0x50, address bits 31-10)
- The byte select chooses the byte within the block (e.g., byte select 0x00, address bits 4-0)
[Figure: the valid-bit, cache-tag, and cache-data arrays of the 1 KB cache; block 0 holds bytes 0-31, block 1 holds bytes 32-63, and so on up to byte 1023.]
Direct Mapped Cache Architecture
[Figure: the address is split into tag, frame number, and offset; the frame number indexes the tag and block-frame arrays, the stored tag is compared against the address tag to signal a hit, and a mux selects the data word from the block.]
Review: Set Associative Cache
N-way set associative: N entries per cache index
- N direct-mapped caches operate in parallel
Example: two-way set associative cache
- The cache index selects a set from the cache
- The two tags in the set are compared to the input in parallel
- Data is selected based on the tag result
[Figure: a two-way set associative cache. The cache index (address bits 8-5 here) selects one cache block from each way; both tags are compared in parallel, the compare results are ORed into a hit signal, and a mux (Sel1/Sel0) picks the matching cache block.]
Review: Fully Associative Cache
Fully associative: any block frame can hold any line
- The address does not include a cache index
- The cache tags of all cache entries are compared in parallel
Example: block size = 32 B
- We need N 27-bit comparators (tag = address bits 31-5)
- We still have the byte select to choose a byte within the block (e.g., 0x01)
[Figure: a 27-bit cache tag compared against every entry in parallel, one "=" comparator per entry, plus a 5-bit byte select.]
Concluding Remarks
- A direct-mapped cache = a 1-way set associative cache
- A fully associative cache: there is only 1 set
Cache Size Equation
A simple equation for the size of a cache:
  (Cache size) = (Block size) x (Number of sets) x (Set associativity)
Can relate to the size of the various address fields:
- (Block size) = 2^(# of offset bits)
- (Number of sets) = 2^(# of index bits)
- (# of tag bits) = (# of memory address bits) - (# of index bits) - (# of offset bits)
[Figure: a memory address split into tag, index, and offset fields.]
Q3: Which block should be replaced on a miss?
- Easy for a direct-mapped cache: there is only one choice
- Set associative or fullyly associative:
  - LRU (least recently used): appealing, but hard to implement for high associativity
  - Random: easy, but how well does it work?
  - First in, first out (FIFO)
Q4: What happens on a write?

Policy:                                          Write-through    Write-back
Debug                                            Easy             Hard
Do read misses produce writes?                   No               Yes
Do repeated writes make it to the lower level?   Yes              No
Write Buffers
[Figure: the processor writes into the cache and into a write buffer; the write buffer drains to the lower-level memory.]
More on Cache Performance Metrics
Can split access time into instructions & data:
  Avg. mem. acc. time = (% instruction accesses) x (inst. mem. access time)
                      + (% data accesses) x (data mem. access time)
Another formula, from Chapter 1:
  CPU time = (CPU execution clock cycles + Memory stall clock cycles) x cycle time
- Useful for exploring ISA changes
Can break stalls into reads and writes:
  Memory stall cycles = (Reads x read miss rate x read miss penalty)
                      + (Writes x write miss rate x write miss penalty)
Sources of Cache Misses
- Compulsory (cold start or process migration; first reference): the first access to a block
  - A cold fact of life: not a whole lot you can do about it
  - Note: if you are going to run billions of instructions, compulsory misses are insignificant
- Capacity: the cache cannot contain all the blocks accessed by the program
  - Solution: increase the cache size
- Conflict (collision): multiple memory locations are mapped to the same cache location
  - Solution 1: increase the cache size
  - Solution 2: increase associativity
- Coherence (invalidation): another process (e.g., I/O) updates memory
Memory Hierarchy Basics
Six basic cache optimizations:
1. Larger block size
   - Reduces compulsory misses
   - Increases capacity and conflict misses; increases miss penalty
2. Larger total cache capacity, to reduce the miss rate
   - Increases hit time; increases power consumption
3. Higher associativity
   - Reduces conflict misses
   - Increases hit time; increases power consumption
4. Higher number of cache levels
   - Reduces overall memory access time
5. Giving priority to read misses over writes
   - Reduces miss penalty
6. Avoiding address translation in cache indexing
   - Reduces hit time
1. Larger Block Sizes
- Larger block size -> fewer blocks in the cache
- Obvious advantage: reduces compulsory misses
  - The reason is spatial locality
- Obvious disadvantages:
  - Higher miss penalty: a larger block takes longer to move
  - May increase conflict misses and capacity misses if the cache is small
2. Large Caches
- Larger cache size -> lower miss rate, but higher hit time
- Helps with both conflict and capacity misses
- May need a longer hit time AND/OR higher hardware cost
- Popular in off-chip caches
3. Higher Associativity
- Reduces conflict misses
- 2:1 cache rule of thumb on miss rate:
  - A 2-way set associative cache of size N/2 misses about as often as a direct-mapped cache of size N (held for cache sizes < 128 KB)
- Greater associativity comes at the cost of increased hit time
  - Lengthens the clock cycle
4. Multi-Level Caches
Two-level cache example:
  AMAT_L1 = Hit time_L1 + Miss rate_L1 x Miss penalty_L1
  AMAT_L2 = Hit time_L1 + Miss rate_L1 x (Hit time_L2 + Miss rate_L2 x Miss penalty_L2)
Probably the best miss-penalty reduction method.
Definitions:
- Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss rate_L2)
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss rate_L1 x Miss rate_L2)
- The global miss rate is what matters
Multi-Level Caches (Cont.)
Advantages:
- Capacity misses in L1 end up with a significant penalty reduction
- Conflict misses in L1 similarly get supplied by L2
Holding the size of the 1st-level cache constant:
- Decreases the miss penalty of the 1st-level cache
- Or, increases the average global hit time a bit:
  hit time_L1 + miss rate_L1 x hit time_L2
- But decreases the global miss rate
Holding the total cache size constant:
- Global miss rate and miss penalty are about the same
- Decreases the average global hit time significantly!
- The new L1 is much smaller than the old L1
Miss Rate Example
Suppose that in 1000 memory references there are 40 misses in the first-level cache and 20 misses in the second-level cache.
- Miss rate for the first-level cache = 40/1000 (4%)
- Local miss rate for the second-level cache = 20/40 (50%)
- Global miss rate for the second-level cache = 20/1000 (2%)
Assume miss penalty_L2 is 200 CC, hit time_L2 is 10 CC, hit time_L1 is 1 CC, and there are 1.5 memory references per instruction. What are the average memory access time and the average stall cycles per instruction? Ignore the impact of writes.
- AMAT = Hit time_L1 + Miss rate_L1 x (Hit time_L2 + Miss rate_L2 x Miss penalty_L2) = 1 + 4% x (10 + 50% x 200) = 5.4 CC
- Average memory stalls per instruction = Misses per instruction_L1 x Hit time_L2 + Misses per instruction_L2 x Miss penalty_L2 = (40 x 1.5/1000) x 10 + (20 x 1.5/1000) x 200 = 6.6 CC
- Or (5.4 - 1.0) x 1.5 = 6.6 CC
5. Giving Priority to Read Misses over Writes
In a write-through cache, write buffers complicate memory access: they might hold the updated value of a location needed on a read miss (is R2 == R3 below?).
    SW R3, 512(R0)     ; cache index 0
    LW R1, 1024(R0)    ; cache index 0
    LW R2, 512(R0)     ; cache index 0
- This is a RAW conflict with main memory reads on cache misses
- If the read miss simply waits until the write buffer is empty, the read miss penalty increases
- Better: check the write buffer contents before the read; if there are no conflicts, let the memory access continue (the read gets priority over writes)
Write-back?
- A read miss may replace a dirty block
- Normal: write the dirty block to memory, and then do the read
- Instead: copy the dirty block to a write buffer, then do the read, and then do the write
- The CPU stalls less, since it restarts as soon as the read is done
6. Avoiding Address Translation during Indexing of the Cache
("$" means cache)
[Figure: three organizations. Conventional: CPU -> VA -> TLB -> PA -> cache -> PA -> memory. Overlapped: the cache is indexed with the VA while the TLB translates in parallel. Virtually addressed cache: CPU -> VA -> cache with VA tags; the TLB is consulted only on a miss, before the PA goes to an L2 cache / memory.]
Why not a Virtual Cache?
- A task switch causes the same VA to refer to different PAs
  - Hence, the cache must be flushed
  - Huge task-switch overhead
  - Also creates huge compulsory miss rates for the new process
- The synonym or alias problem: different VAs may map to the same PA
  - Two copies of the same data can exist in a virtual cache
  - An anti-aliasing HW mechanism is required (complicated)
  - SW can help
- I/O (always uses PAs)
  - Requires mapping to VAs to interact with a virtual cache
Advanced Cache Optimizations
Reducing hit time:
1. Small and simple caches
2. Way prediction
Increasing cache bandwidth:
3. Pipelined caches
4. Multibanked caches
5. Nonblocking caches
Reducing miss penalty:
6. Critical word first
7. Merging write buffers
Reducing miss rate:
8. Compiler optimizations
Reducing miss penalty or miss rate via parallelism:
9. Hardware prefetching
10. Compiler prefetching
1. Small and Simple L1 Caches
- The critical timing path in a cache: addressing the tag memory, then comparing tags, then selecting the correct set
  - Indexing the tag memory and then comparing takes time
- Direct-mapped caches can overlap the tag compare and the transmission of data
  - Since there is only one choice
- Lower associativity reduces power, because fewer cache lines are accessed
L1 Size and Associativity
[Figures: access time and energy per read as a function of L1 cache size and associativity.]
2. Fast Hit Times via Way Prediction
- How do we combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set associative cache?
- Way prediction: keep extra bits in the cache to predict the way (the block within the set) of the next cache access
  - The multiplexor is set early to select the desired block; only one tag comparison is performed that clock cycle, in parallel with reading the cache data
  - On a first-cycle miss, check the other blocks for matches in the next clock cycle
- Prediction accuracy is around 85%; a correct prediction gives the fast hit time, a misprediction costs an extra cycle
- Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles
  - Used for instruction caches rather than data caches
Way Prediction
- To improve hit time, predict the way to preset the mux
  - A misprediction gives a longer hit time
  - Prediction accuracy:
    - > 90% for two-way
    - > 80% for four-way
    - The I-cache has better accuracy than the D-cache
  - First used on the MIPS R10000 in the mid-90s
  - Used on the ARM Cortex-A8
- Extend the idea to predict the block as well ("way selection")
  - Increases the misprediction penalty
3. Increasing Cache Bandwidth by Pipelining
- Pipeline cache access to improve bandwidth
- Examples:
  - Pentium: 1 cycle
  - Pentium Pro - Pentium III: 2 cycles
  - Pentium 4 - Core i7: 4 cycles
- Makes it easier to increase associativity
- But pipelining the cache increases the access latency
  - More clock cycles between the issue of the load and the use of the data
- Also increases the branch misprediction penalty
4. Increasing Cache Bandwidth: Non-Blocking Caches
- "Hit under miss" reduces the effective miss penalty by doing useful work during a miss instead of ignoring CPU requests
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  - Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  - Requires multiple memory banks (otherwise it cannot be supported)
  - The Pentium Pro allows 4 outstanding memory misses
Non-blocking Cache Performance
- L2 must support this
- In general, processors can hide the L1 miss penalty but not the L2 miss penalty
Increasing Cache Bandwidth via Multiple Banks
- Rather than treating the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses
  - E.g., the T1 (Niagara) L2 has 4 banks
- Banking works best when the accesses naturally spread themselves across the banks; the mapping of addresses to banks affects the behavior of the memory system
- A simple mapping that works well is sequential interleaving:
  - Spread block addresses sequentially across the banks
  - E.g., with 4 banks, bank 0 has all blocks whose address modulo 4 is 0, bank 1 has all blocks whose address modulo 4 is 1, and so on
5. Increasing Cache Bandwidth via Multibanked Caches
- Organize the cache as independent banks to support simultaneous access (rather than as a single monolithic block)
  - The ARM Cortex-A8 supports 1-4 banks for L2
  - The Intel i7 supports 4 banks for L1 and 8 banks for L2
- Banking works best when the accesses naturally spread themselves across the banks
- Interleave the banks according to block address
6. Reduce Miss Penalty: Critical Word First and Early Restart
- The processor usually needs one word of the block at a time
  - Do not wait for the full block to be loaded before restarting the processor
- Critical word first: request the missed word first from memory and send it to the processor as soon as it arrives; let the processor continue execution while filling the rest of the words in the block. Also called "wrapped fetch" and "requested word first."
- Early restart: as soon as the requested word of the block arrives, send it to the processor and let the processor continue execution
- The benefits of critical word first and early restart depend on:
  - Block size: generally useful only with large blocks
  - The likelihood of another access to the portion of the block that has not yet been fetched
    - The spatial-locality problem: programs tend to want the next sequential word, so it is not clear there is a benefit
7. Merging Write Buffer to Reduce Miss Penalty
- The write buffer allows the processor to continue while waiting for the write to reach memory
- If the buffer contains modified blocks, the addresses can be checked to see if the address of the new data matches the address of a valid write-buffer entry; if so, the new data are combined with that entry
- Increases the effective block size of a write for a write-through cache of writes to sequential words/bytes, since multiword writes are more efficient to memory
Merging Write Buffer
- When storing to a block that is already pending in the write buffer, update the write buffer
- Reduces stalls due to a full write buffer
[Figure: write-buffer contents with and without merging; without merging each word occupies its own buffer entry, with merging sequential words share one entry.]
8. Reducing Misses by Compiler Optimizations
- McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks, in software
- Instructions:
  - Reorder procedures in memory so as to reduce conflict misses
  - Profiling to look at conflicts (using tools they developed)
- Data:
  - Loop interchange: swap nested loops to access data in the order it is stored in memory (in sequential order)
  - Loop fusion: combine 2 independent loops that have the same looping and some overlapping variables
  - Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
    - Instead of accessing entire rows or columns, subdivide the matrices into blocks
    - Requires more memory accesses, but improves the locality of the accesses
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
for (j = 0; j < 100; j = j+1)
for (i = 0; i < 5000; i = i+1)
x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
for (i = 0; i < 5000; i = i+1)
for (j = 0; j < 100; j = j+1)
x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding through memory every 100 words; improved spatial locality.
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{
a[i][j] = 1/b[i][j] * c[i][j];
d[i][j] = a[i][j] + c[i][j];}
The two loops perform different computations on the same data, so fuse them: the second statement can then reuse a[i][j] and c[i][j] while they are still in the cache.
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{r = 0;
for (k = 0; k < N; k = k+1){
r = r + y[i][k]*z[k][j];};
x[i][j] = r;
};
Two inner loops:
- Read all N x N elements of z[]
- Read N elements of 1 row of y[] repeatedly
- Write N elements of 1 row of x[]
Capacity misses are a function of N and the cache size:
- 2N^3 + N^2 words accessed (assuming no conflicts; otherwise worse)
Idea: compute on a B x B submatrix that fits in the cache.
Snapshot of x, y, z when N = 6, i = 1 (before blocking).
[Figure: the shading shows which elements have been touched: all of z, one row of y read repeatedly, one row of x written.]
Blocking Example
/* After */
for (jj = 0; jj < N; jj = jj+B)
for (kk = 0; kk < N; kk = kk+B)
for (i = 0; i < N; i = i+1)
for (j = jj; j < min(jj+B,N); j = j+1)
{r = 0;
for (k = kk; k < min(kk+B,N); k = k+1) {
r = r + y[i][k]*z[k][j];};
x[i][j] = x[i][j] + r;
};
- B is called the blocking factor
- Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
- Conflict misses, too?
The age of accesses to x, y, z when B = 3.
[Figure: with blocking, only a B x B portion of each array is active at a time.]
9. Reducing Miss Penalty or Miss Rate by Hardware Prefetching of Instructions & Data
- Prefetching relies on having extra memory bandwidth that can be used without penalty
- Instruction prefetching:
  - Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block
  - The requested block is placed in the instruction cache when it returns, and the prefetched block is placed into an instruction stream buffer
- Data prefetching:
  - The Pentium 4 can prefetch data into the L2 cache from up to 8 streams from 8 different 4 KB pages
  - Prefetching is invoked if there are 2 successive L2 cache misses to a page, and if the distance between those cache blocks is < 256 bytes
[Figure: performance improvement from hardware prefetching on an Intel Pentium 4, for SPECint2000 benchmarks (gap, mcf) and SPECfp2000 benchmarks (including wupwise, swim, applu, galgel, facerec, lucas, mgrid, equake); speedups range from 1.16 to 1.97.]
10. Reducing Miss Penalty or Miss Rate by Compiler-Controlled Prefetching of Data
- A prefetch instruction is inserted before the data is needed
- Data prefetch:
  - Register prefetch: load the data into a register (HP PA-RISC loads)
  - Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v.9)
  - Special prefetching instructions cannot cause faults; a form of speculative execution
- Issuing prefetch instructions takes time
  - Is the cost of the prefetch issues < the savings in reduced misses?
  - Wider superscalar issue reduces the difficulty of issue bandwidth
  - Combine with software pipelining and loop unrolling
Summary
[Table: the ten advanced optimizations and their effects on hit time, bandwidth, miss penalty, miss rate, power, and hardware complexity.]
Memory Technology
- Performance metrics:
  - Latency is the concern of the cache
  - Bandwidth is the concern of multiprocessors and I/O
  - Access time: the time between when a read is requested and when the desired word arrives
  - Cycle time: the minimum time between unrelated requests to memory
- DRAM is used for main memory; SRAM is used for caches
Memory Technology
- SRAM: static random access memory
  - Requires low power to retain its bits, since there is no refresh
  - But requires 6 transistors/bit (vs. 1 transistor/bit for DRAM)
- DRAM:
  - One transistor/bit
  - Must be rewritten after being read
  - Must also be periodically refreshed (every ~8 ms); each row can be refreshed simultaneously
  - Address lines are multiplexed:
    - Upper half of the address: row access strobe (RAS)
    - Lower half of the address: column access strobe (CAS)
DRAM Technology
- Emphasis on cost per bit and capacity
- Multiplexed address lines cut the number of address pins in half
  - Row access strobe (RAS) first, then column access strobe (CAS)
  - Memory is organized as a 2-D matrix; rows go to a buffer
  - A subsequent CAS selects a subrow
- Uses only a single transistor to store a bit
  - Reading that bit can destroy the information
  - Refresh each bit periodically (e.g., every 8 milliseconds) by writing it back
  - Keep the refreshing time to less than 5% of the total time
- DRAM capacity is 4 to 8 times that of SRAM
DRAM Logical Organization (4 Mbit)
[Figure: an 11-bit address (A0-A10) drives the row and column decoders of a 2,048 x 2,048 memory array; each storage cell sits on a word line. The array is square, so the number of bits per RAS/CAS address is the square root of the total bits.]
DRAM Technology (cont.)
- DIMM: dual inline memory module
  - DRAM chips are commonly sold on small boards called DIMMs
  - DIMMs typically contain 4 to 16 DRAMs
- Slowdown in DRAM capacity growth:
  - Four times the capacity every three years, for more than 20 years
  - New chips only double capacity every two years, since 1998
- DRAM performance is growing at a slower rate:
  - RAS (related to latency): 5% per year
  - CAS (related to bandwidth): 10%+ per year
RAS Improvement
[Figure: row access strobe (RAS) timing improvements across DRAM generations.]
Quest for DRAM Performance
1. Fast page mode
   - Add timing signals that allow repeated accesses to the row buffer without another row access time
   - Such a buffer comes naturally, as each array buffers 1024 to 2048 bits for each access
2. Synchronous DRAM (SDRAM)
   - Add a clock signal to the DRAM interface, so that the repeated transfers do not bear the overhead of synchronizing with the DRAM controller
3. Double Data Rate (DDR SDRAM)
   - Transfer data on both the rising edge and the falling edge of the DRAM clock signal, doubling the peak data rate
   - DDR2 lowers power by dropping the voltage from 2.5 V to 1.8 V, and offers higher clock rates: up to 400 MHz
   - DDR3 drops to 1.5 V, with clock rates up to 800 MHz
   - DDR4 drops to 1.2 V, with clock rates up to 1600 MHz
These improve bandwidth, not latency.
The DRAM name is based on peak chip transfers/sec; the DIMM name is based on peak DIMM MBytes/sec.

Standard | Clock rate (MHz) | M transfers/s (= 2 x clock) | DRAM name | MBytes/s/DIMM (= 8 x M transfers/s) | DIMM name
DDR      | 133 | 266  | DDR266    | 2128  | PC2100
DDR      | 150 | 300  | DDR300    | 2400  | PC2400
DDR      | 200 | 400  | DDR400    | 3200  | PC3200
DDR2     | 266 | 533  | DDR2-533  | 4264  | PC4300
DDR2     | 333 | 667  | DDR2-667  | 5336  | PC5300
DDR2     | 400 | 800  | DDR2-800  | 6400  | PC6400
DDR3     | 533 | 1066 | DDR3-1066 | 8528  | PC8500
DDR3     | 666 | 1333 | DDR3-1333 | 10664 | PC10700
DDR3     | 800 | 1600 | DDR3-1600 | 12800 | PC12800
DRAM Performance
[Figure: DRAM performance trends.]
Graphics Memory
- GDDR5 is graphics memory based on DDR3
- Graphics memory:
  - Achieves 2-5x the bandwidth per DRAM vs. DDR3
    - Wider interfaces (32 vs. 16 bits)
    - Higher clock rate
      - Possible because the chips are attached via soldering instead of socketed DIMM modules
Memory Power Consumption
[Figure: memory power consumption.]
SRAM Technology
- Caches use SRAM: static random access memory
- SRAM uses six transistors per bit to prevent the information from being disturbed when read
  - No need to refresh
  - SRAM needs only minimal power to retain the charge in standby mode: good for embedded applications
- There is no difference between access time and cycle time for SRAM
- Emphasis on speed and capacity
  - SRAM address lines are not multiplexed
- SRAM speed is 8 to 16x that of DRAM
ROM and Flash
- Embedded processor memory
- Read-only memory (ROM):
  - Programmed at the time of manufacture
  - Only a single transistor per bit to represent 1 or 0
  - Used for the embedded program and for constants
  - Nonvolatile and indestructible
- Flash memory:
  - Must be erased (in blocks) before being overwritten
  - Nonvolatile, but allows the memory to be modified
  - Reads at almost DRAM speeds, but writes are 10 to 100 times slower
  - DRAM capacity per chip and MB per dollar are about 4 to 8 times greater than flash
  - Cheaper than SDRAM, more expensive than disk
  - Slower than SRAM, faster than disk
Memory Dependability
- Memory is susceptible to cosmic rays
- Soft errors: dynamic errors
  - Detected and fixed by error-correcting codes (ECC)
- Hard errors: permanent errors
  - Use spare rows to replace defective rows
- Chipkill: a RAID-like error recovery technique
Virtual Memory?
The limits of physical addressing:
- All programs share one physical address space
- Machine-language programs must be aware of the machine organization
- There is no way to prevent a program from accessing any machine resource
Recall: many processes use only a small portion of their address space.
Virtual memory divides physical memory into blocks (called pages or segments) and allocates them to different processes.
With virtual memory, the processor produces virtual addresses that are translated by a combination of HW and SW to physical addresses (called memory mapping or address translation).
Virtual Memory: Add a Layer of Indirection
[Figure: the CPU issues virtual addresses (A0-A31, D0-D31); an address-translation layer maps them to the physical addresses (A0-A31, D0-D31) seen by memory, where the data lives.]
Virtual Memory
[Figure: virtual pages mapped to physical page frames by a page table.]
Virtual Memory (cont.)
- Permits applications to grow bigger than the main memory size
- Helps with multiple-process management:
  - Each process gets its own chunk of memory
  - Permits protection of one process's chunks from another
  - Maps multiple chunks onto shared physical memory
  - Mapping also facilitates relocation (a program can run in any memory location, and can be moved during execution)
  - The application and CPU run in virtual space (logical memory, 0 to max); the mapping onto physical space is invisible to the application
- Cache vs. virtual memory:
  - A block becomes a page or segment
  - A miss becomes a page fault or address fault
3 Advantages of VM
- Translation:
  - A program can be given a consistent view of memory, even though physical memory is scrambled
  - Makes multithreading reasonable (now used a lot!)
  - Only the most important part of the program (the working set) must be in physical memory
  - Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later
- Protection:
  - Different threads (or processes) are protected from each other
  - Different pages can be given special behavior (read-only, invisible to user programs, etc.)
  - Kernel data is protected from user programs
  - Very important for protection against malicious programs
- Sharing:
  - Can map the same physical page to multiple users (shared memory)
Protection via Virtual Memory
- Keeps processes in their own memory space
- Role of the architecture:
  - Provide user mode and supervisor mode
  - Protect certain aspects of CPU state
  - Provide mechanisms for switching between user mode and supervisor mode
  - Provide mechanisms to limit memory accesses
  - Provide a TLB to translate addresses
Virtual Memory
[Figure: a virtual address is mapped to a frame in the physical memory space; the OS manages the page table for each ASID. A machine usually supports pages of a few sizes (MIPS R4000).]
[Figure: address translation through a page table. The virtual address is split into a virtual page number and a 12-bit offset. The page table base register plus the virtual page number index into the page table, which is itself located in physical memory; the selected entry holds the access rights and the physical page number. Concatenating the physical page number with the 12-bit offset yields the physical address of a frame in the physical memory space.]
Page Table Entry (PTE)?
What is in a page table entry (PTE)?
- A pointer to the next-level page table, or to the actual page
- Permission bits: valid, read-only, read-write, write-only
Example: the Intel x86 architecture PTE:
- The address has the same format as the previous slide (10, 10, 12-bit offset)
- Intermediate page tables are called "directories"
- Bit layout (bits 11-0): Free (OS) | L | D | A | PCD | PWT | U | W | P
  - P: present (same as the valid bit in other architectures)
  - W: writeable
  - U: user accessible
  - PWT: page write transparent - external cache write-through
  - PCD: page cache disabled (the page cannot be cached)
  - A: accessed - the page has been accessed recently
  - D: dirty (PTE only) - the page has been modified recently
  - L: L = 1 means a 4 MB page (directory entry only); the bottom 22 bits of the virtual address serve as the offset
Cache vs. Virtual Memory
- Replacement:
  - A cache miss is handled by hardware
  - A page fault is usually handled by the OS
- Addresses:
  - The virtual memory space is determined by the address size of the CPU
  - The cache size is independent of the CPU address size
- Lower-level memory:
  - For caches, the main memory is not shared by something else
  - For virtual memory, most of the disk contains the file system
    - The file system is addressed differently, usually in I/O space
    - The virtual memory lower level is usually called swap space
The Same 4 Questions for Virtual Memory
- Block placement:
  - The choice: lower miss rates with complex placement, or vice versa
  - The miss penalty is huge, so choose a low miss rate: place the page anywhere
  - Similar to the fully associative cache model
- Block identification: both use an additional data structure
  - Fixed-size pages: use a page table
  - Variable-sized segments: use a segment table
- Block replacement: LRU is the best
  - However, true LRU is a bit complex, so use an approximation:
    - The page table contains a use tag, and on access the use tag is set
    - The OS checks the tags every so often, records what it sees in a data structure, and then clears them all
    - On a miss, the OS decides which page has been used the least and replaces it
- Write strategy: always write back
  - Due to the access time of the disk, write-through is silly
  - Use a dirty bit to write back only pages that have been modified
Techniques for Fast Address Translation
- The page table is kept in main memory (kernel memory)
  - Each process has a page table
- Every data/instruction access requires two memory accesses:
  - One for the page table and one for the data/instruction
  - This can be solved by a special fast lookup hardware cache called associative registers, or a translation lookaside buffer (TLB)
- If locality applies, then cache the recent translations
  - TLB = translation lookaside buffer
  - A TLB entry: virtual page number, physical page number, protection bit, use bit, dirty bit
Translation Lookaside Buffers
A translation lookaside buffer (TLB) is a cache on translations:
- It can be fully associative, set associative, or direct mapped
- TLBs cache page table entries
[Figure: translation with a TLB. The CPU issues a VA; on a TLB hit, the PA goes straight to the cache (and on to main memory on a cache miss). On a TLB miss, the full translation is performed through the page table and the result is cached in the TLB. The virtual page number is looked up in the TLB (per ASID); on a hit, the physical frame number is concatenated with the page offset to form the physical address.]
Caching Applied to Address Translation
[Figure: the CPU sends a virtual address to the TLB. If the translation is cached, the physical address goes directly to physical memory; if not, the MMU translates it and the result is cached in the TLB. Data reads and writes then proceed untranslated with the physical address.]
Virtual Machines
- Support isolation and security
- Allow sharing a computer among many unrelated users
- Enabled by the raw speed of processors, which makes the overhead more acceptable
- Allow different ISAs and operating systems to be presented to user programs
- System virtual machines:
  - The SVM software is called a virtual machine monitor or hypervisor
  - The individual virtual machines that run under the monitor are called guest VMs
Impact of VMs on Virtual Memory
- Each guest OS maintains its own set of page tables
- The VMM adds a level of memory between physical and virtual memory, called "real memory"
- The VMM maintains a shadow page table that maps guest virtual addresses to physical addresses
  - This requires the VMM to detect the guest's changes to its own page table
  - Detection occurs naturally if accessing the page table pointer is a privileged operation