
Computer Architecture

Lecture 3: Memory Hierarchy Design (Chapter 2, Appendix B)

Chih-Wei Liu
National Chiao Tung University
cwliu@twins.ee.nctu.edu.tw

Introduction

Since 1980, CPU performance has outpaced DRAM:
- CPU: ~60% per year (2x in 1.5 years)
- DRAM: ~9% per year (2x in 10 years)
- The processor-memory gap grew ~50% per year

CA-Lec3 cwliu@twins.ee.nctu.edu.tw

Introduction

- Programmers want unlimited amounts of memory with low latency
- Fast memory technology is more expensive per bit than slower memory
- Solution: organize the memory system into a hierarchy
  - Entire addressable memory space available in the largest, slowest memory
  - Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
- Temporal and spatial locality ensure that nearly all references can be found in the smaller memories
  - Gives the illusion of a large, fast memory being presented to the processor

Memory Hierarchy Design

- Memory hierarchy design becomes more crucial with recent multicore processors:
- Aggregate peak bandwidth grows with # cores:
  - Intel Core i7 can generate two references per core per clock
  - Four cores and a 3.2 GHz clock:
    - 25.6 billion 64-bit data references/second +
    - 12.8 billion 128-bit instruction references
    - = 409.6 GB/s!
  - DRAM bandwidth is only 6% of this (25 GB/s)
  - Requires:
    - Multi-port, pipelined caches
    - Two levels of cache per core
    - Shared third-level cache on chip

Memory Hierarchy

- Take advantage of the principle of locality to:
  - Present as much memory as in the cheapest technology
  - Provide access at the speed offered by the fastest technology

[Figure: the memory hierarchy, from the processor (control, datapath) outward]
- Registers:                          ~1 ns,                      100s of bytes
- On-chip / second-level cache (SRAM): 10s-100s ns,               KBs-MBs
- Main memory (DRAM/FLASH/PCM):       100s ns,                    MBs
- Secondary storage (disk/FLASH/PCM): 10,000,000s ns (10s of ms), GBs
- Tertiary storage (tape/cloud storage): 10,000,000,000s ns (10s of sec), TBs

Multicore Architecture

[Figure: multiple processing nodes, each containing a CPU and a local memory
hierarchy of optimal fixed size, connected through an interconnection network]

The Principle of Locality

- Programs access a relatively small portion of the address space at any instant of time.
- Two different types of locality:
  - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  - Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
- HW has relied on locality for speed
Memory Hierarchy Basics

- When a word is not found in the cache, a miss occurs:
  - Fetch the word from a lower level in the hierarchy, requiring a higher-latency reference
  - The lower level may be another cache or the main memory
  - Also fetch the other words contained within the block
    - Takes advantage of spatial locality
  - Place the block into the cache in any location within its set, determined by the address:
    - (block address) MOD (number of sets)

Hit and Miss

- Hit: the data appears in some block in the upper level (e.g., Block X)
  - Hit rate: the fraction of memory accesses found in the upper level
  - Hit time: time to access the upper level, which consists of
    RAM access time + time to determine hit/miss
- Miss: the data needs to be retrieved from a block in the lower level (Block Y)
  - Miss rate = 1 - (hit rate)
  - Miss penalty: time to replace a block in the upper level +
    time to deliver the block to the processor
- Hit time << miss penalty (500 instructions on the Alpha 21264!)

[Figure: on a hit, Block X is supplied from upper-level memory to the processor;
on a miss, Block Y is brought from lower-level memory into the upper level]

Cache Performance Formulas

(Average memory access time) = (Hit time) + (Miss rate) x (Miss penalty)

  T_acc = T_hit + f_miss x T_miss

- The times T_acc, T_hit, and T_miss can all be either:
  - Real time (e.g., nanoseconds), or
  - Number of clock cycles, in contexts where the cycle time is known to be a constant
- Important: T_miss means the extra (not total) time for a miss,
  in addition to T_hit, which is incurred by all accesses

[Figure: CPU <-> cache (hit time) <-> lower levels of hierarchy (miss penalty)]
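The formula above maps directly onto a one-line helper. This is a minimal sketch; the numbers in the usage below are illustrative, not taken from the slides.

```c
#include <assert.h>

/* Average memory access time: T_acc = T_hit + f_miss * T_miss.
   Times may be in ns or clock cycles, as long as units are consistent. */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
```

For example, a 1-cycle hit time, 5% miss rate, and 100-cycle miss penalty give an average access time of 6 cycles.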

Four Questions for Memory Hierarchy

- Consider any level in a memory hierarchy.
  - Remember that a block is the unit of data transfer between the given level and the levels below it.
- The level design is described by four behaviors:
  - Block placement: where could a new block be placed in the level?
  - Block identification: how is a block found if it is in the level?
  - Block replacement: which existing block should be replaced if necessary?
  - Write strategy: how are writes to the block handled?

Q1: Where can a block be placed in the upper level?

- Block 12 placed in an 8-block cache:
  - Fully associative, direct mapped, 2-way set associative
  - S.A. mapping = (block number) modulo (number of sets)
  - Fully associative (full mapping): block 12 can go into any of frames 0-7
  - Direct mapped: block 12 can go only into frame (12 mod 8) = 4
  - 2-way set associative: block 12 can go anywhere in set (12 mod 4) = 0

[Figure: cache block frames 0-7 and memory blocks 0-31 for the three organizations]
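The mapping rule on this slide can be sketched as a tiny helper; the assertions mirror the slide's block-12 example (direct mapped = 8 sets of 1 block, 2-way = 4 sets, fully associative = 1 set).

```c
/* Set placement: set = (block address) MOD (number of sets).
   A direct-mapped cache has one block per set; a fully associative
   cache has a single set. */
unsigned set_index(unsigned block_addr, unsigned num_sets) {
    return block_addr % num_sets;
}
```

With 8 sets, block 12 maps to set 4; with 4 sets (2-way in an 8-block cache), to set 0; with 1 set (fully associative), trivially to set 0.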

Q2: How is a block found if it is in the upper level?

- Block address = [ Tag | Index ], followed by the block offset
  - Index (set select): identifies the set in the cache; used to look up candidates
  - Tag: used to identify the actual copy
    - If no candidates match, then declare a cache miss
  - Block offset (data select): selects data within the block
    - The block is the minimum quantum of caching
    - Many caching applications don't have a data-select field
- A larger block size has distinct hardware advantages:
  - Less tag overhead
  - Exploits fast burst transfers from DRAM / over wide busses
- Disadvantages of a larger block size?
  - Fewer blocks -> more conflicts. Can waste bandwidth

Review: Direct-Mapped Cache

- Direct-mapped 2^N-byte cache:
  - The uppermost (32 - N) bits are always the cache tag
  - The lowest M bits are the byte select (block size = 2^M)
- Example: 1KB direct-mapped cache with 32B blocks
  - Address bits 31-10: cache tag (e.g., 0x50)
  - Address bits 9-5: cache index (e.g., 0x01)
  - Address bits 4-0: byte select (e.g., 0x00)
  - Index chooses a potential block
  - Tag is checked (against the stored tag, with a valid bit) to verify the block
  - Byte select chooses the byte within the block

[Figure: cache data array of 32 rows of 32 bytes (bytes 0-1023), indexed by
the cache index; row 1 holds bytes 32-63 with stored tag 0x50]
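The slide's address breakdown can be sketched as bit-field extraction. The constants below encode the 1KB / 32B-block example (5 offset bits, 5 index bits, 22 tag bits); the test address is built from the slide's example field values (tag 0x50, index 0x01, byte 0x00).

```c
#include <stdint.h>

/* Field widths for a 1KB direct-mapped cache with 32B blocks. */
enum { OFFSET_BITS = 5, INDEX_BITS = 5 };

uint32_t byte_select(uint32_t addr) {           /* bits 4-0   */
    return addr & ((1u << OFFSET_BITS) - 1);
}
uint32_t cache_index(uint32_t addr) {           /* bits 9-5   */
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}
uint32_t cache_tag(uint32_t addr) {             /* bits 31-10 */
    return addr >> (OFFSET_BITS + INDEX_BITS);
}
```

Address 0x14020 (= 0x50 << 10 | 0x01 << 5) splits into tag 0x50, index 1, offset 0.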

Direct-Mapped Cache Architecture

[Figure: the address is split into tag, frame #, and offset; the frame #
decodes and selects a row of the tag and block-frame arrays; the stored tag
is compared against the address tag to produce Hit, while the offset drives
the mux that selects the data word]

Review: Set-Associative Cache

- N-way set associative: N entries per cache index
  - N direct-mapped caches operate in parallel
- Example: two-way set-associative cache
  - The cache index selects a set from the cache
  - The two tags in the set are compared to the input in parallel
  - Data is selected based on the tag comparison result

[Figure: address bits 31-9 form the cache tag, bits 8-5 the cache index,
bits 4-0 the byte select; two valid/tag/data ways are read in parallel, both
stored tags are compared against the address tag, the OR of the comparisons
gives Hit, and Sel0/Sel1 drive the mux that picks the cache block]

Review: Fully Associative Cache

- Fully associative: every block frame can hold any line
  - The address does not include a cache index
  - Compare the cache tags of all cache entries in parallel
- Example: block size = 32B blocks
  - We need N 27-bit comparators (tag = address bits 31-5)
  - Still have the byte select (bits 4-0, e.g., 0x01) to choose from within the block

[Figure: each entry holds a valid bit, a 27-bit tag, and 32 bytes of data;
all stored tags are compared (=) against the address tag simultaneously]

Concluding Remarks

- Direct-mapped cache = 1-way set-associative cache
- Fully associative cache: there is only 1 set

Cache Size Equation

- Simple equation for the size of a cache:
  (Cache size) = (Block size) x (Number of sets) x (Set associativity)
- Can relate to the sizes of the various address fields:
  (Block size) = 2^(# of offset bits)
  (Number of sets) = 2^(# of index bits)
  (# of tag bits) = (# of memory address bits) - (# of index bits) - (# of offset bits)

Memory address = [ Tag | Index | Offset ]
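These equations invert cleanly into code. As a sketch (field sizes are assumed powers of two), the helper below recovers the tag width; the test reuses the earlier 1KB direct-mapped / 32B-block example, which has 1024/32 = 32 sets and hence 32 - 5 - 5 = 22 tag bits.

```c
/* Integer log2 for a power-of-two argument. */
unsigned log2u(unsigned x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

/* tag bits = address bits - index bits - offset bits,
   where index bits = log2(number of sets), offset bits = log2(block size). */
unsigned tag_bits(unsigned addr_bits, unsigned block_size, unsigned num_sets) {
    return addr_bits - log2u(num_sets) - log2u(block_size);
}
```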

Q3: Which block should be replaced on a miss?

- Easy for a direct-mapped cache
  - Only one choice
- Set associative or fully associative:
  - LRU (least recently used)
    - Appealing, but hard to implement for high associativity
  - Random
    - Easy, but how well does it work?
  - First in, first out (FIFO)

Q4: What happens on a write?

                        Write-Through               Write-Back
Policy                  Data written to the cache   Write data only to the
                        block is also written to    cache; update the lower
                        lower-level memory          level when a block falls
                                                    out of the cache
Debug                   Easy                        Hard
Do read misses          No                          Yes
produce writes?
Do repeated writes      Yes                         No
make it to the
lower level?

Additional option: let writes to an un-cached address allocate a new cache
line (write-allocate).

Write Buffers

[Figure: processor <-> cache, with a write buffer between the processor and
lower-level memory]

- The write buffer holds data awaiting write-through to lower-level memory

- Q. Why a write buffer?
  A. So the CPU doesn't stall.
- Q. Why a buffer, why not just one register?
  A. Bursts of writes are common.
- Q. Are Read-After-Write (RAW) hazards an issue for the write buffer?
  A. Yes! Drain the buffer before the next read, or check the write
     buffer for a match on reads.

More on Cache Performance Metrics

- Can split access time into instructions & data:
  Avg. mem. access time =
    (% instruction accesses) x (inst. mem. access time) +
    (% data accesses) x (data mem. access time)
- Another formula, from Chapter 1:
  CPU time = (CPU execution clock cycles + Memory stall clock cycles) x cycle time
  - Useful for exploring ISA changes
- Can break stalls into reads and writes:
  Memory stall cycles =
    (Reads x read miss rate x read miss penalty) +
    (Writes x write miss rate x write miss penalty)

Sources of Cache Misses

- Compulsory (cold start or process migration, first reference): the first access to a block
  - Cold fact of life: not a whole lot you can do about it
  - Note: if you are going to run billions of instructions, compulsory misses are insignificant
- Capacity:
  - The cache cannot contain all blocks accessed by the program
  - Solution: increase the cache size
- Conflict (collision):
  - Multiple memory locations map to the same cache location
  - Solution 1: increase the cache size
  - Solution 2: increase associativity
- Coherence (invalidation): another process (e.g., I/O) updates memory

Memory Hierarchy Basics

Six basic cache optimizations:
- Larger block size
  - Reduces compulsory misses
  - Increases capacity and conflict misses, increases miss penalty
- Larger total cache capacity to reduce miss rate
  - Increases hit time, increases power consumption
- Higher associativity
  - Reduces conflict misses
  - Increases hit time, increases power consumption
- Higher number of cache levels
  - Reduces overall memory access time
- Giving priority to read misses over writes
  - Reduces miss penalty
- Avoiding address translation in cache indexing
  - Reduces hit time

1. Larger Block Sizes

- Larger block size -> smaller number of blocks
- Obvious advantage: reduces compulsory misses
  - The reason is spatial locality
- Obvious disadvantages:
  - Higher miss penalty: a larger block takes longer to move
  - May increase conflict and capacity misses if the cache is small
- Don't let the increase in miss penalty outweigh the decrease in miss rate

2. Large Caches

- Cache size up -> miss rate down, hit time up
- Helps with both conflict and capacity misses
- May need a longer hit time AND/OR higher HW cost
- Popular in off-chip caches

3. Higher Associativity

- Reduces conflict misses
- 2:1 cache rule of thumb on miss rate:
  - A 2-way set-associative cache of size N/2 has about the same miss rate
    as a direct-mapped cache of size N (holds for cache sizes < 128KB)
- Greater associativity comes at the cost of increased hit time
  - Lengthens the clock cycle

4. Multi-Level Caches

- 2-level cache example:
  - AMAT_L1 = Hit time_L1 + Miss rate_L1 x Miss penalty_L1
  - AMAT_L2 = Hit time_L1 + Miss rate_L1 x (Hit time_L2 + Miss rate_L2 x Miss penalty_L2)
- Probably the best miss-penalty reduction method
- Definitions:
  - Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss rate_L2)
  - Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss rate_L1 x Miss rate_L2)
  - The global miss rate is what matters

Multi-Level Caches (Cont.)

- Advantages:
  - Capacity misses in L1 end up with a significant penalty reduction
  - Conflict misses in L1 similarly get supplied by L2
- Holding the size of the 1st-level cache constant:
  - Decreases the miss penalty of the 1st-level cache.
  - Or, increases the average global hit time a bit:
    hit time_L1 + miss rate_L1 x hit time_L2
    but decreases the global miss rate
- Holding total cache size constant:
  - Global miss rate and miss penalty are about the same
  - Decreases average global hit time significantly!
    - The new L1 is much smaller than the old L1

Miss Rate Example

- Suppose that in 1000 memory references there are 40 misses in the first-level cache and 20 misses in the second-level cache
  - Miss rate for the first-level cache = 40/1000 (4%)
  - Local miss rate for the second-level cache = 20/40 (50%)
  - Global miss rate for the second-level cache = 20/1000 (2%)
- Assume miss penalty_L2 is 200 CC, hit time_L2 is 10 CC, hit time_L1 is 1 CC, and there are 1.5 memory references per instruction. What are the average memory access time and the average stall cycles per instruction? Ignore the impact of writes.
  - AMAT = Hit time_L1 + Miss rate_L1 x (Hit time_L2 + Miss rate_L2 x Miss penalty_L2)
         = 1 + 4% x (10 + 50% x 200) = 5.4 CC
  - Average memory stalls per instruction
         = Misses per instruction_L1 x Hit time_L2 + Misses per instruction_L2 x Miss penalty_L2
         = (40 x 1.5/1000) x 10 + (20 x 1.5/1000) x 200 = 6.6 CC
  - Or (5.4 - 1.0) x 1.5 = 6.6 CC
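The two-level AMAT formula used in this example can be written as a small helper; the test recomputes the slide's numbers (4% L1 miss rate, 50% local L2 miss rate, hit times of 1 and 10 CC, 200 CC L2 miss penalty) and checks that it comes out to 5.4 CC.

```c
/* Two-level AMAT:
   AMAT = hit1 + mr1 * (hit2 + mr2_local * penalty2),
   where mr2_local is the L2 *local* miss rate. */
double amat2(double hit1, double mr1,
             double hit2, double mr2_local, double penalty2) {
    return hit1 + mr1 * (hit2 + mr2_local * penalty2);
}
```

Average memory stalls per instruction then follow as (AMAT - hit1) x (references per instruction): (5.4 - 1.0) x 1.5 = 6.6 CC.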

5. Giving Priority to Read Misses over Writes

    SW R3, 512(R0)    ; cache index 0
    LW R1, 1024(R0)   ; cache index 0
    LW R2, 512(R0)    ; cache index 0   -- is R2 = R3?

- In write-through, write buffers complicate memory access: they might hold the updated value of a location needed on a read miss
  - RAW conflicts with main memory reads on cache misses
  - If the read miss waits until the write buffer is empty -> increased read miss penalty
  - Instead, check the write-buffer contents before the read; if there are no conflicts, let the memory access continue (read priority over write)
- Write-back?
  - Read miss replacing a dirty block
  - Normal: write the dirty block to memory, and then do the read
  - Instead, copy the dirty block to a write buffer, then do the read, and then do the write
  - The CPU stalls less, since it restarts as soon as the read is done

6. Avoiding Address Translation during Indexing of the Cache

[Figure: three organizations ($ means cache):
 - Conventional organization: CPU -> VA -> TLB -> PA -> $ -> PA -> MEM
 - Virtually addressed cache: CPU -> VA -> $ (VA tags); translate only on a
   miss (TLB -> PA -> MEM); synonym (alias) problem across translation
 - Overlapped organization: CPU -> VA -> TLB and L1 $ accessed in parallel
   (VA tags; L2 $ physically addressed -> PA -> MEM); overlapping $ access
   with VA translation requires the $ index to remain invariant across
   translation]

Why not Virtual Cache?

- A task switch causes the same VA to refer to different PAs
  - Hence, the cache must be flushed
    - Huge task-switch overhead
    - Also creates huge compulsory miss rates for the new process
- Synonyms or aliases: different VAs may map to the same PA
  - Two copies of the same data in a virtual cache
    - An anti-aliasing HW mechanism is required (complicated)
    - SW can help
- I/O (always uses PA)
  - Requires a mapping to VA to interact with a virtual cache

Advanced Cache Optimizations

- Reducing hit time
  1. Small and simple caches
  2. Way prediction
- Increasing cache bandwidth
  3. Pipelined caches
  4. Non-blocking caches
  5. Multibanked caches
- Reducing miss penalty
  6. Critical word first
  7. Merging write buffers
- Reducing miss rate
  8. Compiler optimizations
- Reducing miss penalty or miss rate via parallelism
  9. Hardware prefetching
  10. Compiler prefetching

1. Small and Simple L1 Caches

- Critical timing path in a cache:
  - addressing the tag memory, then comparing tags, then selecting the correct set
  - Indexing the tag memory and then comparing takes time
- Direct-mapped caches can overlap the tag compare and transmission of the data
  - Since there is only one choice
- Lower associativity reduces power, because fewer cache lines are accessed

L1 Size and Associativity

[Figure: access time vs. size and associativity]

[Figure: energy per read vs. size and associativity]

2. Fast Hit Times via Way Prediction

- How to combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
- Way prediction: keep extra bits in the cache to predict the way, or block within the set, of the next cache access.
  - The multiplexor is set early to select the desired block; only 1 tag comparison is performed that clock cycle, in parallel with reading the cache data
  - Miss -> check the other blocks for matches in the next clock cycle
    (a way-miss adds an extra hit time before any normal miss penalty)
- Accuracy ~ 85%
- Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles
  - Used for instruction caches vs. data caches

Way Prediction (Cont.)

- To improve hit time, predict the way to pre-set the mux
  - A misprediction gives a longer hit time
  - Prediction accuracy:
    - > 90% for two-way
    - > 80% for four-way
    - The I-cache has better accuracy than the D-cache
  - First used on the MIPS R10000 in the mid-90s
  - Used on the ARM Cortex-A8
- Extend to predict the block as well ("way selection")
  - Increases the misprediction penalty

3. Increasing Cache Bandwidth by Pipelining

- Pipeline cache access to improve bandwidth
- Examples:
  - Pentium: 1 cycle
  - Pentium Pro - Pentium III: 2 cycles
  - Pentium 4 - Core i7: 4 cycles
- Makes it easier to increase associativity
- But pipelining the cache increases the access latency
  - More clock cycles between the issue of the load and the use of the data
- Also increases the branch misprediction penalty

4. Increasing Cache Bandwidth: Non-Blocking Caches

- A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
  - Requires F/E bits on registers, or out-of-order execution
- "Hit under miss" reduces the effective miss penalty by working during a miss vs. ignoring CPU requests
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  - Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  - Requires multiple memory banks (otherwise it cannot be supported)
  - The Pentium Pro allows 4 outstanding memory misses

Non-Blocking Cache Performance

[Figure: performance of non-blocking caches]

- L2 must support this
- In general, processors can hide an L1 miss penalty but not an L2 miss penalty

5. Increasing Cache Bandwidth via Multiple Banks

- Rather than treating the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses
  - E.g., the T1 (Niagara) L2 has 4 banks
- Banking works best when the accesses naturally spread themselves across the banks -> the mapping of addresses to banks affects the behavior of the memory system
- A simple mapping that works well is sequential interleaving
  - Spread block addresses sequentially across the banks
  - E.g., if there are 4 banks, bank 0 has all blocks whose address modulo 4 is 0; bank 1 has all blocks whose address modulo 4 is 1; ...

Multibanked Caches (Cont.)

- Organize the cache as independent banks to support simultaneous access (rather than as a single monolithic block)
  - The ARM Cortex-A8 supports 1-4 banks for L2
  - The Intel i7 supports 4 banks for L1 and 8 banks for L2
- Banking works best when accesses naturally spread themselves across the banks
  - Interleave banks according to block address

6. Reduce Miss Penalty: Critical Word First and Early Restart

- The processor usually needs one word of the block at a time
  - Do not wait for the full block to be loaded before restarting the processor
  - Critical word first: request the missed word first from memory and send it to the processor as soon as it arrives; let the processor continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
  - Early restart: as soon as the requested word of the block arrives, send it to the processor and let the processor continue execution
- The benefits of critical word first and early restart depend on:
  - Block size: generally useful only for large blocks
  - The likelihood of another access to the portion of the block that has not yet been fetched
    - Spatial locality problem: programs tend to want the next sequential word, so it is not clear if there is a benefit

7. Merging Write Buffer to Reduce Miss Penalty

- The write buffer allows the processor to continue while waiting for a write to memory
- If the buffer contains modified blocks, the addresses can be checked to see if the address of the new data matches the address of a valid write-buffer entry. If so, the new data are combined with that entry
- Increases the effective block size of writes (for a write-through cache) for writes to sequential words/bytes, since multiword writes are more efficient to memory

Merging Write Buffer

- When storing to a block that is already pending in the write buffer, update the write-buffer entry
- Reduces stalls due to a full write buffer

[Figure: write-buffer contents without write merging vs. with write merging]

8. Reducing Misses by Compiler Optimizations

- McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks, in software
- Instructions:
  - Reorder procedures in memory so as to reduce conflict misses
  - Profiling to look at conflicts (using tools they developed)
- Data:
  - Loop interchange: swap nested loops to access data in the order it is stored in memory (in sequential order)
  - Loop fusion: combine 2 independent loops that have the same looping and some variables in common
  - Blocking: improve temporal locality by accessing blocks of data repeatedly vs. going down whole columns or rows
    - Instead of accessing entire rows or columns, subdivide the matrices into blocks
    - Requires more memory accesses but improves the locality of the accesses

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words;
improved spatial locality.

Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

Perform different computations on the common data in two loops -> fuse the
two loops. 2 misses per access to a & c vs. one miss per access; improves
spatial locality.

Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }

- Two inner loops:
  - Read all N x N elements of z[]
  - Read N elements of 1 row of y[] repeatedly
  - Write N elements of 1 row of x[]
- Capacity misses are a function of N and the cache size:
  - 2N^3 + N^2 words accessed (assuming no conflicts; otherwise worse)
- Idea: compute on a B x B submatrix that fits in the cache

Snapshot of x, y, z when N=6, i=1

[Figure: white = not yet touched; light = older access; dark = newer access.
Before blocking.]

Blocking Example (Cont.)

/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B, N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B, N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }

- B is called the blocking factor
- Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
- Conflict misses, too?

The Age of Accesses to x, y, z when B=3

[Figure: note, in contrast to the previous figure, the smaller number of
elements accessed]

9. Reducing Miss Penalty or Miss Rate by Hardware Prefetching of Instructions & Data

- Prefetching relies on having extra memory bandwidth that can be used without penalty
- Instruction prefetching:
  - Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block
  - The requested block is placed in the instruction cache when it returns, and the prefetched block is placed into an instruction stream buffer
- Data prefetching:
  - The Pentium 4 can prefetch data into the L2 cache from up to 8 streams from 8 different 4KB pages
  - Prefetching is invoked on 2 successive L2 cache misses to a page, if the distance between those cache blocks is < 256 bytes

[Figure: performance improvement from hardware prefetching on the Intel
Pentium 4 for SPECint2000 (gap, mcf) and SPECfp2000 benchmarks (wupwise,
galgel, facerec, swim, applu, lucas, fam3d, mgrid, equake), with speedups
ranging from 1.16 to 1.97]

10. Reducing Miss Penalty or Miss Rate by Compiler-Controlled Prefetching of Data

- A prefetch instruction is inserted before the data is needed
- Data prefetch:
  - Register prefetch: load the data into a register (HP PA-RISC loads)
  - Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v.9)
  - Special prefetching instructions cannot cause faults;
    a form of speculative execution
- Issuing prefetch instructions takes time
  - Is the cost of the prefetch issues < the savings in reduced misses?
  - Wider superscalar issue reduces the difficulty of issue bandwidth
  - Combine with software pipelining and loop unrolling

Summary

[Figure: summary table of the advanced cache optimizations]

Memory Technology

- Performance metrics:
  - Latency is the concern of the cache
  - Bandwidth is the concern of multiprocessors and I/O
  - Access time
    - Time between a read request and when the desired word arrives
  - Cycle time
    - Minimum time between unrelated requests to memory
- DRAM is used for main memory, SRAM is used for caches

Memory Technology (Cont.)

- SRAM: static random-access memory
  - Requires low power to retain its bits, since there is no refresh
  - But requires 6 transistors/bit (vs. 1 transistor/bit for DRAM)
- DRAM
  - One transistor/bit
  - Must be re-written after being read
  - Must also be periodically refreshed
    - Every ~8 ms
    - Each row can be refreshed simultaneously
  - Address lines are multiplexed:
    - Upper half of the address: row access strobe (RAS)
    - Lower half of the address: column access strobe (CAS)

DRAM Technology

- The emphasis is on cost per bit and capacity
- Multiplex the address lines, cutting the # of address pins in half
  - Row access strobe (RAS) first, then column access strobe (CAS)
  - Memory is a 2D matrix; a row access moves a row into a buffer
  - A subsequent CAS selects a sub-row
- Only a single transistor is used to store a bit
  - Reading that bit can destroy the information
  - Refresh each bit periodically (e.g., every 8 milliseconds) by writing it back
  - Keep the refresh time to less than 5% of the total time
- DRAM capacity is 4 to 8 times that of SRAM
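The "less than 5% of the total time" budget is easy to check with arithmetic. A sketch, with illustrative numbers (2048 rows refreshed one per 100 ns within an 8 ms interval; these parameters are assumptions, not from the slides):

```c
/* Fraction of time spent refreshing: (rows * time per row) / interval. */
double refresh_overhead(unsigned rows, double row_refresh_ns,
                        double interval_ms) {
    return (rows * row_refresh_ns) / (interval_ms * 1e6);
}
```

For 2048 rows at 100 ns each, every 8 ms: 204,800 ns / 8,000,000 ns = 2.56%, comfortably under the 5% budget.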

DRAM Logical Organization (4 Mbit)

[Figure: 11 multiplexed address bits (A0-A10) drive a 2,048 x 2,048 memory
array of storage cells; a word line is read through sense amps & I/O, and a
column decoder selects within the row. The array is square: the number of
bits per RAS/CAS is the square root of the capacity.]

DRAM Technology (cont.)

- DIMM: dual inline memory module
  - DRAM chips are commonly sold on small boards called DIMMs
  - DIMMs typically contain 4 to 16 DRAMs
- Slowdown in DRAM capacity growth:
  - Four times the capacity every three years, for more than 20 years
  - New chips have only doubled capacity every two years, since 1998
- DRAM performance is growing at a slower rate:
  - RAS (related to latency): 5% per year
  - CAS (related to bandwidth): 10%+ per year

RAS Improvement

[Figure: improvement in DRAM row access time across generations]

Quest for DRAM Performance

1. Fast page mode
   - Add timing signals that allow repeated accesses to the row buffer without another row access time
   - Such a buffer comes naturally, as each array will buffer 1024 to 2048 bits for each access
2. Synchronous DRAM (SDRAM)
   - Add a clock signal to the DRAM interface, so that repeated transfers do not bear the overhead of synchronizing with the DRAM controller
3. Double data rate (DDR SDRAM)
   - Transfer data on both the rising edge and the falling edge of the DRAM clock signal, doubling the peak data rate
   - DDR2 lowers power by dropping the voltage from 2.5 to 1.8 volts + offers higher clock rates: up to 400 MHz
   - DDR3 drops to 1.5 volts + higher clock rates: up to 800 MHz
   - DDR4 drops to 1.2 volts, clock rates up to 1600 MHz

These improve bandwidth, not latency.

DRAM name based on peak chip transfers/sec
DIMM name based on peak DIMM MBytes/sec

Standard | Clock (MHz) | M transfers/s (x2) | DRAM name | MBytes/s/DIMM (x8) | DIMM name
DDR      | 133         | 266                | DDR266    | 2128               | PC2100
DDR      | 150         | 300                | DDR300    | 2400               | PC2400
DDR      | 200         | 400                | DDR400    | 3200               | PC3200
DDR2     | 266         | 533                | DDR2-533  | 4264               | PC4300
DDR2     | 333         | 667                | DDR2-667  | 5336               | PC5300
DDR2     | 400         | 800                | DDR2-800  | 6400               | PC6400
DDR3     | 533         | 1066               | DDR3-1066 | 8528               | PC8500
DDR3     | 666         | 1333               | DDR3-1333 | 10664              | PC10700
DDR3     | 800         | 1600               | DDR3-1600 | 12800              | PC12800

(Fastest for sale 4/06: $125/GB)
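The "x2" and "x8" notes in the table encode the naming arithmetic: transfers/s is twice the clock (double data rate), and an 8-byte-wide DIMM moves 8 bytes per transfer. A minimal sketch (exact for rows whose transfer rate is exactly 2x the clock, such as DDR400 and DDR2-800; some rows in the table round the clock):

```c
/* DDR: two transfers per clock edge pair. */
unsigned mtransfers(unsigned clock_mhz) {
    return 2 * clock_mhz;
}

/* 8-byte (64-bit) DIMM: 8 bytes per transfer. */
unsigned dimm_mbytes_per_s(unsigned clock_mhz) {
    return 8 * mtransfers(clock_mhz);
}
```

A 200 MHz clock gives 400 M transfers/s (DDR400) and 3200 MB/s (PC3200); 400 MHz gives DDR2-800 / PC6400.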

DRAM Performance

[Figure: DRAM performance trends]

Graphics Memory

- GDDR5 is graphics memory based on DDR3
- Graphics memory:
  - Achieves 2-5x the bandwidth per DRAM vs. DDR3
    - Wider interfaces (32 vs. 16 bit)
    - Higher clock rate
      - Possible because the chips are attached via soldering instead of socketed DIMM modules

Memory Power Consumption

[Figure: memory power consumption]

SRAM Technology

- Caches use SRAM: static random-access memory
- SRAM uses six transistors per bit to prevent the information from being disturbed when read
  - No need to refresh
  - SRAM needs only minimal power to retain its charge in standby mode -> good for embedded applications
  - No difference between access time and cycle time for SRAM
- The emphasis is on speed and capacity
  - SRAM address lines are not multiplexed
- SRAM speed is 8 to 16x that of DRAM

ROM and Flash

- Embedded processor memory
- Read-only memory (ROM):
  - Programmed at the time of manufacture
  - Only a single transistor per bit to represent 1 or 0
  - Used for the embedded program and for constants
  - Non-volatile and indestructible
- Flash memory:
  - Must be erased (in blocks) before being overwritten
  - Non-volatile, but allows the memory to be modified
  - Reads at almost DRAM speeds, but writes are 10 to 100 times slower
  - DRAM capacity per chip and MB per dollar is about 4 to 8 times greater than flash
  - Cheaper than SDRAM, more expensive than disk
  - Slower than SRAM, faster than disk

Memory Dependability

- Memory is susceptible to cosmic rays
- Soft errors: dynamic errors
  - Detected and fixed by error-correcting codes (ECC)
- Hard errors: permanent errors
  - Use spare rows to replace defective rows
- Chipkill: a RAID-like error recovery technique

Virtual Memory?

- The limits of physical addressing:
  - All programs share one physical address space
  - Machine-language programs must be aware of the machine organization
  - No way to prevent a program from accessing any machine resource
- Recall: many processes use only a small portion of their address space
- Virtual memory divides physical memory into blocks (called pages or segments) and allocates them to different processes
- With virtual memory, the processor produces virtual addresses that are translated by a combination of HW and SW to physical addresses (called memory mapping or address translation)

Virtual Memory: Add a Layer of Indirection

[Figure: the CPU issues virtual addresses (A0-A31, D0-D31); address
translation maps them to the physical addresses (A0-A31, D0-D31) seen by
memory]

- User programs run in a standardized virtual address space
- Address translation hardware, managed by the operating system (OS), maps virtual addresses to physical memory
- Hardware supports modern OS features: protection, translation, sharing

Virtual Memory

[Figure: mapping from virtual pages to physical memory by a page table]

Virtual Memory (cont.)

- Permits applications to grow bigger than the main memory size
- Helps with multiple-process management
  - Each process gets its own chunk of memory
  - Permits protection of one process's chunks from another
  - Mapping of multiple chunks onto shared physical memory
  - Mapping also facilitates relocation (a program can run in any memory location, and can be moved during execution)
  - The application and CPU run in virtual space (logical memory, 0 to max)
  - The mapping onto physical space is invisible to the application
- Cache vs. virtual memory:
  - A block becomes a page or segment
  - A miss becomes a page fault or address fault

3 Advantages of VM

- Translation:
  - A program can be given a consistent view of memory, even though physical memory is scrambled
  - Makes multithreading reasonable (now used a lot!)
  - Only the most important part of the program (the working set) must be in physical memory
  - Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later
- Protection:
  - Different threads (or processes) are protected from each other
  - Different pages can be given special behavior (read-only, invisible to user programs, etc.)
  - Kernel data is protected from user programs
  - Very important for protection from malicious programs
- Sharing:
  - Can map the same physical page to multiple users (shared memory)

Protection via Virtual Memory

- Keeps processes in their own memory space
- Role of the architecture:
  - Provide user mode and supervisor mode
  - Protect certain aspects of CPU state
  - Provide mechanisms for switching between user mode and supervisor mode
  - Provide mechanisms to limit memory accesses
  - Provide a TLB to translate addresses

Virtual Memory: Page Tables

- Page tables encode virtual address spaces
  - A virtual address space is divided into blocks of memory called pages
  - A page table is indexed by the virtual address
  - A valid page table entry encodes the physical memory frame address for the page
- The OS manages the page table for each ASID (address-space ID)
- A machine usually supports pages of a few sizes (e.g., the MIPS R4000)

[Figure: virtual addresses mapped through a page table to frames in the
physical memory space]

Details of Page Table

A virtual address is split into a virtual page number and a 12-bit page offset.
The Page Table Base Register plus the virtual page number index into the page table, which is itself located in physical memory.
Each entry holds a valid bit (V), access rights, and the physical page number; the physical page number concatenated with the 12-bit offset forms the physical address.

The page table maps virtual page numbers to physical frames (PTE = Page Table Entry).
Virtual memory => treat main memory as a cache for disk.

[Figure: virtual-to-physical translation through the page table]
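The translation described above can be sketched in a few lines of Python; the page-table contents here are hypothetical, but the 12-bit offset split matches the slide:

```python
# Minimal sketch of page-table address translation with 4 KB pages
# (12-bit offset). The page-table contents are made up for illustration.

PAGE_OFFSET_BITS = 12
PAGE_SIZE = 1 << PAGE_OFFSET_BITS

# page table: virtual page number -> (valid bit, physical frame number)
page_table = {0: (True, 7), 1: (True, 3), 2: (False, None)}

def translate(vaddr):
    vpn = vaddr >> PAGE_OFFSET_BITS       # virtual page number indexes the table
    offset = vaddr & (PAGE_SIZE - 1)      # low 12 bits pass through untranslated
    valid, pfn = page_table.get(vpn, (False, None))
    if not valid:
        raise RuntimeError("page fault")  # V=0: the OS would take over here
    return (pfn << PAGE_OFFSET_BITS) | offset

print(hex(translate(0x1ABC)))  # VPN 1 -> frame 3 => 0x3abc
```

Note that only the page number is translated; the offset is copied straight into the physical address.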

Page Table Entry (PTE)?

What is in a Page Table Entry (or PTE)?
A pointer to the next-level page table or to the actual page
Permission bits: valid, read-only, read-write, write-only

Example: Intel x86 architecture PTE:
Address has the same format as the previous slide (10, 10, 12-bit offset)
Intermediate page tables are called Directories

Bit layout:
31-12: Page Frame Number (Physical Page Number)
11-9:  Free (OS)
8: 0    7: L    6: D    5: A    4: PCD    3: PWT    2: U    1: W    0: P

P:   Present (same as the valid bit in other architectures)
W:   Writeable
U:   User accessible
PWT: Page write transparent: external cache write-through
PCD: Page cache disabled (page cannot be cached)
A:   Accessed: page has been accessed recently
D:   Dirty (PTE only): page has been modified recently
L:   L=1 => 4 MB page (directory only); the bottom 22 bits of the virtual address serve as the offset
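Decoding the field positions above can be sketched with a few bit masks (the example PTE value is hypothetical):

```python
# Sketch of decoding the x86 PTE fields listed above: P, W, U, PWT, PCD,
# A, D, L in bits 0-7, and the page frame number in bits 31-12.

def decode_pte(pte):
    return {
        "present":   bool(pte & 0x001),  # P, bit 0
        "writeable": bool(pte & 0x002),  # W, bit 1
        "user":      bool(pte & 0x004),  # U, bit 2
        "pwt":       bool(pte & 0x008),  # page write transparent, bit 3
        "pcd":       bool(pte & 0x010),  # page cache disabled, bit 4
        "accessed":  bool(pte & 0x020),  # A, bit 5
        "dirty":     bool(pte & 0x040),  # D, bit 6
        "large":     bool(pte & 0x080),  # L, 4 MB page (directory only)
        "frame":     pte >> 12,          # physical page number
    }

fields = decode_pte(0x3063)  # frame 3; P, W, A, D set
```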

Cache vs. Virtual Memory

Replacement
A cache miss is handled by hardware
A page fault is usually handled by the OS

Addresses
The virtual memory space is determined by the address size of the CPU
The cache space is independent of the CPU address size

Lower-level memory
For caches, the main memory is not shared by something else
For virtual memory, most of the disk contains the file system
The file system is addressed differently, usually in I/O space
The virtual memory lower level is usually called SWAP space

The Same 4 Questions for Virtual Memory

Block placement
Choice: lower miss rates with complex placement, or vice versa
The miss penalty is huge, so choose a low miss rate: place anywhere
Similar to the fully associative cache model

Block identification: both use an additional data structure
Fixed-size pages: use a page table
Variable-sized segments: use a segment table

Block replacement: LRU is the best
However, true LRU is a bit complex, so use an approximation
The page table contains a use tag, and on access the use tag is set
The OS checks them every so often, records what it sees in a data structure, then clears them all
On a miss, the OS decides which page has been used the least and replaces that one

Write strategy: always write-back
Due to the access time of the disk, write-through is silly
Use a dirty bit to write back only pages that have been modified
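The use-bit approximation to LRU described above can be sketched as follows; the resident pages and their use bits are hypothetical:

```python
# Sketch of the OS's LRU approximation: periodically record and clear
# each page's use bit; on a fault, evict the least-recently-seen page.

use_bit = {"A": 1, "B": 0, "C": 1}   # hypothetical resident pages
history = {p: 0 for p in use_bit}    # OS-side record of what was seen

def scan_and_clear():
    for page, bit in use_bit.items():
        history[page] += bit         # record which pages were referenced
        use_bit[page] = 0            # then clear all the use bits

def pick_victim():
    # approximate LRU: evict the page referenced least often in the scans
    return min(history, key=history.get)

scan_and_clear()
print(pick_victim())  # "B" was not referenced, so it is the victim
```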

Techniques for Fast Address Translation

The page table is kept in main memory (kernel memory)
Each process has a page table

Every data/instruction access requires two memory accesses
One for the page table and one for the data/instruction
Can be solved by the use of a special fast-lookup hardware cache called associative registers or translation lookaside buffers (TLBs)

If locality applies, then cache the recent translation
TLB = translation lookaside buffer
TLB entry: virtual page no., physical page no., protection bit, use bit, dirty bit

Translation Lookaside Buffers

Translation Lookaside Buffers (TLB):
A cache on translations
Fully associative, set associative, or direct mapped

TLBs are:
Small: typically not more than 128-256 entries
Fully associative

[Figure: the CPU sends a VA to the TLB; on a hit the PA goes to the cache; on a miss the translation unit walks the page table in main memory]

The TLB Caches Page Table Entries

The TLB caches page table entries (for the current ASID).
Physical and virtual pages must be the same size!
V=0 pages either reside on disk or have not yet been allocated; the OS handles a V=0 page fault.

[Figure: a virtual address (page, offset) is translated through the TLB, which caches page-to-frame entries from the page table, yielding a physical address (frame, offset)]

Caching Applied to Address Translation

[Figure: the CPU issues a virtual address; if the translation is cached in the TLB, the physical address is produced directly; otherwise the MMU translates it via the page table. Data reads and writes to physical memory are untranslated.]
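The TLB-first flow in the diagram can be sketched in Python; the page-table and TLB contents here are hypothetical:

```python
# Sketch of TLB-first translation: check the TLB, and on a miss walk
# the page table and refill the TLB so the next access hits.

PAGE_BITS = 12
page_table = {0: 5, 2: 2}            # hypothetical VPN -> frame mapping
tlb = {}                             # small fully associative cache

def translate(vaddr):
    vpn = vaddr >> PAGE_BITS
    off = vaddr & ((1 << PAGE_BITS) - 1)
    if vpn in tlb:                   # TLB hit: no page-table access needed
        frame = tlb[vpn]
    else:                            # TLB miss: walk the page table
        if vpn not in page_table:
            raise RuntimeError("page fault")  # V=0: the OS takes over
        frame = page_table[vpn]
        tlb[vpn] = frame             # refill so the next access hits
    return (frame << PAGE_BITS) | off

translate(0x2004)   # first access: TLB miss, then refill
translate(0x2008)   # second access to the same page: TLB hit
```

Because translations are cached, the second access to a page avoids the extra memory access to the page table.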

Virtual Memory and Virtual Machines

Virtual Machines

Supports isolation and security
Sharing a computer among many unrelated users
Enabled by the raw speed of processors, making the overhead more acceptable
Allows different ISAs and operating systems to be presented to user programs

System Virtual Machines
SVM software is called a virtual machine monitor or hypervisor
Individual virtual machines that run under the monitor are called guest VMs

Each guest OS maintains its own set of page tables
The VMM adds a level of memory between physical and virtual memory called real memory
The VMM maintains a shadow page table that maps guest virtual addresses to physical addresses
Requires the VMM to detect the guest's changes to its own page table
Occurs naturally if accessing the page table pointer is a privileged operation
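The shadow page table idea can be sketched as composing two mappings; all the mappings below are hypothetical:

```python
# Sketch of a shadow page table: the VMM composes the guest's
# virtual -> real mapping with its own real -> physical mapping, so
# hardware can translate guest virtual addresses in one step.

guest_pt = {0: 4, 1: 2}   # guest virtual page -> "real" page (guest's view)
vmm_pt   = {4: 9, 2: 7}   # real page -> actual physical frame (VMM's view)

def build_shadow(guest, vmm):
    # keep only guest pages whose real page is currently backed by a frame
    return {vpn: vmm[real] for vpn, real in guest.items() if real in vmm}

shadow = build_shadow(guest_pt, vmm_pt)   # guest virtual -> physical
```

Whenever the guest edits guest_pt, the VMM must rebuild the affected shadow entries, which is why trapping page-table-pointer accesses matters.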

Virtual Memory and Virtual Machines

Impact of VMs on Virtual Memory
