Topic 8
Introduction
The five classic components of a computer: input, output, memory, datapath, and control (the processor comprises the datapath and the control).
Technology Trends
DRAM improvement ratios: roughly 4000:1 in capacity but only about 2.5:1 in latency. Memory performance improves about 7% per year, so the processor-memory gap (latency) grows about 50% per year. From 2005 onwards there has been little change in processor performance (per core).
[Figure: processor vs. DRAM performance improvement over time.]
Memory Hierarchy
Levels of the Memory Hierarchy (Typical Server)
CPU registers: ~1000 bytes, 300 ps (0.30 ns) access time
Cache: 64 KB / 256 KB / 2-4 MB, 1 ns / 3-10 ns / 10-20 ns access time
Memory: pages (inclusion property)
Disk storage: files
Capacity increases and speed decreases toward the lower levels of the hierarchy.
Memory Hierarchy
Levels of the Memory Hierarchy (Personal Mobile Device)
CPU registers: ~500 bytes, 500 ps (0.50 ns) access time
Cache: 64 KB / 256 KB, 2 ns / 10-20 ns access time
Capacity increases toward the lower levels; speed increases toward the upper levels.
ABCs of Caches
Cache: mainly means the first level of the memory hierarchy encountered once the address leaves the CPU; the term is also applied whenever buffering is employed to reuse commonly occurring items (file caches, name caches, and so on).
Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
Two different types of locality:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
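A minimal C sketch (illustrative only, not from the slides) of the two kinds of locality: the sequential walk over a[] exhibits spatial locality, while the repeated use of sum and of the loop code itself exhibits temporal locality.

#include <stdio.h>

int main(void) {
    static int a[1024];
    long sum = 0;

    /* Spatial locality: consecutive elements of a[] share cache blocks,
       so a miss on a[i] also brings in a[i+1], a[i+2], ...            */
    for (int i = 0; i < 1024; i++)
        sum += a[i];

    /* Temporal locality: sum and the loop instructions are referenced
       on every iteration and stay resident in the cache.              */
    printf("%ld\n", sum);
    return 0;
}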
Guideline: for a given implementation technology and power budget, smaller hardware can be made faster.
Hit Time << Miss Penalty (e.g., 1 clock cycle vs. 40 clock cycles)
[Figure: the cache (holding block X) sits between the processor and main memory (holding block Y); data moves to and from the processor through the cache.]
Cache Measures
CPU execution time incorporating cache performance:
CPU execution time = (CPU clock cycles + Memory stall cycles) × Clock cycle time
Memory stall cycles: the number of cycles during which the CPU is stalled waiting for a memory access.
Memory stall clock cycles = Number of misses × Miss penalty
  = IC × (Misses / Instruction) × Miss penalty
  = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty
  = IC × Reads per instruction × Read miss rate × Read miss penalty
    + IC × Writes per instruction × Write miss rate × Write miss penalty
Memory accesses consist of fetching instructions and reading/writing data.
Example
Assume we have a computer where the CPI is 1.0 when all memory accesses hit the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all accesses were cache hits?
Answer:
(A) If all accesses hit in the cache, CPI = 1.0 and there are no memory stalls, so
CPU time(A) = (IC × CPI + 0) × Clock cycle time = IC × Clock cycle time
(B) With a 2% miss rate and CPI = 1.0, we need the memory stall cycles:
Memory stalls = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty = IC × (1 + 50%) × 2% × 25 = 0.75 × IC
CPU time(B) = (IC + 0.75 × IC) × Clock cycle time = 1.75 × IC × Clock cycle time
The performance ratio is the inverse ratio of the CPU execution times: CPU time(B) / CPU time(A) = 1.75.
The computer with no cache misses is 1.75 times faster.
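A small C check of this example (a sketch only; the 1.5 accesses per instruction, 2% miss rate, and 25-cycle penalty come from the problem statement above):

#include <stdio.h>

int main(void) {
    double cpi = 1.0;                /* CPI when every access hits      */
    double accesses_per_instr = 1.5; /* 1 instruction fetch + 0.5 data  */
    double miss_rate = 0.02;
    double miss_penalty = 25.0;      /* clock cycles                    */

    /* Memory stall cycles per instruction = 1.5 * 0.02 * 25 = 0.75 */
    double stalls = accesses_per_instr * miss_rate * miss_penalty;

    /* IC and the clock cycle time cancel out of the ratio. */
    double speedup = (cpi + stalls) / cpi;
    printf("speedup with a perfect cache = %.2f\n", speedup);  /* 1.75 */
    return 0;
}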
Direct mapped: a block has only one place it can appear in the cache. The mapping is usually (Block address) MOD (Number of blocks in cache).
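A one-function C illustration of the MOD mapping (the 8-block cache size is a made-up example value):

#include <stdio.h>

/* Direct-mapped placement: (Block address) MOD (Number of blocks in cache). */
static unsigned cache_block(unsigned block_address, unsigned num_blocks) {
    return block_address % num_blocks;
}

int main(void) {
    /* With 8 cache blocks, memory block 12 can only go to cache block 4. */
    printf("%u\n", cache_block(12, 8));
    return 0;
}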
The block offset selects the desired data from the block, the index field selects the set, and the tag field is compared against the CPU address for a hit.
Use the cache index to select the cache set.
Check the tag on each block in that set; there is no need to check the index or block offset.
A valid bit is added to the tag to indicate whether or not the entry contains a valid address.
Select the desired bytes using the block offset.
Increasing associativity shrinks the index and expands the tag.
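A hedged C sketch of splitting a byte address into block offset, index, and tag. The field widths (64-byte blocks, 128 sets) are assumptions for illustration, not parameters of any particular machine:

#include <stdio.h>
#include <stdint.h>

#define BLOCK_OFFSET_BITS 6   /* 64-byte blocks (assumption) */
#define INDEX_BITS        7   /* 128 sets       (assumption) */

int main(void) {
    uint32_t addr = 0x12345678;

    uint32_t offset = addr & ((1u << BLOCK_OFFSET_BITS) - 1);
    uint32_t index  = (addr >> BLOCK_OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (BLOCK_OFFSET_BITS + INDEX_BITS);

    /* The index selects the set, the tag is compared against the stored
       tags of the blocks in that set, and the offset picks the bytes.   */
    printf("tag=0x%x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}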
[Figure: two-way set-associative cache. The cache index selects a set; the address tag is compared in parallel against the cache tags of both blocks in the set (each guarded by a valid bit); the comparator outputs are ORed to form the hit signal, and a mux (Sel0/Sel1) selects the matching cache block's data.]
Block replacement strategies:
Random: the victim block is randomly selected (to spread allocation uniformly).
LRU: the Least Recently Used block is removed (relies on locality of reference).
FIFO: First In, First Out (the oldest block is removed).
Data cache misses per 1000 instructions for various replacement strategies:

            2-way                   4-way                   8-way
Size        LRU    Random  FIFO     LRU    Random  FIFO     LRU    Random  FIFO
16 KB       114.1  117.3   115.5    111.7  115.1   113.3    109.0  111.8   110.4
64 KB       103.4  104.3   103.9    102.4  102.3   103.1    99.7   100.5   100.3
256 KB      92.2   92.1    92.5     92.1   92.1    92.5     92.1   92.1    92.5
There is little difference between LRU and random for the largest cache size, with LRU outperforming the others for smaller caches. FIFO generally outperforms random for the smaller cache sizes.
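A sketch of one common way to implement LRU for a single set (a hypothetical 4-way set tracked with small age counters; real hardware often uses approximations):

#include <stdio.h>

#define WAYS 4

/* age[w] = accesses since the block in way w was last used. */
static unsigned age[WAYS];

/* Call on every access to the set: pass the way that hit, or -1 on a
   miss. Returns the way used on a hit, or the LRU victim on a miss.  */
static int touch(int hit_way) {
    int way = hit_way;
    if (way < 0) {                               /* miss: pick the oldest */
        way = 0;
        for (int w = 1; w < WAYS; w++)
            if (age[w] > age[way]) way = w;
    }
    for (int w = 0; w < WAYS; w++) age[w]++;     /* everyone ages...      */
    age[way] = 0;                                /* ...except the MRU way */
    return way;
}

int main(void) {
    touch(-1); touch(-1); touch(-1); touch(-1);  /* fill ways 0..3 */
    touch(0);                                    /* reuse way 0    */
    printf("next victim: way %d\n", touch(-1));  /* way 1 is LRU   */
    return 0;
}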
[Figure: the processor writes into the cache and into a write buffer that sits between the cache and DRAM.]
A write buffer is needed between the cache and memory:
Processor: writes data into the cache and the write buffer.
Memory controller: writes the contents of the buffer to memory.
The write buffer is just a FIFO; a typical number of entries is 4.
Write allocate: the block is allocated on a write miss, followed by the write-hit actions.
No-write allocate: write misses do not affect the cache; the block is modified only in the lower-level memory.
With no-write allocate, blocks stay out of the cache until the program tries to read them; with write allocate, even blocks that are only written will be in the cache.
Write through with write allocate:
on hits it writes to the cache and to main memory;
on misses it updates the block in main memory and brings the block into the cache.
Bringing the block into the cache on a miss does not make much sense in this combination, because the next hit to this block will generate a write to main memory anyway (per the write-through policy).
Write through with no-write allocate:
on hits it writes to the cache and to main memory;
on misses it updates the block in main memory without bringing the block into the cache.
Subsequent writes to the block will update main memory anyway (write-through policy), so some time is saved by not bringing the block into the cache on a miss, since doing so appears useless.
Write back with write allocate:
on hits it writes to the cache, setting the dirty bit for the block; main memory is not updated;
on misses it updates the block in main memory and brings the block into the cache.
Subsequent writes to the same block, if the block originally caused a miss, will now hit in the cache and just set the dirty bit. That eliminates extra memory accesses and results in very efficient execution compared with the write-through-with-write-allocate combination.
Write back with no-write allocate:
on hits it writes to the cache, setting the dirty bit for the block; main memory is not updated;
on misses it updates the block in main memory without bringing the block into the cache.
Subsequent writes to the same block, if the block originally caused a miss, will keep missing and result in very inefficient execution.
Example: assume a cache that starts empty and the following sequence of memory operations:
Write Mem[100]; Write Mem[100]; Read Mem[200]; Write Mem[200]; Write Mem[100]
With no-write allocate: 4 misses and 1 hit.
With write allocate: Write Mem[100] (write miss); Write Mem[100] (write hit); Read Mem[200] (read miss); Write Mem[200] (write hit); Write Mem[100] (write hit), i.e. 2 misses and 3 hits.
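A toy simulation of this trace (an assumption-laden sketch: the cache is modeled as a never-evicting set of allocated block addresses, which is enough to reproduce the 4-miss/1-hit and 2-miss/3-hit counts):

#include <stdio.h>

#define MAX 16

static unsigned cached[MAX];
static int ncached;

static int lookup(unsigned blk) {
    for (int i = 0; i < ncached; i++)
        if (cached[i] == blk) return 1;
    return 0;
}

static void insert(unsigned blk) {
    if (ncached < MAX) cached[ncached++] = blk;
}

/* ops[i]: 'R' = read, 'W' = write. Returns the number of misses. */
static int run(const char *ops, const unsigned *addr, int n, int write_allocate) {
    int misses = 0;
    ncached = 0;
    for (int i = 0; i < n; i++) {
        if (lookup(addr[i])) continue;           /* hit                   */
        misses++;                                /* miss                  */
        if (ops[i] == 'R' || write_allocate)     /* reads always allocate */
            insert(addr[i]);
    }
    return misses;
}

int main(void) {
    const char ops[]      = { 'W', 'W', 'R', 'W', 'W' };
    const unsigned addr[] = { 100, 100, 200, 200, 100 };
    printf("no-write allocate: %d misses\n", run(ops, addr, 5, 0));  /* 4 */
    printf("write allocate:    %d misses\n", run(ops, addr, 5, 1));  /* 2 */
    return 0;
}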
Calculate the percentage of the bus bandwidth used on average in the two cases: write back and write through.
Write Back:
Read miss:  10^7 × 0.1 × 0.75 × 2 = 0.15 × 10^7
Write hit:  10^7 × 0.9 × 0.25 × 1 = 0.225 × 10^7
Write miss: 10^7 × 0.1 × 0.25 × (2 + 1) = 0.075 × 10^7
Cache Performance
Example: Split Cache vs. Unified Cache. Which has the better average memory access time: a 16-KB instruction cache with a 16-KB data cache (split cache), or a 32-KB unified cache?

Miss rates:
Size    Instruction Cache   Data Cache   Unified Cache
16 KB   0.4%                11.4%        -
32 KB   -                   -            3.18%

Assume:
A hit takes 1 clock cycle and the miss penalty is 100 cycles.
A load or store takes 1 extra clock cycle on a unified cache, since there is only one cache port.
36% of the instructions are data transfer instructions; about 74% of the memory accesses are instruction references.
Answer:
Average memory access time (split)
= % instructions × (Hit time + Instruction miss rate × Miss penalty) + % data × (Hit time + Data miss rate × Miss penalty)
= 74% × (1 + 0.4% × 100) + 26% × (1 + 11.4% × 100) = 4.24
Average memory access time (unified)
= 74% × (1 + 3.18% × 100) + 26% × (1 + 1 + 3.18% × 100) = 4.44
The split cache has the better average memory access time.
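A small C check of the arithmetic (a sketch; it uses the rounded miss rates from the table above, so the split result lands near, not exactly on, the quoted 4.24):

#include <stdio.h>

int main(void) {
    double hit_time = 1.0, miss_penalty = 100.0;
    double instr_frac = 0.74, data_frac = 0.26;   /* fractions of memory accesses */

    /* Split cache: separate instruction and data miss rates. */
    double split = instr_frac * (hit_time + 0.004 * miss_penalty)
                 + data_frac  * (hit_time + 0.114 * miss_penalty);

    /* Unified cache: one miss rate, plus 1 extra cycle for loads/stores
       because the single port is busy with instruction fetches.         */
    double unified = instr_frac * (hit_time + 0.0318 * miss_penalty)
                   + data_frac  * (hit_time + 1.0 + 0.0318 * miss_penalty);

    printf("split   AMAT = %.2f cycles\n", split);   /* ~4.26 with rounded rates */
    printf("unified AMAT = %.2f cycles\n", unified); /* 4.44 */
    return 0;
}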
Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
CPU time = IC × (CPI_execution + (Memory accesses / Instruction) × Miss Rate × Miss Penalty) × Clock Cycle Time
Design of L2 Cache
Size
Since everything in L1 cache is likely to be in L2 cache, L2 cache should be much bigger than L1
Whether data in L1 is in L2
Novice approach: design L1 and L2 independently.
Multilevel inclusion: L1 data are always present in L2.
Advantage: consistency between I/O and the cache is easy to check (check L2 only).
Drawback: L2 must invalidate all L1 blocks that map onto a 2nd-level block being replaced, giving a slightly higher 1st-level miss rate (e.g., Intel Pentium 4: 64-byte blocks in L1 and 128-byte blocks in L2).
Giving priority to read misses over writes:
Problem: write through with write buffers creates RAW conflicts between main memory reads on cache misses and buffered writes.
If we simply wait for the write buffer to empty, the read miss penalty may increase (by 50% on the old MIPS 1000).
Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue.
Write back: suppose a read miss will replace a dirty block.
Normal: write the dirty block to memory, then do the read.
Instead: copy the dirty block to a write buffer, do the read, and then do the write.
The CPU stalls less since it restarts as soon as the read is done.
Write merging: the addresses in the write buffer are checked to see whether the address of the new data matches the address of a valid write buffer entry. If so, the new data are combined with that entry.
[Figure: a write buffer with 4 entries, each holding four 64-bit words; without merging (left), four sequential writes occupy four entries; with merging (right), the four writes are merged into a single entry.] Writing multiple words at the same time is faster than writing one word multiple times.
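A C sketch of the merging check (assumptions: a 4-entry buffer, each entry covering four consecutive 64-bit words; a new write whose block already has a valid entry is combined with it instead of taking a new slot):

#include <stdio.h>
#include <stdint.h>

#define ENTRIES 4
#define WORDS   4                    /* 64-bit words per entry         */

typedef struct {
    int      valid;
    uint64_t base;                   /* address of word 0 of the entry */
    uint64_t data[WORDS];
    int      word_valid[WORDS];
} WriteBufEntry;

static WriteBufEntry buf[ENTRIES];

/* Returns 1 if the write was merged or buffered, 0 if the buffer is full. */
static int buffer_write(uint64_t addr, uint64_t value) {
    uint64_t base = addr & ~(uint64_t)(WORDS * 8 - 1);   /* align to entry */
    int word = (int)((addr - base) / 8);

    for (int i = 0; i < ENTRIES; i++)                    /* try to merge   */
        if (buf[i].valid && buf[i].base == base) {
            buf[i].data[word] = value;
            buf[i].word_valid[word] = 1;
            return 1;
        }
    for (int i = 0; i < ENTRIES; i++)                    /* else allocate  */
        if (!buf[i].valid) {
            buf[i].valid = 1;
            buf[i].base = base;
            buf[i].data[word] = value;
            buf[i].word_valid[word] = 1;
            return 1;
        }
    return 0;                                            /* full: CPU must stall */
}

int main(void) {
    /* Four sequential 64-bit stores land in one entry instead of four. */
    for (uint64_t a = 0x100; a < 0x120; a += 8)
        buffer_write(a, a);
    printf("entry 0: valid=%d base=0x%llx\n",
           buf[0].valid, (unsigned long long)buf[0].base);
    return 0;
}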
Victim caches
Idea of recycling: remember what was most recently discarded on a cache miss, in case it is needed again, rather than simply discarding it or swapping it into L2.
Victim cache: a small, fully associative cache between a cache and its refill path.
It contains only blocks that were discarded from the cache because of a miss; the victims are checked on a miss before going to the next lower-level memory.
Victim caches of 1 to 5 entries are effective at reducing misses, especially for small, direct-mapped data caches. AMD Athlon: 8 entries.
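A C sketch of the victim-cache lookup path (a hypothetical 4-entry, fully associative buffer holding the addresses of recently evicted blocks, with simple FIFO replacement):

#include <stdio.h>
#include <stdint.h>

#define VICTIMS 4

static uint64_t victim_addr[VICTIMS];
static int      victim_valid[VICTIMS];
static int      next_slot;                 /* FIFO replacement within the buffer */

/* Record a block just evicted from the main cache. */
static void victim_insert(uint64_t block_addr) {
    victim_addr[next_slot] = block_addr;
    victim_valid[next_slot] = 1;
    next_slot = (next_slot + 1) % VICTIMS;
}

/* On a main-cache miss: returns 1 if the block is in the victim cache
   (cheap refill), 0 if we must go to the next lower-level memory.     */
static int victim_lookup(uint64_t block_addr) {
    for (int i = 0; i < VICTIMS; i++)
        if (victim_valid[i] && victim_addr[i] == block_addr) return 1;
    return 0;
}

int main(void) {
    victim_insert(0x40);                   /* block 0x40 was just discarded    */
    printf("%d\n", victim_lookup(0x40));   /* 1: recovered without going lower */
    printf("%d\n", victim_lookup(0x80));   /* 0: go to the next level          */
    return 0;
}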
3 Cs of Cache Misses: compulsory, capacity, and conflict
3Cs Absolute Miss Rate (SPEC92)
2:1 Cache Rule: the miss rate of a 1-way associative cache of size X is about the same as the miss rate of a 2-way associative cache of size X/2.
[Figure: absolute miss rate per type (compulsory, capacity, conflict) versus cache size (1 KB to 128 KB) for 1-way through 8-way associativity.]
[Figure: relative miss rate per type (compulsory, capacity, conflict), as a percentage of all misses, versus cache size (1 KB to 128 KB) for 1-way through 8-way associativity.]
Five techniques for reducing miss rate:
1. Larger block size
2. Larger caches
3. Higher associativity
4. Way prediction and pseudo-associative caches
5. Compiler optimizations
Using the principle of locality (spatial locality): the larger the block, the greater the chance that parts of it will be used again.
[Figure: miss rate versus block size (16 to 256 bytes) for several cache sizes.]
However, the number of blocks is reduced for a cache of the same size, and larger blocks increase the miss penalty; they may also increase conflict misses and even capacity misses if the cache is small.
Usually, high memory latency and high bandwidth encourage a large block size.
[Figure: miss rate per type versus cache size; capacity misses shrink as the cache grows.]
Increasing the capacity of the cache reduces capacity misses (Figures 5.14 and 5.15).
Drawbacks: possibly longer hit time and higher cost.
Trend: larger L2 or L3 off-chip caches.
Beware: execution time is the only final measure! Will the clock cycle time increase as a result of having a more complicated cache? Hill [1988] suggested that the hit time for 2-way vs. 1-way associativity is about +10% for an external cache and +2% for an internal cache.
On a miss, a second cache entry is checked before going to the next lower level: invert the most significant bit of the index field to find the other block in the "pseudoset". The miss penalty may become slightly longer.
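A tiny C illustration of the pseudoset trick described above: on a miss in the primary location, invert the most significant index bit and probe that second location before going to the next level. The 7-bit index width is a made-up parameter:

#include <stdio.h>

#define INDEX_BITS 7                        /* 128 sets (assumption) */

/* The "pseudo" companion of an index is found by inverting its MSB. */
static unsigned pseudo_index(unsigned index) {
    return index ^ (1u << (INDEX_BITS - 1));
}

int main(void) {
    unsigned index = 5;
    printf("primary set %u, second probe checks set %u\n",
           index, pseudo_index(index));     /* 5 -> 69 */
    return 0;
}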
Aligning basic blocks: the entry point is placed at the beginning of a cache block, which decreases the chance of a cache miss for sequential code.
Maximize accesses to the data loaded into the cache before it is replaced; improve temporal locality. Example: matrix multiplication X = Y × Z (min() is assumed to return the smaller of its arguments).

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  }

/* After: B = blocking factor */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B, N); j = j+1) {
        r = 0;
        for (k = kk; k < min(kk+B, N); k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      }

Total number of memory words accessed (after blocking) = 2N^3/B + N^2; y benefits from spatial locality and z benefits from temporal locality.
Up to 8 simultaneous prefetches are supported. Prefetching may interfere with demand misses, lowering performance.
Compiler-controlled prefetching. Faulting vs. nonfaulting: the prefetched address does or does not cause an exception for virtual address faults and protection violations. A normal load instruction can be considered a "faulting register prefetch instruction".
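As a concrete illustration of a nonfaulting prefetch inserted by software, here is a hedged C sketch using the GCC/Clang __builtin_prefetch hint (a cache prefetch rather than a register prefetch); the prefetch distance DIST is a made-up tuning parameter, not a value from the slides:

#include <stdio.h>

#define N 4096
#define DIST 16   /* prefetch distance in elements (tuning assumption) */

int main(void) {
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) {
        /* Nonfaulting hint: start fetching a[i + DIST] ahead of its use.
           If the hint turns out to be useless it is simply dropped, but
           it still costs an issue slot.                                  */
        __builtin_prefetch(&a[i + DIST < N ? i + DIST : N - 1]);
        sum += a[i];
    }
    printf("%f\n", sum);
    return 0;
}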
Four techniques for reducing hit time:
1. Small and simple caches
2. Avoiding address translation during indexing of the cache
3. Pipelined cache access
4. Trace caches
General design:
Use a small and simple cache for the 1st-level cache.
Keep the tags on chip and the data off chip for 2nd-level caches.
The recent emphasis is on a fast clock time, while hiding L1 misses with dynamic execution and using L2 caches to avoid going all the way to memory.
Problems with virtually addressed caches:
2. Context switching: the same virtual addresses of different processes refer to different physical addresses, requiring the cache to be flushed.
   Solution: widen the cache address tag with a process-identifier tag (PID).
4. I/O: I/O typically uses physical addresses, so it needs to interact with the cache (see Section 5.12).
[Figure: conventional cache organization versus overlapping the cache access with virtual-to-physical address translation.]
Overlap cache access with virtual address translation: this requires the cache index to remain invariant across translation.
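A short arithmetic check of that constraint (hypothetical parameters: 4 KB pages and 64-byte blocks, not values from the slides): the index plus block offset must fit inside the page offset, which bounds the number of sets and therefore forces associativity up as the cache grows.

#include <stdio.h>

int main(void) {
    unsigned page_bytes  = 4096;       /* 12-bit page offset (assumption)  */
    unsigned block_bytes = 64;         /*  6-bit block offset (assumption) */
    unsigned cache_bytes = 32 * 1024;

    /* Sets are limited so that index + block offset stay within the
       page offset, which translation never changes.                  */
    unsigned max_sets  = page_bytes / block_bytes;       /* 64  */
    unsigned blocks    = cache_bytes / block_bytes;      /* 512 */
    unsigned min_assoc = blocks / max_sets;              /* 8   */

    printf("a %u KB virtually indexed cache needs >= %u-way associativity\n",
           cache_bytes / 1024, min_assoc);
    return 0;
}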
Note that pipelined cache access increases the bandwidth of instructions rather than decreasing the actual latency of a cache hit.
5.6 Parallelism