Topic 8
Introduction
The five classic components of a computer: input, output, memory, datapath, and control (the processor comprises the datapath and the control).
Technology Trends
DRAM improvement ratios: roughly 4000:1 in capacity but only about 2.5:1 in latency. Memory performance improves about 7% per year, so the processor-memory gap (latency) grows about 50% per year. From 2005 onwards there has been little change in processor performance (per core).
[Figure: processor vs. DRAM performance improvement over time.]
Memory Hierarchy
Levels of the Memory Hierarchy (Typical Server)
CPU registers: ~1000 bytes, 300 ps (0.30 ns) access time
Cache: 64 KB / 256 KB / 2-4 MB, 1 ns / 3-10 ns / 10-20 ns access time
Memory: pages (inclusion property)
Disk storage: files
Capacity increases and speed decreases toward the lower levels of the hierarchy.
Memory Hierarchy
Levels of the Memory Hierarchy (Personal Mobile Device)
CPU registers: ~500 bytes, 500 ps (0.50 ns) access time
Cache: 64 KB / 256 KB, 2 ns / 10-20 ns access time
Capacity increases toward the lower levels; speed increases toward the upper levels.
ABCs of Caches
Cache: mainly means the first level of the memory hierarchy encountered once the address leaves the CPU; the term is also applied whenever buffering is employed to reuse commonly occurring items (file caches, name caches, and so on).
Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
Two different types of locality:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
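A minimal C sketch (illustrative only, not from the slides) of the two kinds of locality: the sequential walk over a[] exhibits spatial locality, while the repeated use of sum and of the loop code itself exhibits temporal locality.

#include <stdio.h>

int main(void) {
    static int a[1024];
    long sum = 0;

    /* Spatial locality: consecutive elements of a[] share cache blocks,
       so a miss on a[i] also brings in a[i+1], a[i+2], ...            */
    for (int i = 0; i < 1024; i++)
        sum += a[i];

    /* Temporal locality: sum and the loop instructions are referenced
       on every iteration and stay resident in the cache.              */
    printf("%ld\n", sum);
    return 0;
}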
Guideline: for a given implementation technology and power budget, smaller hardware can be made faster.
Hit Time << Miss Penalty (e.g., 1 clock cycle vs. 40 clock cycles)
[Figure: the cache (holding block X) sits between the processor and main memory (holding block Y); data moves to and from the processor through the cache.]
Cache Measures
CPU execution time incorporating cache performance:
CPU execution time = (CPU clock cycles + Memory stall cycles) × Clock cycle time
Memory stall cycles: the number of cycles during which the CPU is stalled waiting for a memory access.
Memory stall clock cycles = Number of misses × Miss penalty
  = IC × (Misses / Instruction) × Miss penalty
  = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty
  = IC × Reads per instruction × Read miss rate × Read miss penalty
    + IC × Writes per instruction × Write miss rate × Write miss penalty
Memory accesses consist of fetching instructions and reading/writing data.
Example
Assume we have a computer where the CPI is 1.0 when all memory accesses hit the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all accesses were cache hits?
Answer:
(A) If all accesses hit in the cache, CPI = 1.0 and there are no memory stalls, so
CPU time(A) = (IC × CPI + 0) × Clock cycle time = IC × Clock cycle time
(B) With a 2% miss rate and CPI = 1.0, we need the memory stall cycles:
Memory stalls = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty = IC × (1 + 50%) × 2% × 25 = 0.75 × IC
CPU time(B) = (IC + 0.75 × IC) × Clock cycle time = 1.75 × IC × Clock cycle time
The performance ratio is the inverse ratio of the CPU execution times: CPU time(B) / CPU time(A) = 1.75.
The computer with no cache misses is 1.75 times faster.
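A small C check of this example (a sketch only; the 1.5 accesses per instruction, 2% miss rate, and 25-cycle penalty come from the problem statement above):

#include <stdio.h>

int main(void) {
    double cpi = 1.0;                /* CPI when every access hits      */
    double accesses_per_instr = 1.5; /* 1 instruction fetch + 0.5 data  */
    double miss_rate = 0.02;
    double miss_penalty = 25.0;      /* clock cycles                    */

    /* Memory stall cycles per instruction = 1.5 * 0.02 * 25 = 0.75 */
    double stalls = accesses_per_instr * miss_rate * miss_penalty;

    /* IC and the clock cycle time cancel out of the ratio. */
    double speedup = (cpi + stalls) / cpi;
    printf("speedup with a perfect cache = %.2f\n", speedup);  /* 1.75 */
    return 0;
}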
Direct mapped: a block has only one place it can appear in the cache. The mapping is usually (Block address) MOD (Number of blocks in cache).
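A one-function C illustration of the MOD mapping (the 8-block cache size is a made-up example value):

#include <stdio.h>

/* Direct-mapped placement: (Block address) MOD (Number of blocks in cache). */
static unsigned cache_block(unsigned block_address, unsigned num_blocks) {
    return block_address % num_blocks;
}

int main(void) {
    /* With 8 cache blocks, memory block 12 can only go to cache block 4. */
    printf("%u\n", cache_block(12, 8));
    return 0;
}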
The block offset selects the desired data from the block, the index field selects the set, and the tag field is compared against the CPU address for a hit.
Use the cache index to select the cache set.
Check the tag on each block in that set; there is no need to check the index or block offset.
A valid bit is added to the tag to indicate whether or not the entry contains a valid address.
Select the desired bytes using the block offset.
Increasing associativity shrinks the index and expands the tag.
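A hedged C sketch of splitting a byte address into block offset, index, and tag. The field widths (64-byte blocks, 128 sets) are assumptions for illustration, not parameters of any particular machine:

#include <stdio.h>
#include <stdint.h>

#define BLOCK_OFFSET_BITS 6   /* 64-byte blocks (assumption) */
#define INDEX_BITS        7   /* 128 sets       (assumption) */

int main(void) {
    uint32_t addr = 0x12345678;

    uint32_t offset = addr & ((1u << BLOCK_OFFSET_BITS) - 1);
    uint32_t index  = (addr >> BLOCK_OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (BLOCK_OFFSET_BITS + INDEX_BITS);

    /* The index selects the set, the tag is compared against the stored
       tags of the blocks in that set, and the offset picks the bytes.   */
    printf("tag=0x%x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}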
[Figure: two-way set-associative cache. The cache index selects a set; the address tag is compared in parallel against the cache tags of both blocks in the set (each guarded by a valid bit); the comparator outputs are ORed to form the hit signal, and a mux (Sel0/Sel1) selects the matching cache block's data.]
Block replacement strategies:
Random: the victim block is randomly selected (to spread allocation uniformly).
LRU: the Least Recently Used block is removed (relies on locality of reference).
FIFO: First In, First Out (the oldest block is removed).
Data cache misses per 1000 instructions for various replacement strategies:

            2-way                   4-way                   8-way
Size        LRU    Random  FIFO     LRU    Random  FIFO     LRU    Random  FIFO
16 KB       114.1  117.3   115.5    111.7  115.1   113.3    109.0  111.8   110.4
64 KB       103.4  104.3   103.9    102.4  102.3   103.1    99.7   100.5   100.3
256 KB      92.2   92.1    92.5     92.1   92.1    92.5     92.1   92.1    92.5
There is little difference between LRU and random for the largest cache size, with LRU outperforming the others for smaller caches. FIFO generally outperforms random for the smaller cache sizes.
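A sketch of one common way to implement LRU for a single set (a hypothetical 4-way set tracked with small age counters; real hardware often uses approximations):

#include <stdio.h>

#define WAYS 4

/* age[w] = accesses since the block in way w was last used. */
static unsigned age[WAYS];

/* Call on every access to the set: pass the way that hit, or -1 on a
   miss. Returns the way used on a hit, or the LRU victim on a miss.  */
static int touch(int hit_way) {
    int way = hit_way;
    if (way < 0) {                               /* miss: pick the oldest */
        way = 0;
        for (int w = 1; w < WAYS; w++)
            if (age[w] > age[way]) way = w;
    }
    for (int w = 0; w < WAYS; w++) age[w]++;     /* everyone ages...      */
    age[way] = 0;                                /* ...except the MRU way */
    return way;
}

int main(void) {
    touch(-1); touch(-1); touch(-1); touch(-1);  /* fill ways 0..3 */
    touch(0);                                    /* reuse way 0    */
    printf("next victim: way %d\n", touch(-1));  /* way 1 is LRU   */
    return 0;
}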
[Figure: the processor writes into the cache and into a write buffer that sits between the cache and DRAM.]
A write buffer is needed between the cache and memory:
Processor: writes data into the cache and the write buffer.
Memory controller: writes the contents of the buffer to memory.
The write buffer is just a FIFO; a typical number of entries is 4.
Write allocate: the block is allocated on a write miss, followed by the write-hit actions.
No-write allocate: write misses do not affect the cache; the block is modified only in the lower-level memory.
With no-write allocate, blocks stay out of the cache until the program tries to read them; with write allocate, even blocks that are only written will be in the cache.
Write through with write allocate:
on hits it writes to the cache and to main memory;
on misses it updates the block in main memory and brings the block into the cache.
Bringing the block into the cache on a miss does not make much sense in this combination, because the next hit to this block will generate a write to main memory anyway (per the write-through policy).
Write through with no-write allocate:
on hits it writes to the cache and to main memory;
on misses it updates the block in main memory without bringing the block into the cache.
Subsequent writes to the block will update main memory anyway (write-through policy), so some time is saved by not bringing the block into the cache on a miss, since doing so appears useless.
Write back with write allocate:
on hits it writes to the cache, setting the dirty bit for the block; main memory is not updated;
on misses it updates the block in main memory and brings the block into the cache.
Subsequent writes to the same block, if the block originally caused a miss, will now hit in the cache and just set the dirty bit. That eliminates extra memory accesses and results in very efficient execution compared with the write-through-with-write-allocate combination.
Write back with no-write allocate:
on hits it writes to the cache, setting the dirty bit for the block; main memory is not updated;
on misses it updates the block in main memory without bringing the block into the cache.
Subsequent writes to the same block, if the block originally caused a miss, will keep missing and result in very inefficient execution.
Example: assume a cache that starts empty and the following sequence of memory operations:
Write Mem[100]; Write Mem[100]; Read Mem[200]; Write Mem[200]; Write Mem[100]
With no-write allocate: 4 misses and 1 hit.
With write allocate: Write Mem[100] (write miss); Write Mem[100] (write hit); Read Mem[200] (read miss); Write Mem[200] (write hit); Write Mem[100] (write hit), i.e. 2 misses and 3 hits.
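A toy simulation of this trace (an assumption-laden sketch: the cache is modeled as a never-evicting set of allocated block addresses, which is enough to reproduce the 4-miss/1-hit and 2-miss/3-hit counts):

#include <stdio.h>

#define MAX 16

static unsigned cached[MAX];
static int ncached;

static int lookup(unsigned blk) {
    for (int i = 0; i < ncached; i++)
        if (cached[i] == blk) return 1;
    return 0;
}

static void insert(unsigned blk) {
    if (ncached < MAX) cached[ncached++] = blk;
}

/* ops[i]: 'R' = read, 'W' = write. Returns the number of misses. */
static int run(const char *ops, const unsigned *addr, int n, int write_allocate) {
    int misses = 0;
    ncached = 0;
    for (int i = 0; i < n; i++) {
        if (lookup(addr[i])) continue;           /* hit                   */
        misses++;                                /* miss                  */
        if (ops[i] == 'R' || write_allocate)     /* reads always allocate */
            insert(addr[i]);
    }
    return misses;
}

int main(void) {
    const char ops[]      = { 'W', 'W', 'R', 'W', 'W' };
    const unsigned addr[] = { 100, 100, 200, 200, 100 };
    printf("no-write allocate: %d misses\n", run(ops, addr, 5, 0));  /* 4 */
    printf("write allocate:    %d misses\n", run(ops, addr, 5, 1));  /* 2 */
    return 0;
}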
Calculate the percentage of the bus bandwidth used on average in the two cases: write back and write through.
Write Back:
Read miss:  10^7 × 0.1 × 0.75 × 2 = 0.15 × 10^7
Write hit:  10^7 × 0.9 × 0.25 × 1 = 0.225 × 10^7
Write miss: 10^7 × 0.1 × 0.25 × (2 + 1) = 0.075 × 10^7
Cache Performance
Example: Split Cache vs. Unified Cache. Which has the better average memory access time: a 16-KB instruction cache with a 16-KB data cache (split cache), or a 32-KB unified cache?

Miss rates:
Size    Instruction Cache   Data Cache   Unified Cache
16 KB   0.4%                11.4%        -
32 KB   -                   -            3.18%

Assume:
A hit takes 1 clock cycle and the miss penalty is 100 cycles.
A load or store takes 1 extra clock cycle on a unified cache, since there is only one cache port.
36% of the instructions are data transfer instructions; about 74% of the memory accesses are instruction references.
Answer:
Average memory access time (split)
= % instructions × (Hit time + Instruction miss rate × Miss penalty) + % data × (Hit time + Data miss rate × Miss penalty)
= 74% × (1 + 0.4% × 100) + 26% × (1 + 11.4% × 100) = 4.24
Average memory access time (unified)
= 74% × (1 + 3.18% × 100) + 26% × (1 + 1 + 3.18% × 100) = 4.44
The split cache has the better average memory access time.
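A small C check of the arithmetic (a sketch; it uses the rounded miss rates from the table above, so the split result lands near, not exactly on, the quoted 4.24):

#include <stdio.h>

int main(void) {
    double hit_time = 1.0, miss_penalty = 100.0;
    double instr_frac = 0.74, data_frac = 0.26;   /* fractions of memory accesses */

    /* Split cache: separate instruction and data miss rates. */
    double split = instr_frac * (hit_time + 0.004 * miss_penalty)
                 + data_frac  * (hit_time + 0.114 * miss_penalty);

    /* Unified cache: one miss rate, plus 1 extra cycle for loads/stores
       because the single port is busy with instruction fetches.         */
    double unified = instr_frac * (hit_time + 0.0318 * miss_penalty)
                   + data_frac  * (hit_time + 1.0 + 0.0318 * miss_penalty);

    printf("split   AMAT = %.2f cycles\n", split);   /* ~4.26 with rounded rates */
    printf("unified AMAT = %.2f cycles\n", unified); /* 4.44 */
    return 0;
}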
Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
CPU time = IC × (CPI_execution + (Memory accesses / Instruction) × Miss Rate × Miss Penalty) × Clock Cycle Time
Design of L2 Cache
Size
Since everything in L1 cache is likely to be in L2 cache, L2 cache should be much bigger than L1
Whether data in L1 is in L2
Novice approach: design L1 and L2 independently.
Multilevel inclusion: L1 data are always present in L2.
Advantage: consistency between I/O and the cache is easy to check (check L2 only).
Drawback: L2 must invalidate all L1 blocks that map onto a 2nd-level block being replaced, giving a slightly higher 1st-level miss rate (e.g., Intel Pentium 4: 64-byte blocks in L1 and 128-byte blocks in L2).
Giving priority to read misses over writes:
Problem: write through with write buffers creates RAW conflicts between main memory reads on cache misses and buffered writes.
If we simply wait for the write buffer to empty, the read miss penalty may increase (by 50% on the old MIPS 1000).
Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue.
Write back: suppose a read miss will replace a dirty block.
Normal: write the dirty block to memory, then do the read.
Instead: copy the dirty block to a write buffer, do the read, and then do the write.
The CPU stalls less since it restarts as soon as the read is done.
Write merging: the addresses in the write buffer are checked to see whether the address of the new data matches the address of a valid write buffer entry. If so, the new data are combined with that entry.
[Figure: a write buffer with 4 entries, each holding four 64-bit words; without merging (left), four sequential writes occupy four entries; with merging (right), the four writes are merged into a single entry.] Writing multiple words at the same time is faster than writing one word multiple times.
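A C sketch of the merging check (assumptions: a 4-entry buffer, each entry covering four consecutive 64-bit words; a new write whose block already has a valid entry is combined with it instead of taking a new slot):

#include <stdio.h>
#include <stdint.h>

#define ENTRIES 4
#define WORDS   4                    /* 64-bit words per entry         */

typedef struct {
    int      valid;
    uint64_t base;                   /* address of word 0 of the entry */
    uint64_t data[WORDS];
    int      word_valid[WORDS];
} WriteBufEntry;

static WriteBufEntry buf[ENTRIES];

/* Returns 1 if the write was merged or buffered, 0 if the buffer is full. */
static int buffer_write(uint64_t addr, uint64_t value) {
    uint64_t base = addr & ~(uint64_t)(WORDS * 8 - 1);   /* align to entry */
    int word = (int)((addr - base) / 8);

    for (int i = 0; i < ENTRIES; i++)                    /* try to merge   */
        if (buf[i].valid && buf[i].base == base) {
            buf[i].data[word] = value;
            buf[i].word_valid[word] = 1;
            return 1;
        }
    for (int i = 0; i < ENTRIES; i++)                    /* else allocate  */
        if (!buf[i].valid) {
            buf[i].valid = 1;
            buf[i].base = base;
            buf[i].data[word] = value;
            buf[i].word_valid[word] = 1;
            return 1;
        }
    return 0;                                            /* full: CPU must stall */
}

int main(void) {
    /* Four sequential 64-bit stores land in one entry instead of four. */
    for (uint64_t a = 0x100; a < 0x120; a += 8)
        buffer_write(a, a);
    printf("entry 0: valid=%d base=0x%llx\n",
           buf[0].valid, (unsigned long long)buf[0].base);
    return 0;
}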
Victim caches
Idea of recycling: remember what was most recently discarded on a cache miss, in case it is needed again, rather than simply discarding it or swapping it into L2.
Victim cache: a small, fully associative cache between a cache and its refill path.
It contains only blocks that were discarded from the cache because of a miss; the victims are checked on a miss before going to the next lower-level memory.
Victim caches of 1 to 5 entries are effective at reducing misses, especially for small, direct-mapped data caches. AMD Athlon: 8 entries.
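A C sketch of the victim-cache lookup path (a hypothetical 4-entry, fully associative buffer holding the addresses of recently evicted blocks, with simple FIFO replacement):

#include <stdio.h>
#include <stdint.h>

#define VICTIMS 4

static uint64_t victim_addr[VICTIMS];
static int      victim_valid[VICTIMS];
static int      next_slot;                 /* FIFO replacement within the buffer */

/* Record a block just evicted from the main cache. */
static void victim_insert(uint64_t block_addr) {
    victim_addr[next_slot] = block_addr;
    victim_valid[next_slot] = 1;
    next_slot = (next_slot + 1) % VICTIMS;
}

/* On a main-cache miss: returns 1 if the block is in the victim cache
   (cheap refill), 0 if we must go to the next lower-level memory.     */
static int victim_lookup(uint64_t block_addr) {
    for (int i = 0; i < VICTIMS; i++)
        if (victim_valid[i] && victim_addr[i] == block_addr) return 1;
    return 0;
}

int main(void) {
    victim_insert(0x40);                   /* block 0x40 was just discarded    */
    printf("%d\n", victim_lookup(0x40));   /* 1: recovered without going lower */
    printf("%d\n", victim_lookup(0x80));   /* 0: go to the next level          */
    return 0;
}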
3 Cs of Cache Misses: compulsory, capacity, and conflict
3Cs Absolute Miss Rate (SPEC92)
2:1 Cache Rule: the miss rate of a 1-way associative cache of size X is about the same as the miss rate of a 2-way associative cache of size X/2.
[Figure: absolute miss rate per type (compulsory, capacity, conflict) versus cache size (1 KB to 128 KB) for 1-way through 8-way associativity.]
[Figure: relative miss rate per type (compulsory, capacity, conflict), as a percentage of all misses, versus cache size (1 KB to 128 KB) for 1-way through 8-way associativity.]
Five techniques for reducing miss rate:
1. Larger block size
2. Larger caches
3. Higher associativity
4. Way prediction and pseudo-associative caches
5. Compiler optimizations
Using the principle of locality (spatial locality): the larger the block, the greater the chance that parts of it will be used again.
[Figure: miss rate versus block size (16 to 256 bytes) for several cache sizes.]
However, the number of blocks is reduced for a cache of the same size, and larger blocks increase the miss penalty; they may also increase conflict misses and even capacity misses if the cache is small.
Usually, high memory latency and high bandwidth encourage a large block size.
[Figure: miss rate per type versus cache size; capacity misses shrink as the cache grows.]
Increasing the capacity of the cache reduces capacity misses (Figures 5.14 and 5.15).
Drawbacks: possibly longer hit time and higher cost.
Trend: larger L2 or L3 off-chip caches.
Beware: execution time is the only final measure! Will the clock cycle time increase as a result of having a more complicated cache? Hill [1988] suggested that the hit time for 2-way vs. 1-way associativity is about +10% for an external cache and +2% for an internal cache.
On a miss, a second cache entry is checked before going to the next lower level: invert the most significant bit of the index field to find the other block in the "pseudoset". The miss penalty may become slightly longer.
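A tiny C illustration of the pseudoset trick described above: on a miss in the primary location, invert the most significant index bit and probe that second location before going to the next level. The 7-bit index width is a made-up parameter:

#include <stdio.h>

#define INDEX_BITS 7                        /* 128 sets (assumption) */

/* The "pseudo" companion of an index is found by inverting its MSB. */
static unsigned pseudo_index(unsigned index) {
    return index ^ (1u << (INDEX_BITS - 1));
}

int main(void) {
    unsigned index = 5;
    printf("primary set %u, second probe checks set %u\n",
           index, pseudo_index(index));     /* 5 -> 69 */
    return 0;
}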
Aligning basic blocks: the entry point is placed at the beginning of a cache block, which decreases the chance of a cache miss for sequential code.
Maximize accesses to the data loaded into the cache before it is replaced; improve temporal locality. Example: matrix multiplication X = Y × Z (min() is assumed to return the smaller of its arguments).

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  }

/* After: B = blocking factor */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B, N); j = j+1) {
        r = 0;
        for (k = kk; k < min(kk+B, N); k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      }

Total number of memory words accessed (after blocking) = 2N^3/B + N^2; y benefits from spatial locality and z benefits from temporal locality.
Up to 8 simultaneous prefetches are supported. Prefetching may interfere with demand misses, lowering performance.
Compiler-controlled prefetching. Faulting vs. nonfaulting: the prefetched address does or does not cause an exception for virtual address faults and protection violations. A normal load instruction can be considered a "faulting register prefetch instruction".
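As a concrete illustration of a nonfaulting prefetch inserted by software, here is a hedged C sketch using the GCC/Clang __builtin_prefetch hint (a cache prefetch rather than a register prefetch); the prefetch distance DIST is a made-up tuning parameter, not a value from the slides:

#include <stdio.h>

#define N 4096
#define DIST 16   /* prefetch distance in elements (tuning assumption) */

int main(void) {
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) {
        /* Nonfaulting hint: start fetching a[i + DIST] ahead of its use.
           If the hint turns out to be useless it is simply dropped, but
           it still costs an issue slot.                                  */
        __builtin_prefetch(&a[i + DIST < N ? i + DIST : N - 1]);
        sum += a[i];
    }
    printf("%f\n", sum);
    return 0;
}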
Four techniques for reducing hit time:
1. Small and simple caches
2. Avoiding address translation during indexing of the cache
3. Pipelined cache access
4. Trace caches
General design:
Use a small and simple cache for the 1st-level cache.
Keep the tags on chip and the data off chip for 2nd-level caches.
The recent emphasis is on a fast clock time, while hiding L1 misses with dynamic execution and using L2 caches to avoid going all the way to memory.
Problems with virtually addressed caches:
2. Context switching: the same virtual addresses of different processes refer to different physical addresses, requiring the cache to be flushed.
   Solution: widen the cache address tag with a process-identifier tag (PID).
4. I/O: I/O typically uses physical addresses, so it needs to interact with the cache (see Section 5.12).
[Figure: conventional cache organization versus overlapping the cache access with virtual-to-physical address translation.]
Overlap cache access with virtual address translation: this requires the cache index to remain invariant across translation.
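A short arithmetic check of that constraint (hypothetical parameters: 4 KB pages and 64-byte blocks, not values from the slides): the index plus block offset must fit inside the page offset, which bounds the number of sets and therefore forces associativity up as the cache grows.

#include <stdio.h>

int main(void) {
    unsigned page_bytes  = 4096;       /* 12-bit page offset (assumption)  */
    unsigned block_bytes = 64;         /*  6-bit block offset (assumption) */
    unsigned cache_bytes = 32 * 1024;

    /* Sets are limited so that index + block offset stay within the
       page offset, which translation never changes.                  */
    unsigned max_sets  = page_bytes / block_bytes;       /* 64  */
    unsigned blocks    = cache_bytes / block_bytes;      /* 512 */
    unsigned min_assoc = blocks / max_sets;              /* 8   */

    printf("a %u KB virtually indexed cache needs >= %u-way associativity\n",
           cache_bytes / 1024, min_assoc);
    return 0;
}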
Note that pipelined cache access increases the bandwidth of instructions rather than decreasing the actual latency of a cache hit.
5.6 Parallelism