
Hennessy and Patterson, Computer Architecture: A Quantitative Approach (3rd Ed.), Chapter 5: Memory Hierarchy Design.

Cache Miss Penalty Reduction


As CPU clock rates have increased much faster than DRAM access times, the relative cost of a cache miss has actually grown over time.

#1 Multilevel Caches
The wide gap between CPU and main-memory access times can lead to a catch-22:
o Fast cache: small size, so a low number of hits.
o Large cache: slow access time, so the CPU stalls.
Can we have the best of both? Consider:
o Level 1 cache: matches the CPU cycle time, but has low total capacity.
o Level 2 cache: buffered from the CPU cycle time, but has a high hit ratio.

How might this be modeled analytically?


Consider the Average Memory Access Time:
o AMAT = Hit Time(L1) + Miss Rate(L1) × Miss Penalty(L1)
What is the L1 miss penalty expressed in terms of? Therefore:
o Miss Penalty(L1) = Hit Time(L2) + Miss Rate(L2) × Miss Penalty(L2)
Implications (a small helper computing this appears below):
o L2 is accessed at the rate of L1 misses.
o L1 is designed to minimize hit time.
o L2 is designed to maximize the global (cache) hit rate.
Local (cache) Miss Rate = Number of Misses at a Cache Level / Number of Memory Accesses Reaching that Cache Level
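A minimal sketch of this decomposition in Python (the parameter values below are invented for illustration, not taken from the notes):

def amat_two_level(hit_l1, miss_rate_l1, hit_l2, miss_rate_l2, penalty_l2):
    # Miss Penalty(L1) is itself expressed via the L2 parameters,
    # exactly as in the formulas above.
    miss_penalty_l1 = hit_l2 + miss_rate_l2 * penalty_l2
    return hit_l1 + miss_rate_l1 * miss_penalty_l1

# Assumed numbers: 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit,
# 40% local L2 miss rate, 100-cycle memory access.
print(amat_two_level(1, 0.05, 10, 0.40, 100))  # 1 + 0.05 * (10 + 0.40 * 100) = 3.5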



Global (cache) Miss Rate = Number of Misses up to the Specified Cache Level / Total Number of CPU Memory References Generated

With respect to L1, the global miss rate is therefore:
o Miss Rate(L1)
With respect to L2, the global miss rate is therefore:
o Miss Rate(L1) × Miss Rate(L2)
Note:
o The L2 local miss rate will look poor. Why? L2 sees only the references that have already missed in L1, so most of their locality has been filtered out.
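A numeric illustration (the figures are invented for the example): suppose the CPU generates 1000 memory references, 40 of which miss in L1, and 20 of those 40 also miss in L2. Then:
o Local Miss Rate(L2) = 20 / 40 = 50%
o Global Miss Rate(L2) = 20 / 1000 = 2%
The local figure looks poor even though L2 is doing useful work.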

L2 cache policies
How large should L2 be?
(a) Multilevel Inclusion: data in L1 is always present in L2.
Implications:
o L2 size >> L1 size;
o Consistency between memory contents is ensured;
L1/L2 block size policy?
o Case of L1 block < L2 block: replacing an L2 block forces a flush of all L1 blocks mapped to that L2 block (a minimal sketch of this invalidation follows).
(b) Multilevel Exclusion: data in L1 is never in L2.
o An L1 miss that hits in L2 swaps the L1 and L2 blocks.
Summary:
o L1 is designed to minimize hit time;
o L2 is designed to maximize hits.
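A minimal sketch of the inclusion bookkeeping, assuming (for illustration only) 16-byte L1 blocks and 64-byte L2 blocks:

L1_BLOCK = 16   # assumed L1 block size in bytes
L2_BLOCK = 64   # assumed L2 block size in bytes

l1 = set()  # set of L1 block-aligned addresses currently cached in L1

def evict_l2_block(l2_addr):
    # On replacing an L2 block, invalidate every L1 block it contains,
    # preserving the inclusion invariant (L1 contents are a subset of L2).
    for offset in range(0, L2_BLOCK, L1_BLOCK):
        l1.discard(l2_addr + offset)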

#2 Critical Word First and Early Restart


Motivation: when a cache miss occurs, don't wait for the complete cache block to be filled before forwarding the requested word to the CPU.
Critical Word First
o Method: request the word causing the miss first and forward it to the CPU.


o Benefit: the CPU commences execution whilst the remainder of the block is stored in the cache.
o Drawback: a more complex memory interface, which may actually delay access time.
Early Restart
o Interface to MM: fetch the words of a block in the order they are stored.
o Method: forward the word causing the miss to the CPU as soon as it is encountered.
o Advantage: simple memory interface.
Note: typically only of significance as block size increases. (A sketch contrasting the two fill orders follows.)
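A small sketch contrasting the two fill orders (the function name and parameters are illustrative, not from the notes):

def fill_order(block_words, miss_word, critical_word_first):
    # With critical word first, the missed word is requested first and the
    # fill wraps around; with early restart, words arrive in stored order
    # and the CPU resumes as soon as the missed word appears.
    if critical_word_first:
        return [(miss_word + i) % block_words for i in range(block_words)]
    return list(range(block_words))

# For an 8-word block with a miss on word 5:
# fill_order(8, 5, True)  -> [5, 6, 7, 0, 1, 2, 3, 4]  CPU resumes immediately
# fill_order(8, 5, False) -> [0, 1, 2, 3, 4, 5, 6, 7]  CPU resumes once word 5 arrives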

#3 Priority to Read Misses Over Writes


Writes typically employ a write buffer. Implication: the write buffer may contain the value needed by a read miss.
Method: check the write buffer for a hit first (a minimal sketch follows).
Write Through policy
o Enforces MM consistency, thus a write buffer is always necessary.
Write Back policy
o Waits until a dirty block is replaced before the block write-back commences.
o Can a write buffer aid the write-back policy as well as write through?
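A minimal sketch of the write-buffer check on a read miss (the data structures are illustrative):

write_buffer = {}  # address -> value waiting to be written to memory

def read_miss(addr, memory):
    # The write buffer is checked first, since it may hold the most
    # recent value for this address; otherwise the read proceeds to
    # memory ahead of the queued writes.
    if addr in write_buffer:
        return write_buffer[addr]
    return memory[addr]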

#4 Victim Cache
Write-back policy:
o At some point a dirty block must be replaced.
o Let such a block be a victim.
o A small (1 to 5 block) victim cache is inserted between the cache and the next memory level.
o IF (a hit is made in the victim cache), THEN (swap the block with an alternative cache block). See figure 1 and the sketch following it.



Figure 1: Relationship between the victim cache and other forms of memory. The victim cache acts as a waste basket from which the most recently discarded items can be retrieved.
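A minimal sketch of the victim-cache lookup just described (the capacity and data structures are assumptions for illustration):

from collections import OrderedDict

victim = OrderedDict()   # tiny, fully associative store of discarded blocks
VICTIM_CAPACITY = 4      # assumed size, within the 1-to-5-block range above

def on_cache_miss(addr, evicted_addr, evicted_block):
    # On a main-cache miss, probe the victim cache; on a victim hit the
    # blocks are swapped, otherwise the newly evicted block is deposited.
    hit_block = victim.pop(addr, None)
    victim[evicted_addr] = evicted_block
    if len(victim) > VICTIM_CAPACITY:
        victim.popitem(last=False)   # discard the oldest victim
    return hit_block                 # None means go to the next memory level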

Summary
Four basic policies are utilized in cache miss penalty reduction:
o More the merrier: multilevel caches.
o Impatience: forward the required word to the CPU before the entire block is read.
o Preference: stall writes, but prioritize reads.
o Recycling: once the high cost of a block transfer has been paid, attempt to retain the block for as long as possible.

Reducing Cache Miss Penalty and Rates: Overlapping Execution with Memory Accesses
#1 Non-blocking Caches: Stall Reduction on Cache Misses
Out-of-order execution, and compiler optimizations that schedule loads away from operand use, imply that the CPU should not stall on a cache miss. Non-blocking (lockup-free) caches support hits on instructions issued after a cache miss. Cache controller complexity increases:
o It is required to track multiple misses and hits (a simplified sketch follows).
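A deliberately simplified sketch of that bookkeeping, assuming (for illustration) a "hit under 4 misses" controller; real controllers use dedicated miss-status registers:

outstanding = {}   # block address -> list of requests waiting on that miss
MAX_MISSES = 4     # assumed depth of hit-under-miss support

def stall():
    # Placeholder: in hardware the CPU would stall until an entry frees.
    raise RuntimeError("all miss entries busy")

def access(addr, block_addr, cache):
    if addr in cache:
        return cache[addr]                    # hit under (possibly many) misses
    if block_addr in outstanding:
        outstanding[block_addr].append(addr)  # merge with an in-flight miss
    elif len(outstanding) < MAX_MISSES:
        outstanding[block_addr] = [addr]      # issue a new miss to memory
    else:
        stall()                               # controller out of miss entries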



Figure 2 indicates that:
o FP programs benefit from supporting greater depths of hit under miss.
o INT programs gain most from a single hit under miss.

Figure 2: SPEC92 benchmark of memory stall time under different levels of hit-under-miss provision.

#2 Hardware Instruction Prefetch (see presentation #5)


Motivation: the spatial locality principle indicates that the block following a fetched block will also be useful.
Method: on an instruction/data miss, fetch the target block + the following block.
o Target block stored in the cache;
o Following block stored in a stream buffer.
IF (miss resolved in the stream buffer) THEN (read the block into the cache) AND (prefetch the next block into the stream buffer). (A sketch follows.)
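A minimal sketch of the stream-buffer scheme (the one-block buffer and dict-like memory are assumptions for illustration):

from collections import deque

stream_buffer = deque(maxlen=1)   # assumed one-block stream buffer

def fetch_block(block_addr, cache, memory):
    # On a miss, check the stream buffer first; on a stream-buffer hit the
    # block moves into the cache, and in either case the next sequential
    # block is prefetched into the buffer.
    if stream_buffer and stream_buffer[0][0] == block_addr:
        _, data = stream_buffer.popleft()
        cache[block_addr] = data                # miss resolved in the buffer
    else:
        cache[block_addr] = memory[block_addr]  # ordinary miss
    stream_buffer.append((block_addr + 1, memory[block_addr + 1]))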

Example
The UltraSPARC III has a 64 KB data cache in which prefetching reduces the data miss rate by 20%. A prefetch hit takes one clock cycle, whereas a miss on both the cache and the prefetch buffer costs 15 clock cycles. Data references per instruction are 22%, and Table 1 details misses per instruction for different cache sizes. What is the effective miss rate of the UltraSPARC III using prefetching? How much bigger a data cache would be needed in the UltraSPARC III to match the average access time if prefetching were not available?


Table 1: Misses per 1000 instructions for instruction, data, and unified caches

Size               8 KB    16 KB   32 KB   64 KB   128 KB  256 KB
Instruction Cache  8.16    3.82    1.36    0.61    0.30    0.02
Data Cache         44.0    40.9    38.4    36.9    35.3    32.6
Unified Cache      63.0    51.0    43.3    39.4    36.2    32.9
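One possible worked solution, reading "reduces the data miss rate by 20%" as 20% of former misses now hitting in the prefetch buffer (that reading is an assumption):
o Data references per 1000 instructions = 1000 × 22% = 220.
o Miss rate of the 64 KB data cache (Table 1) = 36.9 / 220 ≈ 16.8%.
o Average stall per data reference with prefetching = 16.8% × (20% × 1 + 80% × 15) ≈ 16.8% × 12.2 ≈ 2.05 cycles.
o Effective miss rate = 2.05 / 15 ≈ 13.6%.
o Even the 256 KB data cache misses at 32.6 / 220 ≈ 14.8% > 13.6%, so no cache in Table 1, even one four times larger, matches the average access time achieved with prefetching.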

Summary
Out-of-order processors require non-blocking cache operation in order to avoid stalling the CPU.
o Several misses are overlapped so that CPU operation continues whilst MM is accessed.
Prefetching attempts to anticipate cache fetch requirements, but:
o It is not good for power-sensitive embedded applications.

Hit Time Reduction


Hit time remains significant because a hit (ideally) is the typical case, so hit time has a direct effect on CPU stall cycles.

#1 Small, Simple Caches


Tag search is very time consuming. Direct-mapped caches minimize this; see figure 3 and the lookup sketch after its caption.



Figure 3: Access time as a function of cache associativity and size.
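A minimal sketch of why a direct-mapped lookup is fast: the index selects exactly one line, so only a single tag comparison is needed (the field widths are assumptions for illustration):

BLOCK_BITS = 5   # assumed 32-byte blocks
INDEX_BITS = 9   # assumed 512 cache lines

def lookup(addr, tags):
    # Split the address into index and tag; one comparison decides the hit.
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    return tags[index] == tag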

General Cache Summary


Techniques investigated for improving average cache access time are typically motivated by one of three properties:
o Miss rate, miss penalty, hit time.
However, one cannot focus on improving a single parameter, as optimizations for one may be detrimental to another. Multilevel caches provide the basis for combining conflicting requirements. CPU features also have a significant impact on the type of cache, e.g. out-of-order execution. Table 2 summarizes the cache optimizations, their complexity, and their support in practice.



Table 2: Cache optimization summary


Technique                 Miss Penalty  Miss Rate  Hit Time  H/W Complexity  Comment
Multilevel caches         +                                  2               Widely used; difficult if L1 block size differs from L2
Critical word first       +                                  2               Widely used
Prioritized read misses   +                                  1               Widely used
Merging write buffer      +                                  1               Widely used with write-through policies
Victim cache              +             +                    2               AMD Athlon may hold up to 8 blocks
Large block size          -             +                    0               Pentium 4 L2 has up to 128-byte blocks
Large cache size                        +          -         1               Widely used in L2
High associativity                      +          -         1               Widely used
Way prediction                                     +         2               UltraSPARC III I-cache; MIPS R4300 D-cache
Compiler miss reduction                 +                    0               Complex compiler optimizations necessary
Non-blocking caches       +                                  3               A must for out-of-order CPUs
H/W prefetch              +             +                    2-3             Most prefetch instructions; UltraSPARC III prefetches data
Small and simple caches                 -          +         0               Widely used

