
COEN6741

Chap 5.1
11/10/2003
(Dr. Sofiène Tahar)
COEN6741
Computer Architecture and Design
Chapter 5
Memory Hierarchy Design
COEN6741
Chap 5.2
11/10/2003
Outline
Introduction
Memory Hierarchy
Cache Memory
Cache Performance
Main Memory
Virtual Memory
Translation Lookaside Buffer
Alpha 21064 Example
COEN6741
Chap 5.3
11/10/2003
Computer Architecture Topics
[Figure: Computer architecture topics — instruction set architecture (addressing, protection, exception handling); pipelining and instruction-level parallelism (pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, DSP); memory hierarchy (L1 cache, L2 cache, DRAM; coherence, bandwidth, latency); input/output and storage (disks, WORM, tape; emerging technologies, interleaving, bus protocols, RAID); VLSI.]
COEN6741
Chap 5.4
11/10/2003
Who Cares About the Memory Hierarchy?
[Figure: Processor-DRAM memory gap (latency) — performance vs. time, 1980-2000, on a log scale. Processor (CPU) performance grows ~60%/yr (2X every 1.5 years) while DRAM performance grows ~9%/yr (2X every 10 years); the processor-memory performance gap grows about 50% per year.]
COEN6741
Chap 5.5
11/10/2003
Levels of the Memory Hierarchy
CPU Registers: 100s bytes, <1s ns
  Staging/transfer unit to the next level: instruction operands, 1-8 bytes, managed by program/compiler
Cache: 10s-100s KBytes, 1-10 ns, $10/MByte
  Transfer unit: blocks, 8-128 bytes, managed by the cache controller
Main Memory: MBytes, 100-300 ns, $1/MByte
  Transfer unit: pages, 512-4K bytes, managed by the OS
Disk: 10s GBytes, 10 ms (10,000,000 ns), $0.0031/MByte
  Transfer unit: files, MBytes, managed by user/operator
Tape: infinite capacity, sec-min access time, $0.0014/MByte
Moving down the hierarchy the levels get larger; moving up (toward the upper level) they get faster.
COEN6741
Chap 5.6
11/10/2003
What is a cache?
Small, fast storage used to improve average access
time to slow memory.
Exploits spatial and temporal locality
In computer architecture, almost everything is a cache!
Registers: a cache on variables (software managed)
First-level cache: a cache on the second-level cache
Second-level cache: a cache on memory
Memory: a cache on disk (virtual memory)
TLB: a cache on the page table
Branch prediction: a cache on prediction information?
Hierarchy: Proc/Regs -> L1-Cache -> L2-Cache -> Memory -> Disk, Tape, etc. (bigger going down, faster going up)
COEN6741
Chap 5.7
11/10/2003
Relationship of Caching and Pipelining
[Figure: the five-stage pipeline datapath (IF/ID, ID/EX, EX/MEM, MEM/WB latches) with Next PC / Next SEQ PC logic, register file (RS1, RS2, RD), sign extend, immediate mux, ALU, zero test, and write-back mux — the instruction memory is an I-Cache and the data memory is a D-Cache.]
COEN6741
Chap 5.8
11/10/2003
The Principle of Locality
The Principle of Locality:
Programs access a relatively small portion of the address
space at any instant of time.
Two Different Types of Locality:
Temporal Locality (Locality in Time): If an item is
referenced, it will tend to be referenced again soon (e.g.,
loops, reuse)
Spatial Locality (Locality in Space): If an item is
referenced, items whose addresses are close by tend to
be referenced soon (e.g., straightline code, array access)
For the last 15 years, HW (hardware) has relied on locality
for speed
COEN6741
Chap 5.9
11/10/2003
A Modern Memory Hierarchy
By taking advantage of the principle of locality:
Present the user with as much memory as is available in the
cheapest technology.
Provide access at the speed offered by the fastest technology.
Requires servicing faults on the processor
[Figure: a modern memory hierarchy — processor (control, datapath, registers), on-chip cache, second-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (disk/tape). Speed (ns): 1s, 10s, 100s, 10,000,000s (10s ms), 10,000,000,000s (10s sec). Size (bytes): 100s, Ks, Ms, Gs, Ts.]
COEN6741
Chap 5.10
11/10/2003
The Memory Abstraction
Association of <name, value> pairs
typically named as byte addresses
often values aligned on multiples of size
Sequence of Reads and Writes
Write binds a value to an address
Read of addr returns most recently written
value bound to that address
Interface signals: address (name), command (R/W), data (W), data (R), done
COEN6741
Chap 5.11
11/10/2003
Memory Hierarchy: Terminology
Hit: data appears in some block in the upper level
(example: Block X)
Hit Rate: the fraction of memory access found in the upper level
Hit Time: Time to access the upper level which consists of
access time + Time to determine hit/miss
Miss: data needs to be retrieved from a block in the
lower level (Block Y)
Miss Rate = 1 - (Hit Rate)
Miss Penalty: Time to replace a block in the upper level +
Time to deliver the block to the processor
Hit Time << Miss Penalty (500 instructions on the 21264!)
[Figure: blocks move between the lower-level memory (Blk Y) and the upper-level memory (Blk X), which exchanges data to/from the processor.]
COEN6741
Chap 5.12
11/10/2003
Cache Measures
Hit rate: fraction found in that level
So high that usually talk about Miss rate
Miss rate fallacy: just as MIPS is a poor proxy for CPU performance,
miss rate is a poor proxy for average memory access time
Average Memory-Access Time (AMAT)
= Hit time + Miss rate x Miss penalty (ns or clocks)
Miss penalty: time to replace a block from
lower level, including time to replace in CPU
access time: time to lower level
= f(latency to lower level)
transfer time: time to transfer block
=f(BW between upper & lower levels)
COEN6741
Chap 5.13
11/10/2003
The Cache Design Space
Several interacting dimensions:
cache size
block size
associativity
replacement policy
write-through vs. write-back
The optimal choice is a compromise
depends on access characteristics
workload
use (I-cache, D-cache, TLB)
depends on technology / cost
Simplicity often wins
[Figure: performance ("Good" to "Bad") as a function of a design parameter ("Less" to "More", Factor A vs. Factor B) for associativity, cache size, and block size.]
COEN6741
Chap 5.14
11/10/2003
Traditional Four Questions for
Memory Hierarchy Designers
Q1: Where can a block be placed in the upper level?
(Block placement)
Fully Associative, Set Associative, Direct Mapped
Q2: How is a block found if it is in the upper level?
(Block identification)
Tag/Block
Q3: Which block should be replaced on a miss?
(Block replacement)
Random, LRU, FIFO
Q4: What happens on a write?
(Write strategy)
Write Back or Write Through (with Write Buffer)
COEN6741
Chap 5.15
11/10/2003
Q1: Where can a block be
placed in the upper level?
Example cache has 8 block frames and memory has 32 blocks
Q2: How is a block found if it is in
the upper level?
Tag on each block
No need to check index or block offset
Increasing associativity shrinks index,
expands tag
The three portions of an address in a set-associative or direct-mapped cache: the Tag and Index (together the block address) and the Block offset.
Q3: Which block should be replaced
on a miss?
Easy for Direct Mapped
Set Associative or Fully Associative:
Random
LRU (Least Recently Used)
Miss rates, LRU vs. Random replacement:

Cache size   2-way LRU  2-way Random  4-way LRU  4-way Random  8-way LRU  8-way Random
16 KB        5.2%       5.7%          4.7%       5.3%          4.4%       5.0%
64 KB        1.9%       2.0%          1.5%       1.7%          1.4%       1.5%
256 KB       1.15%      1.17%         1.13%      1.13%         1.12%      1.12%
COEN6741
Chap 5.18
11/10/2003
Q4: What happens on a write?
Write-through: all writes update cache and underlying
memory/cache
Can always discard cached data - most up-to-date data is in
memory
Cache control bit: only a valid bit
Write-back: all writes simply update cache
Can't just discard cached data - may have to write it back to
memory
Cache control bits: both valid and dirty bits
Other Advantages:
Write-through:
memory (or other processors) always have latest data
Simpler management of cache
Write-back:
much lower bandwidth, since data often overwritten multiple
times
Better tolerance to long-latency memory?
COEN6741
Chap 5.19
11/10/2003
Write Policy:
(What happens on write-miss?)
Write allocate: allocate new cache line in cache
Usually means that you have to do a read miss to
fill in rest of the cache-line!
Alternative: per-word valid bits
Write non-allocate (or write-around):
Simply send write data through to the underlying
memory/cache - don't allocate a new cache line!
Write Buffer for Write Through
A Write Buffer is needed between the Cache and
Memory
Processor: writes data into the cache and the write buffer
Memory controller: write contents of the buffer to memory
Write buffer is just a FIFO:
Typical number of entries: 4
Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle
Memory system designer's nightmare:
Store frequency (w.r.t. time) -> 1 / DRAM write cycle
Write buffer saturation
[Figure: Processor -> Cache, with a Write Buffer between the Cache and DRAM.]
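The FIFO behavior described above can be sketched in a few lines of C (a minimal sketch; the struct and function names, and the 4-entry size, are illustrative rather than taken from any real controller):

/* Hedged sketch: a 4-entry FIFO write buffer. */
#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4

struct wb_entry { uint32_t addr; uint32_t data; };

struct write_buffer {
    struct wb_entry e[WB_ENTRIES];
    int head, tail, count;                       /* circular FIFO state */
};

/* Processor side: returns false (stall) when the buffer is saturated. */
static bool wb_enqueue(struct write_buffer *wb, uint32_t addr, uint32_t data) {
    if (wb->count == WB_ENTRIES) return false;   /* write-buffer saturation */
    wb->e[wb->tail] = (struct wb_entry){ addr, data };
    wb->tail = (wb->tail + 1) % WB_ENTRIES;
    wb->count++;
    return true;
}

/* Memory-controller side: drain one entry per DRAM write cycle. */
static bool wb_dequeue(struct write_buffer *wb, struct wb_entry *out) {
    if (wb->count == 0) return false;
    *out = wb->e[wb->head];
    wb->head = (wb->head + 1) % WB_ENTRIES;
    wb->count--;
    return true;
}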
COEN6741
Chap 5.21
11/10/2003
Simplest Cache: Direct Mapped
[Figure: a 4-byte direct-mapped cache (cache indices 0-3) in front of a 16-location memory (addresses 0x0-0xF).]
Location 0 can be occupied by data
from:
Memory location 0, 4, 8, ... etc.
In general: any memory location
whose 2 LSBs of the address are 0s
Address<1:0> => cache index
Which one should we place in the
cache?
How can we tell which one is in the
cache?
COEN6741
Chap 5.22
11/10/2003
Example: 1 KB Direct Mapped Cache
For a 2 ** N byte cache:
The uppermost (32 - N) bits are always the Cache Tag
The lowest M bits are the Byte Select (Block Size = 2 ** M)
[Figure: a 1 KB direct-mapped cache with 32-byte blocks. The 32-bit address splits into a Cache Tag <31:10> (example: 0x50), a Cache Index <9:5> (example: 0x01), and a Byte Select <4:0> (example: 0x00); tag and index together form the block address. Each cache entry holds a Valid Bit, the Cache Tag (stored as part of the cache state), and 32 bytes of Cache Data (Byte 0 ... Byte 31); the 32 entries together hold Byte 0 ... Byte 1023.]
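As a concrete illustration of this address split, here is a minimal C sketch of a lookup in such a 1 KB, 32-byte-block direct-mapped cache (the data structures and function name are illustrative):

/* Hedged sketch: tag/index/byte-select split for a 1 KB direct-mapped
 * cache with 32-byte blocks (32 lines); field widths follow the slide. */
#include <stdint.h>
#include <stdbool.h>

#define NUM_LINES  32        /* 1 KB / 32 B */
#define BLOCK_SIZE 32

struct dm_line { bool valid; uint32_t tag; uint8_t data[BLOCK_SIZE]; };
static struct dm_line cache[NUM_LINES];

/* Returns true on a hit and copies the byte; miss handling is omitted. */
bool dm_lookup(uint32_t addr, uint8_t *byte_out) {
    uint32_t byte_select = addr & 0x1F;          /* bits 4:0   */
    uint32_t index       = (addr >> 5) & 0x1F;   /* bits 9:5   */
    uint32_t tag         = addr >> 10;           /* bits 31:10 */
    struct dm_line *line = &cache[index];
    if (line->valid && line->tag == tag) {
        *byte_out = line->data[byte_select];
        return true;                             /* hit */
    }
    return false;                                /* miss */
}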
COEN6741
Chap 5.23
11/10/2003
Set Associative Cache
N-way set associative: N entries for each Cache
Index
N direct mapped caches operate in parallel
Example: Two-way set associative cache
Cache Index selects a set from the cache
The two tags in the set are compared to the input in parallel
Data is selected based on the tag result
[Figure: a two-way set-associative cache — the Cache Index selects one set from each of the two banks (Cache Tag, Valid, Cache Data / Cache Block 0); the two tags are compared with the address tag (Adr Tag) in parallel, the compare results are ORed to form Hit, and Sel1/Sel0 drive the mux that selects the Cache Block data.]
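A minimal C sketch of the lookup just described, assuming a two-way set-associative cache with 64 sets, 32-byte blocks, and a single LRU bit per set (sizes and names are illustrative):

/* Hedged sketch: two-way set-associative lookup with LRU,
 * mirroring the figure's parallel tag compare. */
#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS   64
#define BLOCK_SIZE 32
#define WAYS       2

struct way { bool valid; uint32_t tag; uint8_t data[BLOCK_SIZE]; };
struct set { struct way w[WAYS]; uint8_t lru; /* index of least recently used way */ };
static struct set cache[NUM_SETS];

bool sa_lookup(uint32_t addr, uint8_t *byte_out) {
    uint32_t offset = addr & (BLOCK_SIZE - 1);
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_SETS;   /* Cache Index selects the set */
    uint32_t tag    = addr / (BLOCK_SIZE * NUM_SETS);
    struct set *s   = &cache[index];
    for (int i = 0; i < WAYS; i++) {                    /* both tags compared in parallel in HW */
        if (s->w[i].valid && s->w[i].tag == tag) {
            *byte_out = s->w[i].data[offset];
            s->lru = (uint8_t)(1 - i);                  /* the other way becomes LRU */
            return true;                                /* hit = OR of the compares */
        }
    }
    return false;                                       /* miss: victim would be way s->lru */
}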
COEN6741
Chap 5.24
11/10/2003
Disadvantage of Set Associative Cache
N-way Set Associative Cache versus Direct Mapped
Cache:
N comparators vs. 1
Extra MUX delay for the data
Data comes AFTER Hit/Miss decision and set selection
In a direct mapped cache, Cache Block is available
BEFORE Hit/Miss:
Possible to assume a hit and continue. Recover later if miss.
[Figure: the two-way set-associative organization from the previous slide, repeated for reference.]
[Figure: the organization of the data cache in the Alpha AXP 21064 microprocessor — 8 KB, direct mapped, 256 blocks of 32 bytes. The CPU address splits into a Tag <21>, Index <8>, and Block offset <5>; each cache entry holds Valid <1>, Tag <21>, and Data <256>; a tag comparator (=?) produces hit/miss, a 4:1 mux selects the requested data for Data out, and a write buffer sits on the path to lower-level memory (the circled steps 1-4 in the original figure mark the access sequence).]
COEN6741
Chap 5.26
11/10/2003
Cache Performance
Miss-oriented approach to memory access (CPI_Execution includes ALU and memory instructions):
CPUtime = IC x (CPI_Execution + MemAccess/Inst x MissRate x MissPenalty) x CycleTime
CPUtime = IC x (CPI_Execution + MemMisses/Inst x MissPenalty) x CycleTime

Separating out the memory component entirely, with AMAT = Average Memory Access Time
(CPI_AluOps does not include memory instructions):
CPUtime = IC x (AluOps/Inst x CPI_AluOps + MemAccess/Inst x AMAT) x CycleTime
AMAT = HitTime + MissRate x MissPenalty
     = (HitTime_Inst + MissRate_Inst x MissPenalty_Inst)
     + (HitTime_Data + MissRate_Data x MissPenalty_Data)
COEN6741
Chap 5.27
11/10/2003
Impact on Performance
Suppose a processor executes at
Clock Rate = 200 MHz (5 ns per cycle), Ideal (no misses) CPI = 1.1
50% arith/logic, 30% ld/st, 20% control
Suppose that 10% of memory operations get 50 cycle
miss penalty
Suppose that 1% of instructions get same miss penalty
CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles/ins)
    + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycles/miss)]
    + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycles/miss)]
    = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
The processor is stalled waiting for memory about 65% of the time (2.0 of the 3.1 cycles per instruction)!
AMAT = (1/1.3) x [1 + 0.01 x 50] + (0.3/1.3) x [1 + 0.1 x 50] = 2.54
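The arithmetic above is easy to reproduce; a small C program (values taken from this example) prints CPI = 3.1, a stall fraction of about 65%, and AMAT = 2.54:

/* Hedged sketch: reproduces the CPI and AMAT arithmetic of this example. */
#include <stdio.h>

int main(void) {
    double ideal_cpi = 1.1;
    double data_ops  = 0.30, inst_ops = 1.0;    /* memory ops per instruction */
    double data_mr   = 0.10, inst_mr  = 0.01;   /* miss rates */
    double penalty   = 50.0;                    /* cycles */

    double cpi  = ideal_cpi + data_ops * data_mr * penalty
                            + inst_ops * inst_mr * penalty;       /* = 3.1 */
    double amat = (1.0/1.3) * (1 + inst_mr * penalty)
                + (0.3/1.3) * (1 + data_mr * penalty);            /* ~= 2.54 */

    printf("CPI = %.2f, stall fraction = %.0f%%, AMAT = %.2f\n",
           cpi, 100.0 * (cpi - ideal_cpi) / cpi, amat);
    return 0;
}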
COEN6741
Chap 5.28
11/10/2003
Unified vs. Split Caches
Unified vs. separate I&D caches
Example:
16KB I&D: Inst miss rate=0.64%, Data miss rate=6.47%
32KB unified: Aggregate miss rate=1.99%
Which is better (ignore L2 cache)?
Assume 33% of instructions are data ops, so 75% of accesses are instruction fetches (1.0/1.33)
hit time=1, miss time=50
Note that data hit has 1 stall for unified cache (only one port)
AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24
[Figure: the two organizations — split (Harvard): Proc -> I-Cache-1 and D-Cache-1, backed by a Unified Cache-2; unified: Proc -> Unified Cache-1 -> Unified Cache-2.]
COEN6741
Chap 5.29
11/10/2003
How to Improve Cache Performance?
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
AMAT = HitTime + MissRate x MissPenalty
COEN6741
Chap 5.30
11/10/2003
Miss Rate Reduction
3 Cs: Compulsory, Capacity, Conflict
0. Larger cache
1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
3. Reducing Misses via Victim Cache
4. Reducing Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Misses by Compiler Optimizations
Danger of concentrating on just one parameter!
Prefetching comes in two flavors:
Binding prefetch: Requests load directly into register.
Must be correct address and register!
Non-Binding prefetch: Load into cache.
Can be incorrect. Frees HW/SW to guess!
CPUtime = IC x (CPI_Execution + MemAccesses/Inst x MissRate x MissPenalty) x CycleTime
COEN6741
Chap 5.31
11/10/2003
Where do misses come from?
Classifying Misses: 3 Cs
Compulsory: the very first access to a block cannot be in the cache,
so the block must be brought into the cache. Also called cold
start misses or first reference misses.
(Misses in even an Infinite Cache)
Capacity: if the cache cannot contain all the blocks needed
during execution of a program, capacity misses will occur due to
blocks being discarded and later retrieved.
(Misses in a Fully Associative, Size X Cache)
Conflict: if the block-placement strategy is set associative or
direct mapped, conflict misses (in addition to compulsory &
capacity misses) will occur because a block can be discarded and
later retrieved if too many blocks map to its set. Also called
collision misses or interference misses.
(Misses in an N-way Associative, Size X Cache)
4th C:
Coherence - Misses caused by cache coherence.
COEN6741
Chap 5.32
11/10/2003
3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate per type (0 to 0.14) vs. cache size (1, 2, 4, 8, 16, 32, 64, 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity; the stacked regions show conflict, capacity, and compulsory misses.]
COEN6741
Chap 5.33
11/10/2003
0. Cache Size
Old rule of thumb: 2x size => 25% cut in miss rate
What does it reduce?
Thrashing reduction!!!
[Figure: the 3Cs miss-rate chart again — miss rate vs. cache size (1-128 KB) for 1-way through 8-way associativity, with the capacity and compulsory components marked; larger caches cut the capacity (and conflict) misses.]
COEN6741
Chap 5.34
11/10/2003
Cache Organization?
Assume total cache size not changed:
What happens if:
1) Change Block Size:
2) Change Associativity:
3) Change Compiler:
Which of 3Cs is obviously affected?
COEN6741
Chap 5.35
11/10/2003
1. Larger Block Size (fixed cache size & associativity)
Reduced compulsory misses
Increased conflict misses
What else drives up block size?
[Figure: miss rate (0% to 25%) vs. block size (16, 32, 64, 128, 256 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K.]
COEN6741
Chap 5.36
11/10/2003
2. Higher Associativity
[Figure: the 3Cs miss-rate chart — miss rate vs. cache size (1-128 KB) for 1-way through 8-way associativity; higher associativity reduces the conflict component, while capacity and compulsory misses are unchanged.]
COEN6741
Chap 5.37
11/10/2003
3Cs Relative Miss Rate
[Figure: the same data plotted as relative (100%-stacked) shares — the fraction of conflict, capacity, and compulsory misses vs. cache size (1-128 KB) for 1-way through 8-way associativity.]
Flaws: for fixed block size
Good: insight => invention
COEN6741
Chap 5.38
11/10/2003
Associativity vs. Cycle Time
Beware: execution time is the only final measure!
Why is cycle time tied to hit time?
Will clock cycle time increase?
Hill [1988] suggested hit time for 2-way vs. 1-way:
external cache +10%,
internal +2%
suggested big and dumb caches
Effective cycle time of associativity: Przybylski [ISCA 1988]
COEN6741
Chap 5.39
11/10/2003
Example: Avg. Memory Access Time vs. Miss Rate
Example: assume CCT = 1.10 for 2-way, 1.12 for
4-way, 1.14 for 8-way vs. the CCT of direct mapped

AMAT by cache size and associativity:
Cache Size (KB)   1-way   2-way   4-way   8-way
  1               2.33    2.15    2.07    2.01
  2               1.98    1.86    1.76    1.68
  4               1.72    1.67    1.61    1.53
  8               1.46    1.48    1.47    1.43
 16               1.29    1.32    1.32    1.32
 32               1.20    1.24    1.25    1.27
 64               1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20
(Entries shown in red on the original slide mark cases where A.M.A.T. is not improved by more associativity.)
COEN6741
Chap 5.40
11/10/2003
3. Victim Cache
Fast Hit Time + Low Conflict
=> Victim Cache
How to combine fast hit time
of direct mapped
yet still avoid conflict misses?
Add buffer to place data
discarded from cache
Jouppi [1990]: 4-entry victim
cache removed 20% to 95% of
conflicts for a 4 KB direct
mapped data cache
Used in Alpha, HP machines
[Figure: a victim cache — four fully associative entries (each one cache line of data with its tag and comparator) sitting between the cache's DATA/TAGS arrays and the next lower level in the hierarchy.]
COEN6741
Chap 5.41
11/10/2003
4. Pseudo-Associativity
How to combine fast hit time of Direct Mapped and have the
lower conflict misses of 2-way SA cache?
Divide cache: on a miss, check other half of cache to see if
there, if so have a pseudo-hit (slow hit)
Drawback: CPU pipeline is hard if hit takes 1 or 2 cycles
Better for caches not tied directly to processor (L2)
Used in the MIPS R10000 L2 cache; similar in UltraSPARC
[Figure: access time line — Hit Time, then Pseudo Hit Time, then Miss Penalty.]
COEN6741
Chap 5.42
11/10/2003
5. Hardware Prefetching of Instructions &
Data
E.g., Instruction Prefetching
Alpha 21064 fetches 2 blocks on a miss
Extra block placed in stream buffer
On miss check stream buffer
Works with data blocks too:
Jouppi [1990]: 1 data stream buffer caught 25% of misses from a
4 KB cache; 4 streams caught 43%
Palacharla & Kessler [1994]: for scientific programs, 8
streams caught 50% to 70% of misses from
two 64 KB, 4-way set-associative caches
Prefetching relies on having extra memory
bandwidth that can be used without penalty
COEN6741
Chap 5.43
11/10/2003
6. Software Prefetching Data
Data Prefetch
Load data into register (HP PA-RISC loads)
Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9)
Special prefetching instructions cannot cause faults; a form of
speculative execution
Prefetching comes in two flavors:
Binding prefetch: Requests load directly into register.
Must be correct address and register!
Non-Binding prefetch: Load into cache.
Can be incorrect. Faults?
Issuing Prefetch Instructions takes time
Is the cost of issuing prefetches < the savings from reduced misses?
Wider superscalar issue reduces the difficulty of finding issue bandwidth for prefetches (see the sketch below)
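As a concrete, non-binding example, here is a sketch using the GCC/Clang __builtin_prefetch intrinsic rather than the MIPS IV / PowerPC / SPARC V9 prefetch instructions named above; the prefetch distance of 16 elements is an arbitrary guess that would need tuning:

/* Hedged sketch: non-binding software prefetch in a streaming loop. */
void scale(double *x, long n, double a) {
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&x[i + 16], 1, 1);  /* prefetch for write, low temporal locality */
        x[i] = a * x[i];
    }
}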
COEN6741
Chap 5.44
11/10/2003
7. Compiler Optimizations
McFarling [1989] reduced cache misses by 75%
on an 8KB direct mapped cache with 4 byte blocks, in software
Instructions
Reorder procedures in memory so as to reduce conflict misses
Profiling to look at conflicts (using tools they developed)
Data
Merging Arrays: improve spatial locality by a single array of compound elements
vs. 2 arrays
Loop Interchange: change nesting of loops to access data in the order stored in
memory
Loop Fusion: combine 2 independent loops that have the same looping and some
variables in common
Blocking: improve temporal locality by accessing blocks of data repeatedly
vs. going down whole columns or rows
COEN6741
Chap 5.45
11/10/2003
Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];
/* After: 1 array of structures */
struct merge {
int val;
int key;
};
struct merge merged_array[SIZE];
Reduces conflicts between val & key;
improves spatial locality
COEN6741
Chap 5.46
11/10/2003
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
for (j = 0; j < 100; j = j+1)
for (i = 0; i < 5000; i = i+1)
x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
for (i = 0; i < 5000; i = i+1)
for (j = 0; j < 100; j = j+1)
x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding
through memory every 100 words; improved
spatial locality
COEN6741
Chap 5.47
11/10/2003
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{ a[i][j] = 1/b[i][j] * c[i][j];
d[i][j] = a[i][j] + c[i][j];}
2 misses per access to a & c vs. one miss per
access; improves temporal locality (a and c are reused while still in the cache)
COEN6741
Chap 5.48
11/10/2003
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{r = 0;
for (k = 0; k < N; k = k+1){
r = r + y[i][k]*z[k][j];};
x[i][j] = r;
};
Two Inner Loops:
Read all NxN elements of z[]
Read N elements of 1 row of y[] repeatedly
Write N elements of 1 row of x[]
Capacity misses are a function of N & cache size:
in the worst case 2N^3 + N^2 words are fetched (assuming no conflicts; otherwise it is worse)
Idea: compute on a BxB submatrix that fits in the cache
COEN6741
Chap 5.49
11/10/2003
Blocking Example
/* After */
for (jj = 0; jj < N; jj = jj+B)
for (kk = 0; kk < N; kk = kk+B)
for (i = 0; i < N; i = i+1)
for (j = jj; j < min(jj+B-1,N); j = j+1)
{r = 0;
for (k = kk; k < min(kk+B-1,N); k = k+1) {
r = r + y[i][k]*z[k][j];};
x[i][j] = x[i][j] + r;
};
B is called the Blocking Factor
Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
Conflict misses too?
COEN6741
Chap 5.50
11/10/2003
Reducing Conflict Misses by Blocking
Conflict misses in caches that are not fully associative vs. blocking size
Lam et al. [1991]: a blocking factor of 24 had one fifth the misses
of 48, even though both fit in the cache
[Figure: miss rate (0 to 0.1) vs. blocking factor (0 to 150) for a fully associative cache and a direct mapped cache.]
COEN6741
Chap 5.51
11/10/2003
Summary of Compiler Optimizations to Reduce Cache Misses (by hand)
[Figure: performance improvement (1x to 3x) from merged arrays, loop interchange, loop fusion, and blocking for compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7).]
COEN6741
Chap 5.52
11/10/2003
Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
COEN6741
Chap 5.53
11/10/2003
Reducing Miss Penalty
Four techniques
Read priority over write on miss
Early Restart and Critical Word First on miss
Non-blocking Caches (Hit under Miss, Miss under Miss)
Second Level Cache
Can be applied recursively to Multilevel Caches
Danger is that time to DRAM will grow with multiple levels in
between
First attempts at L2 caches can make things worse, since
increased worst case is worse
Out-of-order CPUs can hide an L1 data cache miss (3-5
clocks), but stall on an L2 miss (40-100 clocks)?
CPUtime = IC x (CPI_Execution + MemAccesses/Inst x MissRate x MissPenalty) x CycleTime
COEN6741
Chap 5.54
11/10/2003
1. Read Priority over Write on Miss
[Figure: the CPU's reads and writes pass through a write buffer on their way to DRAM (or lower-level memory).]
COEN6741
Chap 5.55
11/10/2003
1. Read Priority over Write on Miss
Write-through w/ write buffers => RAW conflicts
with main memory reads on cache misses
If we simply wait for the write buffer to empty, we might increase the read
miss penalty (by 50% on the old MIPS 1000)
Check write buffer contents before read;
if no conflicts, let the memory access continue
Write-back want buffer to hold displaced blocks
Read miss replacing dirty block
Normal: Write dirty block to memory, and then do the read
Instead copy the dirty block to a write buffer, then do the
read, and then do the write
CPU stall less since restarts as soon as do read
COEN6741
Chap 5.56
11/10/2003
2. Early Restart and Critical Word First
Don't wait for the full block to be loaded before
restarting the CPU
Early restart: as soon as the requested word of the block
arrives, send it to the CPU and let the CPU continue execution
Critical Word First: request the missed word first from memory
and send it to the CPU as soon as it arrives; let the CPU continue
execution while filling the rest of the words in the block. Also
called wrapped fetch and requested word first
Generally useful only with large blocks
Spatial locality => we tend to want the next sequential
word anyway, so it is not clear how much early restart helps
COEN6741
Chap 5.57
11/10/2003
3. Non-blocking Caches
Non-blocking cache or lockup-free cache allow data
cache to continue to supply cache hits during a miss
requires F/E bits on registers or out-of-order execution
requires multi-bank memories
hit under miss reduces the effective miss penalty
by working during the miss vs. ignoring CPU requests
hit under multiple miss or miss under miss may
further lower the effective miss penalty by
overlapping multiple misses
Significantly increases the complexity of the cache controller as
there can be multiple outstanding memory accesses
Requires multiple memory banks (otherwise it cannot be supported)
Pentium Pro allows 4 outstanding memory misses
COEN6741
Chap 5.58
11/10/2003
Value of Hit Under Miss for SPEC
FP programs on average: AMAT= 0.68 -> 0.52 -> 0.34 -> 0.26
Int programs on average: AMAT= 0.24 -> 0.20 -> 0.19 -> 0.19
8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss
Hit under i Misses
[Figure: ratio of AMAT (0 to 2, relative to a blocking cache) for hit-under-1-miss (0->1), hit-under-2-misses (1->2), and hit-under-64-misses (2->64) across the SPEC92 benchmarks — integer: eqntott, espresso, xlisp, compress; floating point: mdljsp2, ear, fpppp, tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6, ora.]
COEN6741
Chap 5.59
11/10/2003
4. Add a Second-level Cache
L2 Equations:
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)

Definitions:
Local miss rate: misses in this cache divided by the total number of
memory accesses to this cache (Miss Rate_L2)
Global miss rate: misses in this cache divided by the total number of
memory accesses generated by the CPU
Global miss rate is what matters
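A small C sketch of these equations with made-up numbers (1-cycle L1 hit, 10-cycle L2 hit, 100-cycle L2 miss penalty, 5% L1 miss rate, 40% local L2 miss rate):

/* Hedged sketch: two-level AMAT and global L2 miss rate (illustrative values). */
#include <stdio.h>

int main(void) {
    double hit_l1 = 1, hit_l2 = 10, penalty_l2 = 100;   /* cycles */
    double mr_l1 = 0.05, local_mr_l2 = 0.40;            /* miss rates */

    double miss_penalty_l1 = hit_l2 + local_mr_l2 * penalty_l2;
    double amat            = hit_l1 + mr_l1 * miss_penalty_l1;   /* = 3.5 cycles */
    double global_mr_l2    = mr_l1 * local_mr_l2;       /* L2 misses per CPU access */

    printf("AMAT = %.2f cycles, global L2 miss rate = %.1f%%\n",
           amat, 100 * global_mr_l2);
    return 0;
}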
COEN6741
Chap 5.60
11/10/2003
Comparing Local and Global Miss Rates
32 KByte 1st-level cache;
increasing 2nd-level cache
Global miss rate is close to the
single-level cache rate
provided L2 >> L1
Don't use the local miss rate
L2 is not tied to the CPU clock
cycle!
Cost & A.M.A.T.
Generally fast hit times
and fewer misses
Since hits are few, target
miss reduction
[Figure: local and global miss rates vs. L2 cache size, plotted on linear and log scales.]
COEN6741
Chap 5.61
11/10/2003
Reducing Misses:
Which apply to L2 Cache?
Reducing Miss Rate
1. Reduce Misses via Larger Block Size
2. Reduce Conflict Misses via Higher Associativity
3. Reducing Conflict Misses via Victim Cache
4. Reducing Conflict Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Capacity/Conf. Misses by Compiler
Optimizations
COEN6741
Chap 5.62
11/10/2003
L2 Cache Block Size & A.M.A.T.
32KB L1, 8 byte path to memory

Block size (bytes):   16     32     64     128    256    512
Relative CPU time:    1.36   1.28   1.27   1.34   1.54   1.95
COEN6741
Chap 5.63
11/10/2003
Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
COEN6741
Chap 5.64
11/10/2003
1. Small and Simple Caches
Why does the Alpha 21164 have 8KB instruction and
8KB data caches plus a 96KB second-level cache?
A small data cache keeps the clock rate high
Direct mapped, on chip
COEN6741
Chap 5.65
11/10/2003
2. Avoiding Address Translation
Send the virtual address to the cache? Called a Virtually
Addressed Cache or just Virtual Cache, vs. a Physical
Cache
Every time the process is switched, the cache logically must be flushed;
otherwise we get false hits
Cost is the time to flush + compulsory misses from the empty cache
Dealing with aliases (sometimes called synonyms):
two different virtual addresses map to the same physical address
I/O must interact with the cache, so it needs virtual addresses
Solution to aliases:
HW guarantees every cache block has a unique physical address
SW guarantee: the lower n bits must have the same address;
as long as they cover the index field & the cache is direct mapped, blocks must be
unique; called page coloring
Solution to cache flush:
Add a process identifier tag that identifies the process as well as the
address within the process: can't get a hit if the wrong process
COEN6741
Chap 5.66
11/10/2003
Virtually Addressed Caches
[Figure: three organizations. (1) Conventional: CPU -> TB (VA to PA) -> physically addressed cache ($) -> MEM. (2) Virtually addressed cache: CPU -> $ indexed and tagged with the VA, with the TB consulted only on a miss (translate only on miss; synonym problem). (3) Overlapped access: the CPU accesses the $ and TB in parallel; the $ may keep PA tags (overlapping $ access with VA translation requires the $ index to remain invariant across translation) or VA tags backed by a physically addressed L2 $.]
COEN6741
Chap 5.67
11/10/2003
3. Pipelined Writes
Pipeline the tag check and the cache update as separate
stages: the current write does its tag check while the previous write
updates the cache
Only STORES occupy this pipeline; it is empty during a miss
   Store r2, (r1)    check r1
   Add               --
   Sub               --
   Store r4, (r3)    M[r1] <- r2 & check r3
Delayed write buffer; must be checked on reads;
either complete the write or read from the buffer
COEN6741
Chap 5.68
11/10/2003
Case Study: MIPS R4000
8 Stage Pipeline:
IF: first half of instruction fetch; PC selection happens
here, as well as initiation of the instruction cache access.
IS: second half of the instruction cache access.
RF: instruction decode and register fetch, hazard checking, and
instruction cache hit detection.
EX: execution, which includes effective address calculation, ALU
operation, and branch target computation and condition
evaluation.
DF: data fetch, first half of the data cache access.
DS: second half of the data cache access.
TC: tag check, to determine whether the data cache access hit.
WB: write back for loads and register-register operations.
What is impact on Load delay?
Need 2 instructions between a load and its use!
COEN6741
Chap 5.69
11/10/2003
Case Study: MIPS R4000
[Figure: R4000 pipeline diagrams with successive instructions overlapped through IF IS RF EX DF DS TC WB. A load's data is not available until the end of DS, giving a TWO cycle load latency. Branch conditions are evaluated during EX, giving a THREE cycle branch latency (the delay slot plus two stalls); a branch-likely instruction cancels the delay slot if the branch is not taken.]
COEN6741
Chap 5.70
11/10/2003
R4000 Performance
Not ideal CPI of 1:
Load stalls (1 or 2 clock cycles)
Branch stalls (2 cycles + unfilled slots)
FP result stalls: RAW data hazard (latency)
FP structural stalls: Not enough FP hardware (parallelism)
[Figure: pipeline CPI (0 to 4.5) for eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, and tomcatv, broken into base CPI, load stalls, branch stalls, FP result stalls, and FP structural stalls.]
COEN6741
Chap 5.71
11/10/2003
What is the Impact of What You've Learned About Caches?
1960-1985: Speed
= f(no. operations)
1990:
Pipelined
Execution &
Fast Clock Rate
Out-of-Order
Execution
Superscalar
Instruction Issue
1998: Speed =
f(non-cached memory accesses)
What does this mean for
Compilers? Operating Systems? Algorithms?
Data Structures?
[Figure: the processor-DRAM performance gap chart again — CPU vs. DRAM performance, 1980-2000, log scale, with the processor-memory performance gap growing about 50% per year.]
COEN6741
Chap 5.72
11/10/2003
Cache Optimization Summary
Technique                           MR   MP   HT   Complexity
Larger Block Size                   +              0
Higher Associativity                +              1
Victim Caches                       +              2
Pseudo-Associative Caches           +              2
HW Prefetching of Instr/Data        +              2
Compiler Controlled Prefetching     +              3
Compiler Reduce Misses              +              0
Priority to Read Misses                  +         1
Early Restart & Critical Word 1st        +         2
Non-Blocking Caches                      +         3
Second Level Caches                      +         2
Better memory system                     +         3
Small & Simple Caches                         +    0
Avoiding Address Translation                  +    2
Pipelining Caches                             +    2
(MR = miss rate, MP = miss penalty, HT = hit time; "+" marks the quantity the technique improves.)
COEN6741
Chap 5.73
11/10/2003
Cache Cross Cutting Issues
Superscalar CPU & Number Cache Ports
must match: number memory
accesses/cycle?
Speculative Execution and non-faulting
option on memory/TLB
Parallel execution vs. cache locality
Want far separation to find independent operations vs.
want reuse of data accesses to avoid misses
I/O and consistency of data between cache
and memory
Caches => multiple copies of data
Consistency by HW or by SW?
Where connect I/O to computer?
COEN6741
Chap 5.74
11/10/2003
Alpha Memory Performance: Miss Rates of SPEC92
8 KB I$, 8 KB D$, 2 MB L2
[Figure: miss rates (log scale, 0.01% to 100%) of the I$, D$, and L2 for AlphaSort, Espresso, Sc, Mdljsp2, Ear, Alvinn, Mdljp2, and Nasa7. Annotated points from the chart: I$ miss = 2%, D$ miss = 13%, L2 miss = 0.6%; I$ miss = 1%, D$ miss = 21%, L2 miss = 0.3%; I$ miss = 6%, D$ miss = 32%, L2 miss = 10%.]
COEN6741
Chap 5.75
11/10/2003
Alpha CPI Components
Instruction stall: branch mispredict; data cache; instruction cache; L2$
Other: compute + register conflicts, structural conflicts
[Figure: CPI (0 to 5) for AlphaSort, Espresso, Sc, Mdljsp2, Ear, Alvinn, and Mdljp2, broken into L2, I$, D$, instruction stall, and other components (colors in the original: branch mispredict green, data cache blue, instruction cache yellow, L2$ pink).]
COEN6741
Chap 5.76
11/10/2003
Predicting Cache Performance from Different
Programs (ISA, compiler, ...)
4KB data cache: miss
rate 8%, 12%, or
28%?
1KB instruction cache: miss
rate 0%, 3%, or 10%?
Alpha vs. MIPS
for an 8KB data cache:
17% vs. 10%
Why 2X Alpha vs.
MIPS?
[Figure: miss rate (0% to 35%) vs. cache size (1 to 128 KB) for the data and instruction caches of tomcatv, gcc, and espresso (D: tomcatv, D: gcc, D: espresso, I: gcc, I: espresso, I: tomcatv).]
COEN6741
Chap 5.77
11/10/2003
Main Memory Background
Performance of Main Memory:
Latency: Cache Miss Penalty
Access Time: time between request and word arrives
Cycle Time: time between requests
Bandwidth: I/O & Large Block Miss Penalty (L2)
Main Memory is DRAM: Dynamic Random Access Memory
Dynamic since needs to be refreshed periodically (8 ms, 1% time)
Addresses divided into 2 halves (Memory as a 2D matrix):
RAS or Row Access Strobe
CAS or Column Access Strobe
Cache uses SRAM: Static Random Access Memory
No refresh (6 transistors/bit vs. 1 transistor/bit; area is ~10X)
Address not divided: the full address is presented at once
Size: DRAM/SRAM is about 4-8x
Cost/cycle time: SRAM/DRAM is about 8-16x
COEN6741
Chap 5.78
11/10/2003
Main Memory Deep Background
Out-of-Core, In-Core, Core Dump?
Core memory?
Non-volatile, magnetic
Lost to 4 Kbit DRAM (today using 64Kbit
DRAM)
Access time 750 ns, cycle time 1500-3000 ns
COEN6741
Chap 5.79
11/10/2003
DRAM Logical Organization (4 Mbit)
Square root of the bits per RAS/CAS
[Figure: an 11-bit address (A0-A10) drives the row decoder (word lines) and column decoder of a 2,048 x 2,048 array of storage cells; sense amps & I/O connect the array to the D (data in) and Q (data out) pins.]
COEN6741
Chap 5.80
11/10/2003
DRAM Physical Organization (4 Mbit)
[Figure: 4 Mbit DRAM physical organization — the array is split into blocks (Block 0 ... Block 3 shown), each with its own 9:512 block row decoder; a column address selects within the blocks, and 8 I/Os per side connect the blocks through the address interface to the D and Q pins.]
COEN6741
Chap 5.81
11/10/2003
4 Key DRAM Timing Parameters
tRAC: minimum time from the RAS line falling to valid data output.
Quoted as the speed of a DRAM when you buy it (the number on the purchase sheet).
A typical 4 Mbit DRAM has tRAC = 60 ns.
tRC: minimum time from the start of one row access to the start of the next.
tRC = 110 ns for a 4 Mbit DRAM with a tRAC of 60 ns.
tCAC: minimum time from the CAS line falling to valid data output.
15 ns for a 4 Mbit DRAM with a tRAC of 60 ns.
tPC: minimum time from the start of one column access to the start of the next.
35 ns for a 4 Mbit DRAM with a tRAC of 60 ns.
COEN6741
Chap 5.82
11/10/2003
DRAM Performance
A 60 ns (tRAC) DRAM can
perform a row access only every 110 ns (tRC), and
perform a column access (tCAC) in 15 ns, but the time
between column accesses is at least 35 ns (tPC).
In practice, external address delays and turning
around buses make it 40 to 50 ns
These times do not include the time to drive
the addresses off the microprocessor nor the
memory controller overhead!
COEN6741
Chap 5.83
11/10/2003
DRAM History
DRAMs: capacity +60%/yr, cost -30%/yr
2.5X cells/area, 1.5X die size in 3 years
'98 DRAM fab line costs $2B
DRAM only: density, leakage vs. speed
Rely on increasing no. of computers & memory
per computer (60% market)
SIMM or DIMM is replaceable unit
=> computers use any generation DRAM
Commodity, second source industry
=> high volume, low profit, conservative
Little organization innovation in 20 years
Order of importance: 1) Cost/bit 2) Capacity
First RAMBUS: 10X BW, +30% cost => little impact
COEN6741
Chap 5.85
11/10/2003
DRAM Future: 1 Gbit DRAM
Mitsubishi Samsung
Blocks 512 x 2 Mbit 1024 x 1 Mbit
Clock 200 MHz 250 MHz
Data Pins 64 16
Die Size 24 x 24 mm 31 x 21 mm
Metal Layers 3 4
Technology 0.15 micron 0.16 micron
Wish could do this for Microprocessors!
Main Memory Performance
Simple:
CPU, Cache, Bus,
Memory same width
(32 or 64 bits)
Wide:
CPU/Mux 1 word; Mux/
Cache, Bus, Memory N
words (Alpha: 64 bits &
256 bits; UltraSPARC: 512)
Interleaved:
CPU, Cache, Bus 1 word;
Memory N modules
(4 modules); the example is
word interleaved
COEN6741
Chap 5.86
11/10/2003
Main Memory Performance
Timing model (word size is 32 bits)
1 to send address,
6 access time, 1 to send data
Cache Block is 4 words
Simple M.P. = 4 x (1+6+1) = 32
Wide M.P. = 1 + 6 + 1 = 8
Interleaved M.P. = 1 + 6 + 4x1 = 11
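These three miss penalties follow directly from the timing model; a few lines of C reproduce them:

/* Hedged sketch: miss penalty (in cycles) for a 4-word block under the
 * timing model above: 1 cycle to send the address, 6 for access, 1 per word. */
#include <stdio.h>

int main(void) {
    int addr = 1, access = 6, xfer = 1, words = 4;
    int simple      = words * (addr + access + xfer);   /* 4 x 8 = 32 */
    int wide        = addr + access + xfer;             /* one wide transfer = 8 */
    int interleaved = addr + access + words * xfer;     /* 1 + 6 + 4 = 11 */
    printf("simple=%d wide=%d interleaved=%d\n", simple, wide, interleaved);
    return 0;
}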
COEN6741
Chap 5.87
11/10/2003
Independent Memory Banks
Memory banks for independent accesses
vs. faster sequential accesses
Multiprocessor
I/O
CPU with Hit under n Misses, Non-blocking Cache
Superbank: all memory active on one block
transfer (or Bank)
Bank: portion within a superbank that is word
interleaved (or Subbank)

COEN6741
Chap 5.88
11/10/2003
Independent Memory Banks
How many banks?
number of banks >= number of clock cycles to access a word in a
bank
(for sequential accesses; otherwise we return to the
original bank before it has its next word ready,
as in the vector case)
Increasing DRAM => fewer chips => harder to have
banks
COEN6741
Chap 5.89
11/10/2003
Avoiding Bank Conflicts
Lots of banks
int x[256][512];
for (j = 0; j < 512; j = j+1)
for (i = 0; i < 256; i = i+1)
x[i][j] = 2 * x[i][j];
Even with 128 banks, since 512 is multiple of 128,
conflict on word accesses
SW: loop interchange or declaring array not power of
2 (array padding)
HW: prime number of banks
bank number = address mod number of banks
address within bank = address / number of words in bank
but a modulo & a divide on every memory access, with a prime number of banks?
Instead: address within bank = address mod number of words in bank
bank number? easy if there are 2^N words per bank
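A hedged sketch of the software fix mentioned above (array padding): padding each row by one word makes the row length 513, which is not a multiple of the 128 banks, so a column walk is spread across banks.

/* Hedged sketch: array padding to avoid bank conflicts. */
#define ROWS 256
#define COLS 512
#define PAD  1

static int x[ROWS][COLS + PAD];   /* row length 513 is not a multiple of 128 banks */

void scale_columns(void) {
    for (int j = 0; j < COLS; j = j + 1)
        for (int i = 0; i < ROWS; i = i + 1)
            x[i][j] = 2 * x[i][j];
}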
COEN6741
Chap 5.90
11/10/2003
Chinese Remainder Theorem
As long as two sets of integers ai and bi follow these rules:
  bi = x mod ai, 0 <= bi < ai, 0 <= x < a0 x a1 x a2 x ...
and ai and aj are co-prime if i != j, then the integer x has only one
solution (unambiguous mapping):
  bank number = b0, number of banks = a0 (= 3 in the example)
  address within bank = b1, number of words in bank = a1 (= 8 in the example)
  N-word address 0 to N-1; prime number of banks; words per bank a power of 2

Fast Bank Number (3 banks, 8 words per bank):
                       Seq. Interleaved      Modulo Interleaved
Bank number:            0    1    2           0    1    2
Address within bank:
  0                     0    1    2           0   16    8
  1                     3    4    5           9    1   17
  2                     6    7    8          18   10    2
  3                     9   10   11           3   19   11
  4                    12   13   14          12    4   20
  5                    15   16   17          21   13    5
  6                    18   19   20           6   22   14
  7                    21   22   23          15    7   23
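A small C sketch that reproduces the table above: with 3 banks and 8 words per bank (co-prime), bank number = x mod 3 and address within bank = x mod 8 give a one-to-one mapping, as the Chinese Remainder Theorem guarantees.

/* Hedged sketch: sequential vs. modulo interleaving for 3 banks of 8 words. */
#include <stdio.h>

int main(void) {
    int banks = 3, words_per_bank = 8;
    for (int x = 0; x < banks * words_per_bank; x++) {
        int seq_bank = x % banks;
        int seq_addr = x / banks;                 /* divide needed */
        int mod_bank = x % banks;                 /* same bank number */
        int mod_addr = x % words_per_bank;        /* mod (a shift/mask if power of 2), no divide */
        printf("word %2d: seq (bank %d, addr %d)  modulo (bank %d, addr %d)\n",
               x, seq_bank, seq_addr, mod_bank, mod_addr);
    }
    return 0;
}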
COEN6741
Chap 5.91
11/10/2003
Fast Memory Systems: DRAM specific
Multiple CAS accesses: several names (page mode)
Extended Data Out (EDO): 30% faster in page mode
New DRAMs to address gap;
what will they cost, will they survive?
RAMBUS: startup company; reinvent DRAM interface
Each chip is a module vs. a slice of memory
Short bus between CPU and chips
Does own refresh
Variable amount of data returned
1 byte / 2 ns (500 MB/s per chip)
Synchronous DRAM: 2 banks on chip, a clock signal to DRAM,
transfer synchronous to system clock (66 - 150 MHz)
Intel claims RAMBUS Direct (16 b wide) is future PC memory
Niche memory or main memory?
e.g., Video RAM for frame buffers, DRAM + fast serial output
COEN6741
Chap 5.92
11/10/2003
DRAM Latency >> BW
More App Bandwidth =>
Cache misses
=> DRAM RAS/CAS
Application BW =>
Lower DRAM Latency
RAMBUS, Synch DRAM
increase BW but higher
latency
EDO DRAM < 5% in PC
[Figure: processor with I$ and D$ and an L2$ connected over a bus to multiple DRAM chips.]
COEN6741
Chap 5.93
11/10/2003
Main Memory Summary
Wider Memory
Interleaved Memory: for sequential or
independent accesses
Avoiding bank conflicts: SW & HW
DRAM specific optimizations: page mode &
Specialty DRAM
DRAM future less rosy?
COEN6741
Chap 5.94
11/10/2003
DRAM Crossroads?
After 20 years of 4X every 3 years, running
into wall? (64Mb - 1 Gb)
How can keep $1B fab lines full if buy fewer
DRAMs per computer?
Will cost/bit keep falling 30%/yr if the 4X/3 yr scaling stops?
What will happen to $40B/yr DRAM industry?
COEN6741
Chap 5.95
11/10/2003
DRAMs per PC over Time

Minimum memory size   '86 1 Mb   '89 4 Mb   '92 16 Mb   '96 64 Mb   '99 256 Mb   '02 1 Gb
  4 MB                   32         8
  8 MB                              16          4
 16 MB                                          8           2
 32 MB                                                      4            1
 64 MB                                                      8            2
128 MB                                                                   4            1
256 MB                                                                   8            2
COEN6741
Chap 5.96
11/10/2003
Virtual Memory
A virtual memory is a memory hierarchy,
usually consisting of at least main memory
and disk, in which the processor issues all
memory references as effective addresses in
a flat address space.
All translations to primary and secondary
addresses are handled transparently, thus
providing the illusion of a flat address
space.
Recall that disk accesses may require
100,000 clock cycles to complete, due to
the slow access time of the disk subsystem.
Basic Issues in VM System Design
size of the information blocks that are transferred from
secondary to main storage (M)
if a block of information is brought into M and M is full, then some region
of M must be released to make room for the new block -->
replacement policy
which region of M is to hold the new block --> placement policy
a missing item is fetched from secondary memory only on the occurrence
of a fault --> demand load policy
Paging Organization
The virtual and physical address spaces are partitioned into blocks of equal size:
page frames (in physical memory) and pages (in virtual memory).
[Figure: registers - cache - memory - disk; pages move between memory frames and disk.]
Addressing and Accessing a
Two-Level Hierarchy
The computer system, HW or SW, must perform any address translation that is required.
Two ways of forming the address: segmentation and paging.
Paging is more common. Sometimes the two are used together, one on top of the other.
More about address translation and paging next ...
[Figure: the memory management unit (MMU) takes a system address and applies the translation function (mapping tables, permissions, etc.); on a hit it produces the block and word address in the primary level, and on a miss it produces an address in the secondary level.]
Paging vs. Segmentation
[Figure: in paging, the block field of the system address indexes a lookup table that yields the physical block, and the word (offset) field passes through unchanged to form the primary address. In segmentation, the lookup table yields a base address that is added (+) to the word offset to form the primary address.]
Paging Organization
[Figure: physical memory divided into eight 1K page frames (frame 0 at P.A. 0, frame 1 at 1024, ..., frame 7 at 7168); virtual memory divided into thirty-two 1K pages (page 0 at V.A. 0, page 1 at 1024, ..., page 31 at 31744); the address translation MAP maps pages to frames. The page is the unit of mapping and also the unit of transfer from virtual to physical memory.]
Virtual Memory Address Mapping
[Figure: the virtual address (V.A.) splits into a page number and a 10-bit displacement. The page number, added to the Page Table Base Register, indexes the page table (which is itself located in physical memory); each entry holds a valid bit (V), access rights, and the physical page address (PA). The PA is combined with the displacement (actually, concatenation is more likely than addition) to form the physical memory address.]
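A minimal C sketch of this mapping, assuming a one-level page table, 1 KB pages (10-bit displacement), and concatenation to form the physical address; all of the names and the fault handling are illustrative:

/* Hedged sketch: one-level page-table lookup with 1 KB pages. */
#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS 10                       /* 10-bit displacement => 1 KB pages */
#define NUM_PAGES 32                       /* 32 KB virtual address space */

struct pte { bool valid; uint8_t access_rights; uint32_t frame; };
static struct pte page_table[NUM_PAGES];   /* base register would hold &page_table[0] */

/* Translate a virtual address; returns false on a page fault. */
bool translate(uint32_t va, uint32_t *pa) {
    uint32_t vpn  = va >> PAGE_BITS;       /* virtual page number indexes the table */
    uint32_t disp = va & ((1u << PAGE_BITS) - 1);
    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return false;                      /* page fault: OS brings the page in from disk */
    *pa = (page_table[vpn].frame << PAGE_BITS) | disp;   /* concatenate frame and displacement */
    return true;
}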
Segmentation
Organization
Notice that each segment's virtual address starts at 0, different from its
physical address.
Repeated movement of segments into and out of physical memory will
result in gaps between segments. This is called external fragmentation.
Compaction routines must occasionally be run to remove these fragments.
[Figure: main memory holding Segment 1, Segment 5, Segment 6, Segment 9, and Segment 3 with gaps between them; physical memory addresses on one side, virtual memory addresses (each segment running from 0000...0 to FFF...) on the other.]
COEN6741
Chap 5.102
11/10/2003
Translation Lookaside Buffer
A way to speed up translation is to use a special cache
of recently used page table entries -- this has many
names, but the most frequently used is Translation
Lookaside Buffer or TLB
Virtual Address Physical Address Dirty Ref Valid Access
Really just a cache on the page table mappings
TLB access time comparable to cache access time
(much less than main memory access time)
COEN6741
Chap 5.103
11/10/2003
Translation Lookaside Buffers
Translation with a TLB:
[Figure: the CPU sends a VA to the TLB lookup; on a TLB hit the PA goes to the cache, and on a cache miss on to main memory; on a TLB miss the full translation is performed and the TLB is refilled. Relative access times in the figure: main memory 20t, cache t, TLB 1/2 t.]
Just like any other cache, the TLB can be organized
as fully associative, set associative, or direct mapped
TLBs are usually small, typically not more than 128 -
256 entries even on high end machines. This permits
fully Associative lookup on these machines.
Most mid-range machines use small n-way set associative
organizations.
COEN6741
Chap 5.104
11/10/2003
Address Translation and Cache
Page table is a large data structure in memory
Two memory accesses for every load, store, or
instruction fetch!!!
Virtually addressed cache?
synonym problem
Cache the address translations?
If index is physical part of address, can start
tag access in parallel with translation so that can
compare to physical tag
[Figure: CPU -> translation (VA to PA) -> cache; on a cache hit, data returns to the CPU; on a miss, the PA goes to main memory.]
Overlapped Cache & TLB Access
[Figure: the 20-bit virtual page number goes to the TLB (associative lookup) while the 12-bit displacement is used at the same time: its upper 10 bits index the 1K-entry cache and the low 2 bits (00) select within the 4-byte word. The TLB outputs the 32-bit PA and Hit/Miss; the cache outputs data, its PA tag, and Hit/Miss; the cache tag is compared (=) with the TLB's PA.]
IF cache hit AND (cache tag = PA) THEN deliver data to CPU
ELSE IF [cache miss OR (cache tag != PA)] AND TLB hit THEN
    access memory with the PA from the TLB
ELSE do standard VA translation
COEN6741
Chap 5.106
11/10/2003
Memory Hierarchy (Summary)
The memory hierarchy: from fast and expensive to
slow and cheap: Registers -> Cache -> Main Memory -> Disk
At first, consider just two adjacent levels in the
hierarchy
The cache: high speed and expensive
Direct mapped, associative, set associative
Virtual memory makes the hierarchy transparent
Translate the address from the CPU's logical address to the physical
address where the information is actually stored
The TLB helps in speeding up the address translation process
Memory management: how to move information back
and forth
COEN6741
Chap 5.107
11/10/2003
Practical Memory Hierarchy
Issue is NOT inventing new mechanisms
Issue is taste in selecting between many
alternatives in putting together a memory
hierarchy that fit well together
e.g., L1 Data cache write through, L2 Write back
e.g., L1 small for fast hit time/clock cycle,
e.g., L2 big enough to avoid going to DRAM?
COEN6741
Chap 5.108
11/10/2003
TLB and Virtual Memory
Caches, TLBs, Virtual Memory all understood by
examining how they deal with 4 questions: 1)
Where can block be placed? 2) How is block found?
3) What block is replaced on a miss? 4) How are
writes handled?
Page tables map virtual address to physical address
TLBs make virtual memory practical
Locality in data => locality in addresses of data,
temporal and spatial
TLB misses are significant in processor performance
funny times, as most systems can't access all of the 2nd-level cache
without TLB misses!
Today VM allows many processes to share single
memory without having to swap all processes to
disk; today VM protection is more important than
memory hierarchy
Alpha 21064
Separate Instr & Data
TLB & Caches
TLBs fully associative
TLB updates in SW
(Priv Arch Libr)
Caches 8KB direct
mapped, write thru
Critical 8 bytes first
Prefetch instr. stream
buffer
2 MB L2 cache, direct
mapped, WB (off-chip)
256 bit path to main
memory, 4 x 64-bit
modules
Victim Buffer: to give
read priority over write
4 entry write buffer
between D$ & L2$
[Figure: Alpha 21064 memory system — separate instruction and data paths with a stream buffer on the instruction side, a victim buffer, and a write buffer between the D$ and the L2$.]