
Memory System Design

Bharadwaj Amrutur
ECE Dept.
IISc Bangalore.

Outline

References:

Computer Architecture: A Quantitative Approach, Hennessy & Patterson

Topics

Memory hierarchy

Cache

Main Memory

Disk

Virtual memory

Power considerations

Multi-core considerations

View from the processor


[Figure: processor connected to memory. Clk and MemOp control the transfer; Address and WriteData go from processor to memory, ReadData comes back.]

Memory Operations (MemOp)


DLX:
  Load
  Store

Other RISC processors additionally support:
  Prefetch
  Load/Store coprocessor
  Cache flush
  Synchronization

Address is 32 bits or 64 bits (modern processors)

Data bus width is 64 bits (accesses can be in bytes, 32 bits, or 64 bits)
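A minimal behavioral sketch of this interface in Python (the string MemOp encoding, little-endian byte order, and memory size are illustrative assumptions, not from the slides):

```python
# Minimal behavioral model of the processor-memory interface:
# MemOp selects the operation, Address picks the location,
# WriteData/ReadData carry the payload on the data bus.

LOAD, STORE = "load", "store"

class Memory:
    def __init__(self, size_bytes):
        self.data = bytearray(size_bytes)

    def access(self, mem_op, address, write_data=None, width=8):
        """One MemOp; width is the access size in bytes (1, 4, or 8)."""
        if mem_op == LOAD:
            return int.from_bytes(self.data[address:address + width], "little")
        if mem_op == STORE:
            self.data[address:address + width] = write_data.to_bytes(width, "little")

mem = Memory(1 << 16)                              # 64 KB toy memory
mem.access(STORE, 0x100, 0xDEADBEEF, width=4)
print(hex(mem.access(LOAD, 0x100, width=4)))       # 0xdeadbeef
```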

The Gap

[Figure: processor vs. DRAM performance, 1980-2000, log scale (1 to 1000). Processor performance ("Moore's Law") improves ~60%/year, DRAM only ~7%/year, so the processor-memory performance gap grows ~50%/year. From Kubiatowicz/UCB.]

Closing the gap

Use fast, high-speed RAMs close to the processor

Caches

Take up ~90% of the transistors in the processor chip!

[Figure: memory hierarchy. Processor registers, L1 $, L2 $, main memory (DRAM), disk. Levels get bigger moving away from the processor, and faster moving toward it.]

Memory Hierarchy Characteristics


Level                Integration    Distance     Size                 Latency           Bandwidth
Registers            Chip           < 1 mm       16-128 64-bit regs   < 1 cycle         ~1000 Gb/s
L1 $                 Chip           Few mm       4 KB - 32 KB         1 cycle           ~400 Gb/s
L2 $                 Chip/Package   Few cm       1 MB - 8 MB          5-10 cycles       ~200 Gb/s
Main memory (DRAM)   PCB            Few inches   128 MB - 4 GB        40-100 cycles     ~50 Gb/s
Disk                 Box            Many inches  80 GB - few TB       1000s of cycles   ~1 Gb/s
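For rough comparisons, the table's numbers can be carried in code; a small sketch using the upper ends of the latency ranges (rounded, and taking registers as 1 cycle):

```python
# Memory hierarchy parameters, transcribed from the table above
# (latency in cycles, peak bandwidth in Gb/s).
hierarchy = [
    ("registers", 1,     1000),
    ("L1 $",      1,     400),
    ("L2 $",      10,    200),
    ("DRAM",      100,   50),
    ("disk",      10000, 1),
]

# Each step away from the processor costs roughly an order of
# magnitude in latency while bandwidth keeps dropping.
for (up, lat_up, _), (down, lat_down, bw) in zip(hierarchy, hierarchy[1:]):
    print(f"{down}: ~{lat_down // lat_up}x the latency of {up}, ~{bw} Gb/s")
```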

Memory Hierarchy

Exercise

Find Power/Mbps/bit for each layer of the memory hierarchy

Plot Power/Mbps versus Bits, as well as versus Bits^0.5

Which is better?

Register File
[Figure: 32-entry register file. A read decoder driven by the read address selects read word lines RWL0-RWL31; a write decoder driven by the write address selects write word lines WWL0-WWL31. Write bitlines W0-W63 and read bitlines R0-R63 span the 64-bit-wide array.]

Register File

Can add more ports

  Each port adds one switch and bitline per cell, and one decoder

  Power cost

Wire dominated

  A register file cell can be 10x bigger than an SRAM cell (used in L1/L2 caches)

  Hence small in size

Register files are explicitly visible to the processor

  Unlike caches

Access latency can be half a clock cycle, to allow reading and execution (or execution and writeback) in the same cycle

Easy to scale up word width (64/128/256/512)
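A behavioral sketch of a register file with two read ports and one write port (a common configuration for a simple RISC pipeline; the port count and the R0-reads-as-zero convention, as in DLX, are assumptions here):

```python
# 32-entry, 64-bit register file with two read ports and one write port.
class RegisterFile:
    def __init__(self, num_regs=32, width=64):
        self.regs = [0] * num_regs
        self.mask = (1 << width) - 1

    def read(self, addr_a, addr_b):
        # Two read ports: both source operands are fetched together.
        return self.regs[addr_a], self.regs[addr_b]

    def write(self, addr, data):
        if addr != 0:                      # R0 stays hardwired to zero
            self.regs[addr] = data & self.mask

rf = RegisterFile()
rf.write(5, 0x1234)
print(rf.read(5, 0))                       # (4660, 0)
```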

Cache concept

Small, fast storage to exploit

  Spatial and temporal locality

Found in other places: file caches, name caches, etc.

Consider the memory as a sequence of blocks

  Also known as lines

  A block can contain multiple bytes

The cache stores a subset of the blocks from main memory

The cache is searched first to satisfy a memory access request

  A hit returns fast; a miss incurs a penalty

[Figure: main memory blocks 0-15; cache blocks 0-3. Main memory blocks are temporarily stored in the cache.]
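A toy model of the concept, treating the cache as a small dict keyed by block address (the 4-block capacity matches the figure; the eviction choice is an arbitrary placeholder):

```python
NUM_CACHE_BLOCKS = 4
cache = {}                                 # block_address -> block contents

def access(block_address):
    if block_address in cache:
        return "hit"                       # found in the cache: fast
    if len(cache) >= NUM_CACHE_BLOCKS:     # cache full: make room
        cache.pop(next(iter(cache)))       # arbitrary eviction for now
    cache[block_address] = f"block {block_address}"
    return "miss"                          # fetched from main memory: penalty

for addr in [0, 1, 0, 5, 9, 13, 0]:
    print(addr, access(addr))
```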

Average Memory Access Time


Program execution time is given as:

  CPU time = IC × (ALUops/Instr × CPI_ALUops + MemAccess/Instr × AMAT) × CycleTime

Average Memory Access Time (AMAT) is given as:

  AMAT = HitTime + MissRate × MissPenalty

HitTime and MissPenalty are in number of clock cycles

IC is the instruction count of the program

To reduce AMAT, reduce HitTime, MissRate, and MissPenalty

  HitTime is usually the lowest possible: 1 cycle

  MissPenalty is a function of the upper levels of the memory hierarchy

  MissRate is a function of cache size & associativity, which also impact CycleTime: hence an optimization problem

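The two equations as a quick numeric sketch in Python (all input numbers below are made-up illustrative values, not from the slides):

```python
# AMAT = HitTime + MissRate * MissPenalty   (clock cycles)
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# CPU time = IC * (ALUops/Instr * CPI_ALUops
#                  + MemAccess/Instr * AMAT) * CycleTime
def cpu_time(ic, alu_per_instr, cpi_alu, mem_per_instr, amat_cyc, cycle_ns):
    return ic * (alu_per_instr * cpi_alu + mem_per_instr * amat_cyc) * cycle_ns

a = amat(hit_time=1, miss_rate=0.05, miss_penalty=60)
print(a)                                   # 4.0 cycles
print(cpu_time(1e9, 0.6, 1, 0.4, a, 0.5), "ns")
```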

Exercise

Write the corresponding equation for the energy consumed by a program
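One way to set the exercise up, mirroring the CPU-time equation above (a sketch only; the per-operation energies E_ALUop, E_hit, and E_miss are assumed symbols, not defined in the slides):

```latex
E_{program} = IC \times \left(
    \frac{ALUops}{Instr} \times E_{ALUop}
  + \frac{MemAccess}{Instr} \times E_{MemAccess} \right),
\qquad
E_{MemAccess} = E_{hit} + MissRate \times E_{miss}
```

A fuller version would also add a static-power term proportional to the program's execution time.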

Cache issues

Where should a block be placed in the cache?

How is a block searched for in the cache?

Which block should be replaced on a cache miss?

What to do on a write?



Direct Mapped: Placement


[Figure: main memory blocks 0-15 mapping into cache blocks 0-3.]

The main memory blocks which map to specific cache blocks are:

Cache block 0: memory blocks 0, 4, 8, 12
Cache block 1: memory blocks 1, 5, 9, 13
Cache block 2: memory blocks 2, 6, 10, 14
Cache block 3: memory blocks 3, 7, 11, 15

The formula is:

  CacheBlock = BlockAddress mod CacheSize   (CacheSize is in blocks)
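The placement formula in code, for the figure's 16 memory blocks and 4-block cache:

```python
CACHE_SIZE_BLOCKS = 4                      # cache size, in blocks

for block_address in range(16):
    cache_block = block_address % CACHE_SIZE_BLOCKS
    print(f"memory block {block_address:2d} -> cache block {cache_block}")
# Cache block 0 gets memory blocks 0, 4, 8, 12; block 1 gets 1, 5, 9, 13; etc.
```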

Direct Mapped: Search


[Figure: the 32-bit address (bits 31 down to 0) is split into Tag | CacheIndex | ByteSel. The cache holds a Cache Tag and Cache Data entry per block (0-3); main memory blocks 0-15 map to cache blocks as on the previous slide.]

Direct Mapped: Search

[Figure: the CacheIndex field drives a decoder that selects one row of the Tag and Data arrays; a comparator (=) checks the stored tag against the address Tag field to produce Hit/Miss.]

What is missing?

Direct Mapped: Search

[Figure: the same datapath with a Valid bit added per entry. Hit/Miss now requires the selected entry to be valid and its stored tag to match the address Tag field.]
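A sketch of the complete lookup in Python (the field widths below, 4-byte blocks and 4 cache entries, are assumptions chosen to match the running example):

```python
# Split the address into Tag | CacheIndex | ByteSel, index the arrays,
# and declare a hit only if the entry is valid and the tags match.
BYTE_SEL_BITS = 2                          # 4-byte blocks
INDEX_BITS = 2                             # 4 cache entries

valid = [False] * (1 << INDEX_BITS)
tags = [0] * (1 << INDEX_BITS)
data = [None] * (1 << INDEX_BITS)

def lookup(address):
    index = (address >> BYTE_SEL_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (BYTE_SEL_BITS + INDEX_BITS)
    hit = valid[index] and tags[index] == tag
    return hit, data[index] if hit else None

# Fill the entry for address 0x40, then probe two addresses
# that share a cache index but differ in tag.
valid[0], tags[0], data[0] = True, 0x40 >> 4, "blk@0x40"
print(lookup(0x40))                        # (True, 'blk@0x40')
print(lookup(0x80))                        # (False, None): tag mismatch
```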


Block Placement: 2-way Associative


[Figure: main memory blocks 0-15; cache blocks 0-3 organized as Set 0 and Set 1.]

The main memory blocks which map to specific cache sets are:

Set 0: memory blocks 0, 2, 4, 6, 8, 10, 12, 14
Set 1: memory blocks 1, 3, 5, 7, 9, 11, 13, 15

Within each set, a block can be in either of the two locations.

The formula is:

  SetNumber = BlockAddress mod NumSets
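The set-mapping formula in code, for the figure's 16 memory blocks and 2 sets:

```python
NUM_SETS = 2

for block_address in range(16):
    set_number = block_address % NUM_SETS
    print(f"memory block {block_address:2d} -> set {set_number} (either way)")
# Even-numbered blocks map to set 0, odd-numbered blocks to set 1.
```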

2-Way Associative: Search


[Figure: the address (bits 31 down to 0) splits into Tag | CacheIndex | ByteSel. There are two ways, each with its own Valid bits, Tag array, decoder, and Data array; each way's comparator produces Hit/Miss_Set0 or Hit/Miss_Set1, which enables a tristate driver that puts that way's data on a shared output bus.]

Exercises:
a) Complete the wiring
b) How do you generate the final Hit/Miss signal?
c) Extend the design to a fully associative cache
d) What happens to MissRate with associativity?
e) What happens to MissRate with size?
f) What happens to cycle time with associativity and size?
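A behavioral sketch of the two-way lookup (sizes are illustrative assumptions; note how the per-way hit signals combine, which hints at exercise (b)):

```python
# Both ways are probed in parallel; each produces its own hit signal,
# which gates (tristate-style) whether that way drives the data out.
BYTE_SEL_BITS, INDEX_BITS, NUM_WAYS = 2, 1, 2    # 2 sets of 2 ways

ways = [{"valid": [False] * (1 << INDEX_BITS),
         "tag":   [0] * (1 << INDEX_BITS),
         "data":  [None] * (1 << INDEX_BITS)} for _ in range(NUM_WAYS)]

def lookup(address):
    index = (address >> BYTE_SEL_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (BYTE_SEL_BITS + INDEX_BITS)
    hits = [w["valid"][index] and w["tag"][index] == tag for w in ways]
    hit = any(hits)                              # final Hit/Miss: OR of per-way hits
    data = next((w["data"][index] for w, h in zip(ways, hits) if h), None)
    return hit, data

ways[1]["valid"][0] = True
ways[1]["tag"][0], ways[1]["data"][0] = 5, "blk"
print(lookup(5 << 3))                            # tag=5, index=0 -> (True, 'blk')
print(lookup(7 << 3))                            # tag=7, index=0 -> (False, None)
```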
