
Memory System Design

Bharadwaj Amrutur
ECE Dept.
IISc Bangalore.

Outline

References:

Computer Architecture: A Quantitative Approach, Hennessy & Patterson

Topics

Memory hierarchy

Cache

Main Memory

Disk

Virtual memory

Power considerations

Multi-core considerations

View from the processor


[Figure: processor connected to memory. Clk and MemOp control the transfer; Address and WriteData go from processor to memory, ReadData comes back.]

Memory Operations (MemOp)


DLX:
  Load
  Store

Other RISC processors additionally support:
  Prefetch
  Load/Store coprocessor
  Cache flush
  Synchronization

Address is 32 bits or 64 bits (modern processors)

Data bus width is 64 bits (accesses can be in bytes, 32 bits, or 64 bits)
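A minimal behavioral sketch of this interface in Python (the string MemOp encoding, little-endian byte order, and memory size are illustrative assumptions, not from the slides):

```python
# Minimal behavioral model of the processor-memory interface:
# MemOp selects the operation, Address picks the location,
# WriteData/ReadData carry the payload on the data bus.

LOAD, STORE = "load", "store"

class Memory:
    def __init__(self, size_bytes):
        self.data = bytearray(size_bytes)

    def access(self, mem_op, address, write_data=None, width=8):
        """One MemOp; width is the access size in bytes (1, 4, or 8)."""
        if mem_op == LOAD:
            return int.from_bytes(self.data[address:address + width], "little")
        if mem_op == STORE:
            self.data[address:address + width] = write_data.to_bytes(width, "little")

mem = Memory(1 << 16)                              # 64 KB toy memory
mem.access(STORE, 0x100, 0xDEADBEEF, width=4)
print(hex(mem.access(LOAD, 0x100, width=4)))       # 0xdeadbeef
```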

The Gap

[Figure: processor vs. DRAM performance, 1980-2000, log scale (1 to 1000). Processor performance ("Moore's Law") improves ~60%/year, DRAM only ~7%/year, so the processor-memory performance gap grows ~50%/year. From Kubiatowicz/UCB.]

Closing the gap

Use fast, high-speed RAMs close to the processor

Caches

Take up ~90% of the transistors in the processor chip!

[Figure: memory hierarchy. Processor registers, L1 $, L2 $, main memory (DRAM), disk. Levels get bigger moving away from the processor, and faster moving toward it.]

Memory Hierarchy Characteristics


Level                Integration    Distance     Size                 Latency           Bandwidth
Registers            Chip           < 1 mm       16-128 64-bit regs   < 1 cycle         ~1000 Gb/s
L1 $                 Chip           Few mm       4 KB - 32 KB         1 cycle           ~400 Gb/s
L2 $                 Chip/Package   Few cm       1 MB - 8 MB          5-10 cycles       ~200 Gb/s
Main memory (DRAM)   PCB            Few inches   128 MB - 4 GB        40-100 cycles     ~50 Gb/s
Disk                 Box            Many inches  80 GB - few TB       1000s of cycles   ~1 Gb/s
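For rough comparisons, the table's numbers can be carried in code; a small sketch using the upper ends of the latency ranges (rounded, and taking registers as 1 cycle):

```python
# Memory hierarchy parameters, transcribed from the table above
# (latency in cycles, peak bandwidth in Gb/s).
hierarchy = [
    ("registers", 1,     1000),
    ("L1 $",      1,     400),
    ("L2 $",      10,    200),
    ("DRAM",      100,   50),
    ("disk",      10000, 1),
]

# Each step away from the processor costs roughly an order of
# magnitude in latency while bandwidth keeps dropping.
for (up, lat_up, _), (down, lat_down, bw) in zip(hierarchy, hierarchy[1:]):
    print(f"{down}: ~{lat_down // lat_up}x the latency of {up}, ~{bw} Gb/s")
```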

Memory Hierarchy

Exercise

Find Power/Mbps/bit for each layer of the memory hierarchy

Plot Power/Mbps versus Bits, as well as versus Bits^0.5

Which is better?

Register File
[Figure: 32-entry register file. A read decoder driven by the read address selects read word lines RWL0-RWL31; a write decoder driven by the write address selects write word lines WWL0-WWL31. Write bitlines W0-W63 and read bitlines R0-R63 span the 64-bit-wide array.]

Register File

Can add more ports

  Each port adds one switch and bitline per cell, and one decoder

  Power cost

Wire dominated

  A register file cell can be 10x bigger than an SRAM cell (used in L1/L2 caches)

  Hence small in size

Register files are explicitly visible to the processor

  Unlike caches

Access latency can be half a clock cycle, to allow reading and execution (or execution and writeback) in the same cycle

Easy to scale up word width (64/128/256/512)
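A behavioral sketch of a register file with two read ports and one write port (a common configuration for a simple RISC pipeline; the port count and the R0-reads-as-zero convention, as in DLX, are assumptions here):

```python
# 32-entry, 64-bit register file with two read ports and one write port.
class RegisterFile:
    def __init__(self, num_regs=32, width=64):
        self.regs = [0] * num_regs
        self.mask = (1 << width) - 1

    def read(self, addr_a, addr_b):
        # Two read ports: both source operands are fetched together.
        return self.regs[addr_a], self.regs[addr_b]

    def write(self, addr, data):
        if addr != 0:                      # R0 stays hardwired to zero
            self.regs[addr] = data & self.mask

rf = RegisterFile()
rf.write(5, 0x1234)
print(rf.read(5, 0))                       # (4660, 0)
```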

Cache concept

Small, fast storage to exploit

  Spatial and temporal locality

Found in other places: file caches, name caches, etc.

Consider the memory as a sequence of blocks

  Also known as lines

  A block can contain multiple bytes

The cache stores a subset of the blocks from main memory

The cache is searched first to satisfy a memory access request

  A hit returns fast; a miss incurs a penalty

[Figure: main memory blocks 0-15; cache blocks 0-3. Main memory blocks are temporarily stored in the cache.]
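A toy model of the concept, treating the cache as a small dict keyed by block address (the 4-block capacity matches the figure; the eviction choice is an arbitrary placeholder):

```python
NUM_CACHE_BLOCKS = 4
cache = {}                                 # block_address -> block contents

def access(block_address):
    if block_address in cache:
        return "hit"                       # found in the cache: fast
    if len(cache) >= NUM_CACHE_BLOCKS:     # cache full: make room
        cache.pop(next(iter(cache)))       # arbitrary eviction for now
    cache[block_address] = f"block {block_address}"
    return "miss"                          # fetched from main memory: penalty

for addr in [0, 1, 0, 5, 9, 13, 0]:
    print(addr, access(addr))
```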

Average Memory Access Time


Program execution time is given as:

  CPU time = IC × (ALUops/Instr × CPI_ALUops + MemAccess/Instr × AMAT) × CycleTime

Average Memory Access Time (AMAT) is given as:

  AMAT = HitTime + MissRate × MissPenalty

HitTime and MissPenalty are in number of clock cycles

IC is the instruction count of the program

To reduce AMAT, reduce HitTime, MissRate, and MissPenalty

  HitTime is usually the lowest possible: 1 cycle

  MissPenalty is a function of the upper levels of the memory hierarchy

  MissRate is a function of cache size & associativity, which also impact CycleTime: hence an optimization problem

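The two equations as a quick numeric sketch in Python (all input numbers below are made-up illustrative values, not from the slides):

```python
# AMAT = HitTime + MissRate * MissPenalty   (clock cycles)
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# CPU time = IC * (ALUops/Instr * CPI_ALUops
#                  + MemAccess/Instr * AMAT) * CycleTime
def cpu_time(ic, alu_per_instr, cpi_alu, mem_per_instr, amat_cyc, cycle_ns):
    return ic * (alu_per_instr * cpi_alu + mem_per_instr * amat_cyc) * cycle_ns

a = amat(hit_time=1, miss_rate=0.05, miss_penalty=60)
print(a)                                   # 4.0 cycles
print(cpu_time(1e9, 0.6, 1, 0.4, a, 0.5), "ns")
```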

Exercise

Write the corresponding equation for the energy consumed by a program
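One way to set the exercise up, mirroring the CPU-time equation above (a sketch only; the per-operation energies E_ALUop, E_hit, and E_miss are assumed symbols, not defined in the slides):

```latex
E_{program} = IC \times \left(
    \frac{ALUops}{Instr} \times E_{ALUop}
  + \frac{MemAccess}{Instr} \times E_{MemAccess} \right),
\qquad
E_{MemAccess} = E_{hit} + MissRate \times E_{miss}
```

A fuller version would also add a static-power term proportional to the program's execution time.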

Cache issues

Where should a block be placed in the cache?

How is a block searched for in the cache?

Which block should be replaced on a cache miss?

What to do on a write?



Direct Mapped: Placement


[Figure: main memory blocks 0-15 mapping into cache blocks 0-3.]

The main memory blocks which map to specific cache blocks are:

Cache block 0: memory blocks 0, 4, 8, 12
Cache block 1: memory blocks 1, 5, 9, 13
Cache block 2: memory blocks 2, 6, 10, 14
Cache block 3: memory blocks 3, 7, 11, 15

The formula is:

  CacheBlock = BlockAddress mod CacheSize   (CacheSize is in blocks)
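The placement formula in code, for the figure's 16 memory blocks and 4-block cache:

```python
CACHE_SIZE_BLOCKS = 4                      # cache size, in blocks

for block_address in range(16):
    cache_block = block_address % CACHE_SIZE_BLOCKS
    print(f"memory block {block_address:2d} -> cache block {cache_block}")
# Cache block 0 gets memory blocks 0, 4, 8, 12; block 1 gets 1, 5, 9, 13; etc.
```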

Direct Mapped: Search


[Figure: the 32-bit address (bits 31 down to 0) is split into Tag | CacheIndex | ByteSel. The cache holds a Cache Tag and Cache Data entry per block (0-3); main memory blocks 0-15 map to cache blocks as on the previous slide.]

Direct Mapped: Search

[Figure: the CacheIndex field drives a decoder that selects one row of the Tag and Data arrays; a comparator (=) checks the stored tag against the address Tag field to produce Hit/Miss.]

What is missing?

Direct Mapped: Search

[Figure: the same datapath with a Valid bit added per entry. Hit/Miss now requires the selected entry to be valid and its stored tag to match the address Tag field.]
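A sketch of the complete lookup in Python (the field widths below, 4-byte blocks and 4 cache entries, are assumptions chosen to match the running example):

```python
# Split the address into Tag | CacheIndex | ByteSel, index the arrays,
# and declare a hit only if the entry is valid and the tags match.
BYTE_SEL_BITS = 2                          # 4-byte blocks
INDEX_BITS = 2                             # 4 cache entries

valid = [False] * (1 << INDEX_BITS)
tags = [0] * (1 << INDEX_BITS)
data = [None] * (1 << INDEX_BITS)

def lookup(address):
    index = (address >> BYTE_SEL_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (BYTE_SEL_BITS + INDEX_BITS)
    hit = valid[index] and tags[index] == tag
    return hit, data[index] if hit else None

# Fill the entry for address 0x40, then probe two addresses
# that share a cache index but differ in tag.
valid[0], tags[0], data[0] = True, 0x40 >> 4, "blk@0x40"
print(lookup(0x40))                        # (True, 'blk@0x40')
print(lookup(0x80))                        # (False, None): tag mismatch
```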


Block Placement: 2-way Associative


[Figure: main memory blocks 0-15; cache blocks 0-3 organized as Set 0 and Set 1.]

The main memory blocks which map to specific cache sets are:

Set 0: memory blocks 0, 2, 4, 6, 8, 10, 12, 14
Set 1: memory blocks 1, 3, 5, 7, 9, 11, 13, 15

Within each set, a block can be in either of the two locations.

The formula is:

  SetNumber = BlockAddress mod NumSets
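The set-mapping formula in code, for the figure's 16 memory blocks and 2 sets:

```python
NUM_SETS = 2

for block_address in range(16):
    set_number = block_address % NUM_SETS
    print(f"memory block {block_address:2d} -> set {set_number} (either way)")
# Even-numbered blocks map to set 0, odd-numbered blocks to set 1.
```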

2-Way Associative: Search


[Figure: the address (bits 31 down to 0) splits into Tag | CacheIndex | ByteSel. There are two ways, each with its own Valid bits, Tag array, decoder, and Data array; each way's comparator produces Hit/Miss_Set0 or Hit/Miss_Set1, which enables a tristate driver that puts that way's data on a shared output bus.]

Exercises:
a) Complete the wiring
b) How do you generate the final Hit/Miss signal?
c) Extend the design to a fully associative cache
d) What happens to MissRate with associativity?
e) What happens to MissRate with size?
f) What happens to cycle time with associativity and size?
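A behavioral sketch of the two-way lookup (sizes are illustrative assumptions; note how the per-way hit signals combine, which hints at exercise (b)):

```python
# Both ways are probed in parallel; each produces its own hit signal,
# which gates (tristate-style) whether that way drives the data out.
BYTE_SEL_BITS, INDEX_BITS, NUM_WAYS = 2, 1, 2    # 2 sets of 2 ways

ways = [{"valid": [False] * (1 << INDEX_BITS),
         "tag":   [0] * (1 << INDEX_BITS),
         "data":  [None] * (1 << INDEX_BITS)} for _ in range(NUM_WAYS)]

def lookup(address):
    index = (address >> BYTE_SEL_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (BYTE_SEL_BITS + INDEX_BITS)
    hits = [w["valid"][index] and w["tag"][index] == tag for w in ways]
    hit = any(hits)                              # final Hit/Miss: OR of per-way hits
    data = next((w["data"][index] for w, h in zip(ways, hits) if h), None)
    return hit, data

ways[1]["valid"][0] = True
ways[1]["tag"][0], ways[1]["data"][0] = 5, "blk"
print(lookup(5 << 3))                            # tag=5, index=0 -> (True, 'blk')
print(lookup(7 << 3))                            # tag=7, index=0 -> (False, None)
```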
