
Memory Hierarchy

Haresh Dagale
Dept of ESE
Motivation
Memory Technologies
Access Time vs Cost
• SRAM : Levels closer to the CPU
• DRAM : Main memory
• Magnetic Disk : Largest and slowest level

Technology   Access Time   Cost (ratio)
SRAM         5 – 25 ns     100
DRAM         60 – 120 ns   5
Disk         10 – 20 ms    0.1
Memory
▪ Programmer’s dream:
• Unlimited amount of fast memory (..that works at the same speed as the processor..)
▪ Hardware designer’s response:
• Create the illusion of a vast memory that can be accessed without making the
processor wait (..on average..)
▪ How could this illusion be created?
• Programs access a relatively small portion of the address space at any instant of time
- “principle of locality”
▪ Temporal locality
• If an item is referenced, it will tend to be referenced again soon.
▪ Spatial locality
• If an item is referenced, items whose addresses are close by will tend to be
referenced soon.
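The loop below is a minimal C sketch of both kinds of locality at work; the array name and size are arbitrary choices for illustration.

```c
#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];   /* zero-initialized array, contiguous in memory */
    int sum = 0;

    for (int i = 0; i < N; i++) {
        /* Spatial locality: a[i] and a[i+1] sit at adjacent addresses,
         * so one fetched cache block also brings in the next elements. */
        sum += a[i];
        /* Temporal locality: 'sum' and 'i' are touched on every
         * iteration, so they stay in registers or the nearest cache level. */
    }
    printf("%d\n", sum);
    return 0;
}
```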
Memory Hierarchy
• To take advantage of temporal
locality, memory is built as a
hierarchy of levels
• Faster and smaller memory close
to the processor
• Slower and larger (less expensive)
memory below the first level
Memory Organization
• All data is stored at the lowest
level
• A level closer to the processor
is a subset of any level further
away
• Data is copied between two
adjacent levels at a time
• The minimum unit of
information that is
transferred from one level to
another is called a block
• The block size must be larger
than one word
• to take advantage of the spatial
locality
Cache
▪ Intermediary between processor and memory
▪ A standard feature in all modern processors
▪ Most CPU designs use two levels of cache:
▪ “Level 1” or “Primary” cache (also called internal cache when it is implemented on-chip)
• Usually implemented on-chip and runs at the same clock rate as the processor
• In some processors, L1 cache is divided into separate I-cache and D-cache
• The L1 cache varies in size from 2 KB up to 64 KB
▪ “Level 2” or “Secondary” cache (also called external cache when it is implemented off-chip)
• L2 cache is usually implemented separately from the processor using fast static RAM (SRAM)
• Varies in size from 2 KB up to (?) MB
• The communication between this cache and the CPU is usually via a dedicated bus to ease the traffic congestion with other subsystems
▪ Recent trend is to build L2 cache also on-chip and yet another level (L3) off-chip.
Cache
Cache Organization
▪ The cache is divided into slots (or lines), each containing a block of
data and a Tag field.
Cache line bits:
▪ Data field: block of data (multiples of words)
▪ Tag field: the upper portion of the
address,
• bits that are not used as an index for the
cache
• required to identify whether a word in the
cache corresponds to the requested word
▪ Dirty bit: data written to cache but not
to external memory
• Instruction cache lines do not have this bit
because the instruction cache is read-only
▪ Valid bit: indicates the cache line holds valid
data (is not empty or invalidated)
▪ Lock bit: cache line can be accessed but
not replaced
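A minimal C sketch of such a cache line; the block size and field widths are assumptions for illustration, not taken from any particular processor.

```c
#include <stdint.h>

#define BLOCK_BYTES 32            /* assumed block size for this sketch */

struct cache_line {
    uint32_t tag;                 /* upper address bits, compared on lookup  */
    uint8_t  valid : 1;           /* line holds meaningful data              */
    uint8_t  dirty : 1;           /* written in cache, not yet in memory     */
    uint8_t  lock  : 1;           /* line may be read but not replaced       */
    uint8_t  data[BLOCK_BYTES];   /* the cached block (multiple words)       */
};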
Direct-mapped Cache
Cache Associativity
2-way Set Associative Cache
Cache Miss
▪ Types of misses
• Compulsory misses (or cold-start misses)
• Increase the block size?
• Capacity misses
• Increase the cache size (costs additional hardware and address resolution)
• Conflict misses
• Reduce swapping of blocks in and out
▪ Design considerations
• Block size
• Replacement policy
Updating Memory
▪ How to update main memory if cached data is modified?
▪ Write-through
• data is written immediately to the main memory
• causes more traffic on the bus
▪ Write-back (or copy-back)
• data is delayed until block replacement occurs
• complex to implement
Write Through
• Write hit: Data is written to the cache and also to external memory
• Write miss: Data is written only to external memory - the data cache is not changed
Copy Back
• Write hit: Data is written only to the cache
• Write miss: The line of data in external memory is loaded into the cache and the data is written only to the cache
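The toy single-line model below contrasts when memory is updated under the two policies; all names and sizes are invented for this sketch and do not describe any real cache.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum policy { WRITE_THROUGH, WRITE_BACK };

static uint8_t memory[256];          /* backing store                   */
static uint8_t line[16];             /* one 16-byte cache line          */
static bool    line_valid = false;
static bool    line_dirty = false;
static uint8_t line_base  = 0;       /* address of the cached block     */

static bool hit(uint8_t addr) {
    return line_valid && (addr & ~0x0Fu) == line_base;
}

static void store(uint8_t addr, uint8_t value, enum policy p) {
    if (p == WRITE_THROUGH) {
        memory[addr] = value;                 /* always update memory        */
        if (hit(addr))
            line[addr & 0x0F] = value;        /* update cache only on a hit  */
        /* write miss: the cache is left unchanged (no allocation)           */
    } else { /* WRITE_BACK (copy back) */
        if (!hit(addr)) {
            if (line_valid && line_dirty)     /* evict: copy dirty block back */
                memcpy(&memory[line_base], line, sizeof line);
            line_base = addr & ~0x0Fu;        /* allocate the missing block   */
            memcpy(line, &memory[line_base], sizeof line);
            line_valid = true;
        }
        line[addr & 0x0F] = value;            /* write only to the cache      */
        line_dirty = true;                    /* memory update is deferred    */
    }
}

int main(void) {
    store(0x12, 0xAA, WRITE_BACK);
    printf("after copy-back store: cache=%02X memory=%02X\n",
           line[0x2], memory[0x12]);          /* memory is still stale (00)   */
    return 0;
}
```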
Simple FSM for cache controller
States: Idle, Compare Tag, Write Back, Allocate
• Idle → Compare Tag on a valid CPU request
• Compare Tag → Idle on a cache hit (line marked Valid, Dirty set on a write)
• Compare Tag → Allocate on a cache miss when the old block is clean
• Compare Tag → Write Back on a cache miss when the old block is dirty
• Write Back → Allocate after the memory update (via the write buffer)
• Allocate → Compare Tag after the new block is fetched from memory
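A compact C sketch of these transitions, assuming a write-back, write-allocate cache as in the diagram; the state and signal names are invented for illustration.

```c
enum ctrl_state { ST_IDLE, ST_COMPARE_TAG, ST_WRITE_BACK, ST_ALLOCATE };

/* One step of the controller: inputs are 0/1 status signals. */
enum ctrl_state next_state(enum ctrl_state s,
                           int cpu_request,   /* valid CPU request pending   */
                           int hit,           /* tag match and line valid    */
                           int old_dirty,     /* victim block is dirty       */
                           int mem_done)      /* memory finished its access  */
{
    switch (s) {
    case ST_IDLE:
        return cpu_request ? ST_COMPARE_TAG : ST_IDLE;
    case ST_COMPARE_TAG:
        if (hit)           return ST_IDLE;                      /* hit: done  */
        return old_dirty ? ST_WRITE_BACK : ST_ALLOCATE;         /* miss       */
    case ST_WRITE_BACK:
        return mem_done ? ST_ALLOCATE : ST_WRITE_BACK;  /* old block copied   */
    case ST_ALLOCATE:
        return mem_done ? ST_COMPARE_TAG : ST_ALLOCATE; /* new block loaded   */
    }
    return ST_IDLE;
}
```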
Cache Coherency Problem
▪ Main memory is shared among the processors and I/O subsystems
• Individual caches improve performance by storing frequently used data in faster
memory
▪ The view of memory through the cache could be different from the view of
memory through the I/O subsystem
▪ Since all processors share the same address space, it is possible for more
than one processor to cache an address (or data item) at a time
▪ If one processor updates the data item without informing the other
processors, inconsistencies may result and cause incorrect execution
Coherency / Consistency
▪ Coherence and Consistency are two complementary issues, though both
define the behaviour of reads and writes to memory locations
▪ The Coherence model defines what value can be returned by a read
▪ The Consistency model defines when a written value must be seen by a
read
▪ A simple definition of coherency:
• A memory system is said to be coherent if any read of a data item
returns the most recently written value of that data item
Coherence
▪ More formal definition
• A read by processor P to a location X after a write by P to X, with no writes of X by
another processor occurring between the write and the read by P, always returns the
value written by P.
• A read by a processor P to location X after a write by another processor Q to X
returns the written value if the read and write are sufficiently separated in time and
no other writes to X occur between the two accesses.

▪ Writes to same location are serialized
• Two writes to the same location by any two processors are seen in the same order
by all processors
Enforcing Coherence
▪ For correct execution, coherence must be enforced among the caches
▪ Two major factors influence the selection of coherence enforcing strategy:
• Performance
• Implementation cost
▪ Four primary design issues are:
• Coherence detection strategy
• How to detect incoherent caches?
• Coherence enforcement strategy
• Updating / invalidating entries
• Precision of block-sharing information
• How is sharing information stored?
• Cache block size
Enforcing Coherence
▪ Mechanisms to make caches consistent:
• Write-update (WU)
• Write-invalidate (WI)
• Hybrid protocols, competitive-update (CU)
▪ Performance of WU and WI vary depending on the application and the
number of writes
▪ Hybrid protocols switch between WU and WI based on the number of
writes to a block
Hardware Protocols - Snooping
▪ Snooping protocols rely on a shared bus between the processors for
coherence
▪ On a processor write, the write is passed through the cache to main
memory on the bus
• Any processor caching the address may update or invalidate its cache entry as
appropriate
• Snooping protocols do not scale well beyond 32 processors because of the shared
bus
• The choice between WU, WI, and CU is especially important to reduce
communication
Snooping (1/3)

The most popular protocol to maintain cache coherency
Snooping (2/3)
▪ Write invalidate
• The writing processor issues an invalidation signal over the
bus
• All caches check to see if they have a copy
• If so, they must invalidate the block containing the word
• The writing processor is then free to update the local data
until another processor asks for it
▪ Write update
• The writing processor broadcasts new data over the bus
• All copies are updated with the new value
Shared data has lower spatial and temporal locality than other
types of data. Shared data misses often dominate cache behaviour
even though they may be just 10% of the data accesses
Snooping (3/3)
▪ On a write
• All caches check to see if they have a copy and then act to either invalidate or update
their copy to the new value as per the snooping protocol in use
▪ On a read miss
• All caches check to see if they have a copy of the requested block and take
appropriate action
• e.g. supplying data to the cache that missed
▪ Every bus transaction checks cache address tags
• The address tag portion of the cache is duplicated to get an extra read port for
snooping
• Snooping does not interfere with the processor’s access to the cache
MSI Protocol
States: M (Modified), S (Shared), I (Invalid)
Processor events: LOAD, STORE. Bus events from other caches: L_REQ (load request), S_REQ (store request). FLUSH = write the modified block back on the bus.
• I → M on STORE; I → S on LOAD
• S → M on STORE; S → I on another cache's S_REQ; S stays in S on LOAD or L_REQ
• M → S on another cache's L_REQ (with FLUSH); M → I on another cache's S_REQ (with FLUSH); M stays in M on LOAD or STORE
MESI Protocol

▪ Modified state is the same as “Valid Dirty”
▪ Shared and Exclusive states imply clean data – memory has up-to-
date version of the data
• Exclusive state implies that this is the only copy of the data
• A write to data in the exclusive state does not require an invalidation
• Shared state implies that there are multiple copies of the data
MESI
Transitions on processor requests

Current State   Processor Request   Next State
I               Load                E (no other cache has a copy) / S (another cache has a copy)
I               Store               M
E               Load                E
E               Store               M
S               Load                S
S               Store               M
M               Load                M
M               Store               M
MESI
Transitions on requests from other cache controllers

Current State   Request from other Cache Controller   Next State
E               Load (L_REQ)                          S
E               Store (S_REQ)                         I
S               Load                                  S
S               Store                                 I
M               Load                                  S (with FLUSH)
M               Store                                 I (with FLUSH)
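The two tables can be captured in a small transition function. The sketch below is illustrative C with invented names; the FLUSH action (writing the modified block back on the bus) is reported through an output flag.

```c
enum mesi { M_INVALID, M_SHARED, M_EXCLUSIVE, M_MODIFIED };

/* Transition on a request from this cache's own processor. */
enum mesi on_processor(enum mesi s, int is_store, int others_have_copy) {
    if (is_store)
        return M_MODIFIED;                    /* any store ends in M          */
    if (s == M_INVALID)                       /* load miss                    */
        return others_have_copy ? M_SHARED : M_EXCLUSIVE;
    return s;                                 /* M, E, S unchanged on a load  */
}

/* Transition on a request snooped from another cache controller. */
enum mesi on_snoop(enum mesi s, int is_store, int *flush) {
    *flush = (s == M_MODIFIED);               /* only a modified block flushes */
    if (s == M_INVALID)
        return M_INVALID;                     /* nothing cached here           */
    return is_store ? M_INVALID               /* other writer: invalidate      */
                    : M_SHARED;               /* other reader: downgrade to S  */
}
```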
Cache Design
• Design a cache-memory system for a processor with an 8-bit data bus. It has 4 MBytes of RAM and 16 KBytes of on-chip cache. The cache is 4-way set associative. Assume that a cache line (cache block) is 128 bytes long.
• Minimum address bus width?
• The tag field?
• Index?
• Offset?
• Number of sets?
• Number of possible (competing) memory blocks per set?
• Bits required to address 4 MB?
Cache Design Solution
▪ RAM: Minimum 22 bits required to address 4 MBytes of memory
▪ Offset = number of bits required to address a byte within a 128-byte block = 7 bits
▪ Number of sets =
• Cache size / (# of cache lines per set * length of cache block)
• = 16K / (4 * 128) = 32
• Therefore, index field length = log2 32 = 5 bits
▪ Tag field = 22 - (5 + 7) = 10 bits
▪ We have 32 sets, with each set holding 4 cache blocks.
• Total memory blocks = 4 MB / 128 bytes = 32768
• Number of competing memory blocks for a particular set:
• 32768 / 32 (total memory blocks / number of sets) = 1024
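As a quick check, the short C program below recomputes these quantities from the problem parameters; the variable names are invented for this sketch.

```c
#include <stdio.h>

int main(void) {
    unsigned long ram_bytes   = 4ul * 1024 * 1024;    /* 4 MB main memory   */
    unsigned long cache_bytes = 16ul * 1024;          /* 16 KB cache        */
    unsigned long block_bytes = 128;                   /* cache line size    */
    unsigned long ways        = 4;                     /* 4-way set assoc.   */

    unsigned long sets       = cache_bytes / (ways * block_bytes);  /* 32    */
    unsigned long mem_blocks = ram_bytes / block_bytes;             /* 32768 */
    unsigned long competing  = mem_blocks / sets;                   /* 1024  */

    /* bit widths = log2 of each power-of-two quantity */
    int addr_bits = 0, offset_bits = 0, index_bits = 0;
    for (unsigned long v = ram_bytes;   v > 1; v >>= 1) addr_bits++;   /* 22 */
    for (unsigned long v = block_bytes; v > 1; v >>= 1) offset_bits++; /*  7 */
    for (unsigned long v = sets;        v > 1; v >>= 1) index_bits++;  /*  5 */

    printf("address bits = %d, offset = %d, index = %d, tag = %d\n",
           addr_bits, offset_bits, index_bits,
           addr_bits - offset_bits - index_bits);                      /* 10 */
    printf("sets = %lu, memory blocks = %lu, competing per set = %lu\n",
           sets, mem_blocks, competing);
    return 0;
}
```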
