
Cache Memory and Virtual Memory

1. The role of cache memory
2. Cache memory components
3. Cache memory architecture
4. Cache memory organization
5. Pentium's cache memory
6. Cache memory features identification
7. Virtual memory

Reference: http://arstechnica.com/articles/paedia/cpu/caching.ars/2
The memory hierarchy

Level                      Access Time                 Typical Size    Technology    Managed By
Registers                  1-3 ns                      ~1 KB           Custom CMOS   Compiler
Level 1 cache (on-chip)    2-8 ns                      8 KB - 128 KB   SRAM          Hardware
Level 2 cache (off-chip)   5-12 ns                     0.5 MB - 8 MB   SRAM          Hardware
Main memory                10-60 ns                    64 MB - 1 GB    DRAM          Operating system
Hard disk                  3,000,000 - 10,000,000 ns   20 - 100 GB     Magnetic      Operating system / user
• What is cache memory (CM)?
- A small-capacity memory (SRAM) that stores the most recently accessed memory locations from the main memory (DRAM)
- The CM concept first appeared in the IBM S/360 Model 85 (1968)

• Why is CM useful?
- On today's processors, the time to read an instruction from DRAM is much longer than the time to execute it
  e.g. Tacc ~ 10-60 ns for DRAM; Texec = 1 CLK ~ 10 ns on a Pentium 100
  ==> a bottleneck at the processor input; CM offers Tacc < 10 ns

• How can such a small memory improve system performance?
- The principle is "locality of reference"
• The memory hierarchy works because of the locality of reference

Locality principle
• Programs access a relatively small area of the address space at any given time
• 90/10 rule: 90% of accesses are made to 10% of the memory locations

Types of locality (illustrated by the sketch below)
- Temporal: if a location has been accessed recently, it tends to be accessed again soon
- Spatial: if a location has been accessed recently, neighboring locations tend to be accessed soon
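A minimal C sketch of these two kinds of locality (illustrative only, not taken from the slides): the same few loop instructions are reused on every iteration (temporal locality for code), while the array elements sit at consecutive addresses (spatial locality for data).

```c
#include <stdio.h>

#define N 1024

int main(void) {
    static int data[N];
    long sum = 0;

    /* Sequential traversal: consecutive addresses -> spatial locality.  */
    /* The same loop instructions execute N times -> temporal locality.  */
    for (int i = 0; i < N; i++)
        data[i] = i;
    for (int i = 0; i < N; i++)
        sum += data[i];

    printf("sum = %ld\n", sum);
    return 0;
}
```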

• The basic idea of the cache hierarchy:
- Place a copy of the most frequently accessed data at the higher levels of the memory hierarchy
- The processor looks for the nearest copy of the data
Spatial locality
• Spatial locality is the easiest type of locality to understand, because most of us have
used media applications like mp3 players, DVD players, and other types of apps
whose datasets consist of large, ordered files.
• Consider an MP3 file, which consists of blocks of data that are consumed by the processor in sequence from the file's beginning to its end. If the CPU is running Winamp and it has just requested second 1:23 of a 5-minute MP3 file, you can be reasonably certain that it will want seconds 1:24, 1:25, and so on next. The same holds for a DVD stream and for many other kinds of media files, such as images, AutoCAD drawings, and Quake levels. All of these applications operate on large arrays of sequentially ordered data that the CPU grinds through in order, again and again.

In the accompanying figure, the red cells are related chunks of data in the memory array. The picture shows a program with fairly good spatial locality, since the red cells are clumped closely together. In an application with poor spatial locality, the red cells would be randomly distributed among the unrelated blue cells.

Temporal locality
• Consider a simple Photoshop filter that inverts an image to produce a
negative one; there's a small piece of code that performs the same
inversion on each pixel starting at one corner and going in sequence
all the way across and down to the opposite corner.
• This code is just a small loop that gets executed repeatedly on each
pixel, so it's an example of code that is reused again and again. Media
apps, games, and simulations, since they use lots of small loops that
iterate through very large datasets, have excellent temporal locality
for code.
• However, it's important to note that these kinds of apps have
extremely poor temporal locality for data.
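A rough sketch of the kind of filter loop described above (the 8-bit grayscale pixel format and the function name are assumptions for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Invert an 8-bit grayscale image in place. The tiny loop body is reused
   once per pixel (excellent temporal locality for code), while each pixel
   is touched only once (poor temporal locality for data). */
void invert_image(uint8_t *pixels, size_t width, size_t height) {
    for (size_t i = 0; i < width * height; i++)
        pixels[i] = 255 - pixels[i];
}
```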

Remarks.
Returning to our MP3 example, a music file is usually played through
once in sequence and none of its parts are repeated. This being the
case, it's actually kind of a waste to store any of that file in the cache,
since it's only going to stop off there temporarily before passing through
to the CPU. When an app fills up the cache with data that doesn't really
need to be cached because it won't be used again and as a result winds
up bumping out of the cache data that will be reused, that app is said to
"pollute the cache."

Media apps, games, and the like are big cache polluters, which is why
they weren't too affected by the original Celeron's lack of cache.
Because they were streaming data through the CPU at a very fast rate,
they didn't actually even care that their data wasn't being cached. Since
this data wasn't going to be needed again anytime soon, the fact that it
wasn't in a readily accessible cache didn't really matter.

• How efficient is this mechanism?
- On a Pentium 100 with 16 KB of CM, ~90% of the requested addresses are found in the CM, so 90% of accesses are fast (SRAM) accesses

• Why not replace all the DRAM with SRAM?
- Cost and power consumption

Figure: relative system performance as a function of cache size (www.intel.com)

2. Cache memory components and related terms

• SRAM is the static RAM block that holds the cached data/code
• Tag RAM (TRAM) is a small part of the SRAM that stores the addresses of the data held in the SRAM
• The cache controller manages access to the cache memory; its tasks are:
- to supervise the data flow requested by the processor;
- to refresh the SRAM and the TRAM;
- to implement the write method;
- to decide whether a request is a "cache miss" or a "cache hit"

Basic model of the cache: CPU <-> cache memory (SRAM + Tag RAM + cache controller) <-> DRAM <-> system interface
Cache memory access mechanism
• Cache Hit: the searched data is in a block in the cache (example: Block X)
- Hit Rate: the fraction of memory accesses that find the data in the cache
- Hit Time: SRAM access time + the time needed to determine hit/miss
• Cache Miss: the data must be retrieved from the lower level (Block Y)
- Miss Rate = 1 - Hit Rate
- Miss Penalty: the time to fetch a block from lower-level memory
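These quantities combine into the usual average memory access time formula, AMAT = hit time + miss rate × miss penalty. A small sketch with illustrative numbers (the 2 ns and 50 ns figures are assumptions roughly consistent with the hierarchy table above; the 90% hit rate matches the Pentium 100 example given earlier):

```c
#include <stdio.h>

int main(void) {
    double hit_time     = 2.0;   /* ns: SRAM access + hit/miss check (assumed) */
    double miss_penalty = 50.0;  /* ns: fetch a block from DRAM (assumed)      */
    double hit_rate     = 0.90;  /* ~90% of accesses hit the cache             */

    /* AMAT = hit time + miss rate * miss penalty */
    double amat = hit_time + (1.0 - hit_rate) * miss_penalty;
    printf("AMAT = %.1f ns\n", amat);   /* 2 + 0.1 * 50 = 7 ns */
    return 0;
}
```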
• Cache consistency: the CM is a copy of a small area of the main memory, so it must always reflect the content of the main memory, i.e. (SRAM) ≡ (DRAM)

• Snooping: the cache controller's supervision of the address lines during a transfer;
• Snarf: the update operation in which the cache takes the information from the data lines;
• The "snoop-snarf" processes allow the cache to keep its consistency

Two terms describe cache inconsistency:
• "Dirty data": the data have been modified in the cache (CM) but not in the main memory;
• "Stale data": the data have been modified in the main memory (DRAM) but not in the cache.

3. The cache architecture

A cache has two defining features:
- a READ architecture
- a WRITE technique (policy)

READ architecture:
- "look aside"
- "look through"

WRITE technique:
- "write-back"
- "write-through"
The "look aside" architecture

(Diagram: the CPU, the cache (SRAM + Tag RAM + cache controller) and the DRAM are all connected to the system interface.)

The "look aside" CM is:
- simple
- cheap
- provides a good response time in case of a "cache miss", because the main memory is accessed in parallel with the cache
The "look through" architecture

(Diagram: the CPU talks to the cache (SRAM + Tag RAM + cache controller), which in turn connects to the system interface.)

This architecture is:
- more complex
- slower on a miss, because the main memory is accessed only after the cache access fails
- more expensive
WRITE techniques

• "Write-back": the cache memory works like a buffer
- When the processor initiates a write cycle, the cache memory receives the data and finalizes the cycle; later, when the system bus is available, the cache memory writes the data into the main memory
- Provides maximum performance, allowing the processor to keep working while the main memory is updated later
- Complex and expensive

• "Write-through": the processor writes into the main memory through the cache memory; the cache updates its content, but the write cycle continues until the data is stored in the main memory
- Less complex and cheaper
- Performance is poorer because the processor must wait until the main memory has stored the new data
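A deliberately simplified sketch contrasting the two policies for a single cache line (the one-line cache, the array sizes and the function names are illustrative assumptions, not the actual hardware mechanism):

```c
#include <stdbool.h>
#include <stdint.h>

static uint32_t main_memory[1024];                               /* toy DRAM       */
static struct { uint32_t addr, data; bool valid, dirty; } line;  /* one cache line */

/* Write-through: the cache and the main memory are updated in the same
   write cycle, so the processor waits for the slow DRAM write.          */
void write_through(uint32_t addr, uint32_t data) {
    line.addr = addr; line.data = data; line.valid = true;
    main_memory[addr] = data;                 /* immediate DRAM update    */
}

/* Write-back: only the cache is updated and the line is marked dirty;
   DRAM is updated later, when the line is evicted or the bus is free.   */
void write_back(uint32_t addr, uint32_t data) {
    if (line.valid && line.dirty && line.addr != addr)
        main_memory[line.addr] = line.data;   /* flush the old dirty line */
    line.addr = addr; line.data = data;
    line.valid = true; line.dirty = true;     /* DRAM is now stale        */
}
```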
4. Cache memory organization

(Diagram: the DRAM is divided into cache pages; each cache page is divided into cache lines, Line 0 ... Line m, and the cache memory holds copies of these lines.)
(Diagram: fully-associative mapping; any line, Line 0 ... Line m, of the DRAM can be placed into any line, Line 0 ... Line k, of the cache memory.)

Fully-associative cache
!!!: - best performance
     - high complexity
     - used only for small caches (< 4 KB)
Direct-mapped cache
!!!: simple, cheap, poorer performance
1 KB Direct Mapped Cache, 32B blocks
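For this 1 KB / 32-byte-block configuration there are 1 KB ÷ 32 B = 32 lines, so a 32-bit address splits into a 5-bit byte offset, a 5-bit line index and a 22-bit tag. A small sketch of that decomposition (the 32-bit address width and the example address are assumptions):

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 32u                          /* bytes per cache line      */
#define CACHE_SIZE 1024u                        /* 1 KB direct-mapped cache  */
#define NUM_LINES  (CACHE_SIZE / BLOCK_SIZE)    /* 32 lines                  */

int main(void) {
    uint32_t addr = 0x12345678u;                /* example address (assumed) */

    uint32_t offset = addr % BLOCK_SIZE;                 /* low 5 bits        */
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_LINES;   /* next 5 bits       */
    uint32_t tag    = addr / (BLOCK_SIZE * NUM_LINES);   /* remaining 22 bits */

    printf("offset=%u  index=%u  tag=0x%x\n", offset, index, tag);
    return 0;
}
```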

(Diagram: set-associative mapping; the DRAM is divided into pages, Page 0 ... Page m, each with lines Line 0 ... Line n; the cache memory is divided into Way 0 and Way 1, and line n of any DRAM page can be placed into line n of either way.)

Set-associative cache memory


!!!: the most widely used organization
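A compact sketch of a lookup in a 2-way set-associative cache (the 2-way structure matches the Pentium caches described in the next section; the set count and line size here are illustrative assumptions):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS  128u          /* sets (illustrative)           */
#define LINE_SIZE 32u           /* bytes per cache line          */
#define WAYS      2u            /* two-way set-associative       */

struct line { uint32_t tag; bool valid; };
static struct line cache[NUM_SETS][WAYS];

/* Return true if the address hits in the cache. */
bool lookup(uint32_t addr) {
    uint32_t set = (addr / LINE_SIZE) % NUM_SETS;
    uint32_t tag =  addr / (LINE_SIZE * NUM_SETS);

    /* The block may sit in either way of its set: compare both tags. */
    for (unsigned w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return true;
    return false;               /* miss: fetch from the next level */
}
```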
5. The Pentium processors' cache memory

(Diagram: CPU with on-chip L1 cache memory, external L2 cache memory, DRAM, and the system interface.)

• The Pentium processors' cache memory is implemented slightly differently from the principles presented above:
- the cache memory is on the same chip as the processor, so no external hardware is needed to use the cache, which reduces system cost and improves speed
- the external interface can be designed on 64 bits, while the internal interface and the processor buffers work on 256 bits
- the cache is divided into two separate components, a data cache and a code cache, to increase performance
Processor       Cache size
80486DX         8 KB L1
Pentium         16 KB L1
Pentium Pro     16 KB L1; 256/512 KB L2
Pentium II      32 KB L1; 256/512 KB L2
Pentium III     32 KB L1; 256/512 KB L2
Pentium 4       20 KB L1; 512 KB L2
Pentium 4 EE    as above, plus 2 MB L3

Cache memory sizes for different Intel processors
• Both caches have the structure of a two-way set-associative cache memory
• The cache line size is 32 bytes, i.e. 256 bits
• A cache line is loaded by a burst of four read operations on the processor's 64-bit data bus
• Each cache way contains 128 cache lines; the cache page size is 4 KB, or 128 lines
• The write method of the Pentium processor allows software to control the cache operating mode through two bits of control register CR0: CD (Cache Disable) and NW (Not Write-through)
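A quick check of the burst arithmetic implied by the bullets above (a worked calculation, not from the slides):

```c
#include <stdio.h>

int main(void) {
    int line_bits = 32 * 8;   /* 32-byte cache line = 256 bits       */
    int bus_bits  = 64;       /* width of the external data bus      */
    printf("reads per line fill: %d\n", line_bits / bus_bits);  /* 4 */
    return 0;
}
```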
Number of caches
• The single cache was initially located off-chip; with increasing chip density, some space became available for an on-chip cache
• The system benefits from this by reducing external bus activity, which speeds up execution
• Because the bus of the on-chip cache is shorter than the system bus (fewer devices are connected to it), the on-chip bus is faster (less delay)
• While the on-chip cache is being used, the system bus is available for other activities

Is an external cache still desirable?
• In short, the answer is yes, and this organization is called a two-level cache.
• The internal cache is L1 and the external cache is L2.
• With no L2 cache (SRAM), in case of a MISS the CPU has to access RAM or ROM directly through the system bus (slow, and performance decreases).
• Many systems use a separate bus between L2 and the processor to reduce the burden on the system bus.
• With the continuous shrinking of processor components, many systems place L2 on the processor chip, improving performance.

Unified versus split cache
• Initially, the L1 cache was used for both data and instructions.
• It has since become common to split the cache into one cache for data and one for instructions.
• Reasons for a unified cache:
- For a given size it has a higher hit rate (it balances data and instructions automatically).
- Only one cache has to be designed and implemented.

Unified versus split cache
• Split caches eliminate contention, which matters particularly for superscalar processors (PowerPC and Pentium) that emphasize parallel instruction execution and prefetching of predicted future instructions. This is very important for any design that depends on pipelining.

Intel cache evolution

Problem: External memory is slower than the system bus.
Solution: Add an external cache using faster memory technology.
First appears on: 386

Problem: Increased processor speed makes the external bus a bottleneck for cache access.
Solution: Move the external cache on-chip, operating at the same speed as the processor.
First appears on: 486

Problem: The internal cache is rather small, due to limited space on the chip.
Solution: Add an external L2 cache using faster technology than main memory.
First appears on: 486

Problem: Contention occurs when both the Instruction Prefetcher and the Execution Unit simultaneously require access to the cache; in that case, the Prefetcher is stalled while the Execution Unit's data access takes place.
Solution: Create separate data and instruction caches.
First appears on: Pentium

Problem: Increased processor speed makes the external bus a bottleneck for L2 cache access.
Solution: Create a separate back-side bus (BSB) that runs at a higher speed than the main (front-side) external bus; the BSB is dedicated to the L2 cache.
First appears on: Pentium Pro

Solution: Move the L2 cache onto the processor chip.
First appears on: Pentium II

Problem: Some applications deal with massive databases and must have rapid access to large amounts of data; the on-chip caches are too small.
Solution: Add an external L3 cache.
First appears on: Pentium III

Solution: Move the L3 cache on-chip.
First appears on: Pentium 4
Pentium 4 Block Diagram

Pentium 4 Core Processor
• Fetch/Decode Unit
— Fetches instructions from L2 cache
— Decode into micro-ops
— Store micro-ops in L1 cache
• Out of order execution logic
— Schedules micro-ops
— Based on data dependence and resources
— May speculatively execute
• Execution units
— Execute micro-ops
— Data from L1 cache
— Results in registers
• Memory subsystem
— L2 cache and system bus
Pentium 4 Design Reasoning
• Decodes instructions into RISC-like micro-ops before the L1 cache
• Micro-ops are fixed length
— Superscalar pipelining and scheduling
• Pentium instructions long & complex
• Performance improved by separating decoding from scheduling &
pipelining
— (More later – ch14)
• Data cache is write back
— Can be configured to write through
• L1 cache controlled by 2 bits in control register CR0
— CD = cache disable
— NW = not write-through
— 2 instructions to invalidate (flush) the cache and to write back then invalidate (INVD and WBINVD)
• L2 and L3 8-way set-associative
— Line size 128 bytes

6. Cache memory features identification
• The CPUID instruction returns data about the internal caches: with EAX = 2, CPUID loads registers EAX, EBX, ECX and EDX with descriptors that describe the cache and TLB features.

Reg. \ bits   31-24   23-16   15-8   7-0
EAX           66h     5Bh     50h    01h
EBX           00h     00h     00h    00h
ECX           00h     00h     00h    00h
EDX           00h     7Ah     70h    40h
Descriptors returned by the CPUID instruction for a P4 processor (EAX = 2)
• (66h) Data cache, 8 KB, 4-way set-associative, 64-byte lines
• (5Bh) Data TLB, 4 KB & 4 MB pages, 64 entries
• (50h) Instruction TLB, 4 KB & 2 MB/4 MB pages, 64 entries
• (7Ah) L2 cache, 256 KB, 8-way set-associative, 64-byte lines
• (70h) Instruction trace cache, 12K µops, 4-way associative
• (40h) No L2 cache (P6 family) or no L3 cache (P4 family)
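A hedged sketch of reading these descriptors in C using the GCC/Clang <cpuid.h> helper (the __get_cpuid wrapper is a compiler convenience and is not part of the original slides; the descriptor bytes are only printed raw here, not decoded):

```c
#include <stdio.h>
#include <cpuid.h>              /* GCC/Clang CPUID helpers */

int main(void) {
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 2 (EAX = 2) returns cache/TLB descriptor bytes in EAX..EDX. */
    if (__get_cpuid(2, &eax, &ebx, &ecx, &edx)) {
        printf("EAX=%08x EBX=%08x ECX=%08x EDX=%08x\n", eax, ebx, ecx, edx);
        /* The low byte of EAX (01h above) is a call count; every other
           non-zero byte is a descriptor such as 66h, 5Bh, 50h, ...     */
    }
    return 0;
}
```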
7. Virtual Memory
- The basic abstraction provided by the OS for memory management
- VM requires both hardware and OS support
● Hardware support: memory management unit (MMU) and
translation lookaside buffer (TLB)
● OS support: virtual memory system to control the MMU and TLB

Motivation for Virtual Memory
1.Use DRAM as a cache for the hard disk
–Address space of a process can exceed DRAM physical size
–Sum of address spaces of processes can exceed DRAM size

2.Simplify memory management


–Multiple processes resident in main memory
•Each process with its own address space
–Only “active” code and data are actually in memory
•Allocate more memory to process as needed

3.Provide protection
–One process can’t interfere with another because they operate in
different address spaces
–User process can’t access privileged information
• different sections of address spaces have different permissions.
A System with Physical Memory Only
Examples: early PCs, nearly all embedded systems, etc.

• Addresses generated by the CPU point directly to bytes in physical memory
A System with Virtual Memory

• Address translation: the hardware converts virtual addresses to physical addresses via an OS-managed lookup table (the page table)
Paging
● Virtual memory unit called a page
● Physical memory unit called a frame (or sometimes page frame)

Page Faults (similar to “Cache Misses”)
•What if an object is on disk rather than in memory?
–Page table entry indicates virtual address not in memory
–OS exception handler invoked to move data from disk into memory

•current process suspends, others can resume

•OS has full control over placement, etc.

Virtual Address Translation
Virtual-to-physical address translation performed by MMU
● Virtual address is broken into a virtual page number and an offset
● Mapping from virtual page to physical frame provided by a page table
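A minimal sketch of that split for 4 KB pages (the page size, the 32-bit address width and the example address are assumptions):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u                  /* 4 KB pages (assumed)    */

int main(void) {
    uint32_t vaddr  = 0x00403A10u;       /* example virtual address */
    uint32_t vpn    = vaddr / PAGE_SIZE; /* virtual page number     */
    uint32_t offset = vaddr % PAGE_SIZE; /* offset within the page  */

    /* The page table maps vpn -> physical frame; the offset is copied
       unchanged: paddr = frame * PAGE_SIZE + offset.                 */
    printf("vpn=0x%x  offset=0x%x\n", vpn, offset);
    return 0;
}
```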

Page Table Entries (PTEs)
-Typical PTE format (depends on CPU architecture!)

Various bits accessed by MMU on each page access:


● Valid bit (V): Whether the corresponding page is in memory
● Modify bit (M): Indicates whether a page is “dirty” (modified)
● Reference bit (R): Indicates whether a page has been accessed (read or
written)
● Protection bits: Specify if page is readable, writable, or executable
● Page frame number: Physical location of page in RAM
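One possible C rendering of such an entry (the field widths and their order are illustrative assumptions; as noted above, the real layout depends on the CPU architecture):

```c
#include <stdint.h>

/* Illustrative 32-bit PTE layout -- widths and order are assumptions. */
typedef struct {
    uint32_t valid      : 1;   /* V: page is present in memory          */
    uint32_t modified   : 1;   /* M: page is dirty (has been written)   */
    uint32_t referenced : 1;   /* R: page was recently read or written  */
    uint32_t prot       : 3;   /* readable / writable / executable bits */
    uint32_t reserved   : 6;
    uint32_t frame      : 20;  /* physical page frame number            */
} pte_t;
```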

Page Tables store the virtual-to-physical address mappings.
- They are located in memory!
- The MMU has a special register, the page table base pointer, which points to the physical memory address of the top of the page table for the currently running process.
● On every memory access, a separate memory access is needed just to consult the page table!

Solution: Translation Lookaside Buffer (TLB)

TLB
• Very fast (but small) cache directly on the CPU
• P6-family processors have separate data and instruction TLBs, 64 entries each
• TLB caches most recent virtual to physical address translations
• Implemented as fully associative cache
• Any address can be stored in any entry in the cache
• All entries searched “in parallel” on every address translation
• A TLB miss requires that the MMU actually try to do the address translation
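A rough sketch of the lookup order, TLB first and page table only on a miss (the flat page table, the replacement choice and all structure names are toy assumptions, not how a real MMU is built):

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64                     /* e.g. 64 fully associative entries */

struct tlb_entry { uint32_t vpn, frame; bool valid; };
static struct tlb_entry tlb[TLB_ENTRIES];
static uint32_t page_table[1u << 20];      /* toy flat page table: vpn -> frame */

/* Translate a virtual page number to a physical frame number. */
uint32_t translate(uint32_t vpn) {
    /* 1. Search the TLB (done "in parallel" by real hardware). */
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return tlb[i].frame;                /* TLB hit: no memory access  */

    /* 2. TLB miss: consult the page table in memory, then refill the TLB. */
    uint32_t frame  = page_table[vpn];
    int      victim = vpn % TLB_ENTRIES;        /* trivial replacement policy */
    tlb[victim] = (struct tlb_entry){ vpn, frame, true };
    return frame;
}
```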

• Memory Management Unit (MMU)
- Hardware that translates a virtual address to a physical address
- Each memory reference is passed through the MMU, which translates the virtual address into a physical address

• Translation Lookaside Buffer (TLB)


- Cache for MMU virtual-to-physical address translations
- Just an optimization – but an important one!

• The main advantage of virtual memory systems is the ability to load and execute a process that requires more memory than is physically available, by loading the process in pieces and then executing it.
• Another advantage is the system's ability to eliminate external fragmentation.

• The downside is that virtual memory systems tend to be slow and require additional hardware support for the complex address translation mechanism.
• The execution speed of a process in a virtual memory system can approach, but never exceed, the execution speed of the same process with virtual memory turned off.
• Frequent accesses to the hard disk shorten the lifespan of the device.
