Practice Set 5
Memory Hierarchy
Problem 2
You are building a computer system around a processor with in-order execution that runs at 1
GHz and has a CPI of 1, excluding memory accesses. The only instructions that read or write data
from/to memory are loads (20% of all instructions) and stores (5% of all instructions).
The memory system for this computer has a split L1 cache. Both the I-cache and the D-cache are
direct mapped and hold 32 KB each. The I-cache has a 2% miss rate and 64 byte blocks, and the
D-cache is a write-through, no-write-allocate cache with a 5% miss rate and 64 byte blocks. The
hit time for both the I-cache and the D-cache is 1 ns. The L1 cache has a write buffer. 95% of
writes to L1 find a free entry in the write buffer immediately. The other 5% of the writes have to
wait until an entry frees up in the write buffer (assume that such writes arrive just as the write
buffer initiates a request to L2 to free up its entry and the entry is not freed up until the L2 is done
with the request). The processor is stalled on a write until a free write buffer entry is available.
The L2 cache is a unified write-back, write-allocate cache with a total size of 512 KB and a block
size of 64-bytes. The hit time of the L2 cache is 15ns. Note that this is also the time taken to write
a word to the L2 cache. The local hit rate of the L2 cache is 80%. Also, 50% of all L2 cache
blocks replaced are dirty. The 64-bit wide main memory has an access latency of 20ns (including
the time for the request to reach from the L2 cache to the main memory), after which any number
of bus words may be transferred at the rate of one bus word (64-bit) per bus cycle on the 64-bit
wide 100 MHz main memory bus. Assume inclusion between the L1 and L2 caches and assume
there is no write-back buffer at the L2 cache. Assume a write-back takes the same amount of time
as an L2 read miss of the same size.
While calculating any time values (such as hit time, miss penalty, AMAT), please use ns
(nanoseconds) as the unit of time. For miss rates below, give the local miss rate for that cache.
By miss penaltyL2, we mean the time from the miss request issued by the L2 cache up to the time
the data comes back to the L2 cache from main memory.
Part A
Computing the AMAT (average memory access time) for instruction accesses.
i. Give the values of the following terms for instruction accesses.
hit timeL1, miss rateL1, hit timeL2, miss rateL2
hit timeL1 = 1 processor cycle = 1 ns
miss rateL1= 0.02
hit timeL2 = 15 ns
miss rateL2 = 1 - 0.8 = 0.2
ii. Give the formula for calculating miss penaltyL2, and compute the value of miss
penaltyL2.
miss penaltyL2 = memory access latency + time to transfer one L2 cache block
Transfer rate of memory bus = 64 bits / bus cycle = 64 bits / 10 ns = 8 bytes / 10 ns = 0.8 bytes/ns
Time to transfer one L2 cache block = 64 bytes / (0.8 bytes/ns) = 80 ns.
So, miss penaltyL2 = 20 + 80 = 100 ns
However, 50% of all replaced blocks are dirty and so they need to be written back to main
memory. This takes another 100 ns.
Therefore, miss penaltyL2 = 100 + 0.5 x 100 = 150 ns
iii. Give the formula for calculating the AMAT for this system using the five terms whose
values you computed above and any other values you need.
AMAT = hit timeL1 + miss rateL1 x (hit timeL2 + miss rateL2 x miss penaltyL2)
iv. Plug in the values into the AMAT formula above, and compute a numerical value for
AMAT for instruction accesses.
AMAT = 1 + 0.02 x (15 + 0.2 x 150) = 1.9 ns
Part B
Computing the AMAT for data reads.
i. Give the value of miss penaltyL2 for data reads.
miss penaltyL2 = same as for instruction accesses = 150 ns
ii. Calculate the value of the AMAT for data reads using the above value, and other values
you need.
AMAT = hit timeL1 + miss rateL1 x (hit timeL2 + miss rateL2 x miss penaltyL2)
AMAT = 1 + 0.05 x (15 + 0.2 x 150)
= 3.25 ns
Part C
Computing the AMAT for data writes.
i. Give the value of miss penaltyL2 for data writes.
miss penaltyL2 = miss penaltyL2 for the data read case
So, miss penaltyL2 = 150 ns
(Assuming that after the block is read into the L2 cache from main memory, no further
time is spent writing to it; in other words, the time to write to it is included in the
150 ns value. This value of 150 ns is used in the solutions for all subsequent parts.
A value of 151 ns is also perfectly acceptable, assuming that one additional cycle (1 ns)
is spent writing to the block once it has arrived in the L2 cache.)
ii. Give the value of write timeL2Buff for a write buffer entry being written to the L2 cache.
As the L2 cache hit rate is 80%, only 20% of the write buffer writes will miss in the L2
cache and will thus incur the miss penaltyL2.
So, write timeL2Buff = hit timeL2 + 0.2 x miss penaltyL2
= 15 + 0.2 x 150
= 45 ns
iii. Calculate the value of the AMAT for data writes using the above two values, and any
other values that you need. Only include the time that the processor will be stalled. Hint:
There are two cases to be considered here depending upon whether the write buffer is full
or not.
There are two cases to consider here. In 95% of the cases the write buffer will have a free
entry, so the processor will only need to wait 1 cycle. In the remaining 5% of the cases,
the write buffer will be full, and the processor will have to wait for the additional time
taken for a buffer entry to be written to the L2 cache, which is write timeL2Buff.
So, AMAT for data writes = 0.95 x 1 + 0.05 x (1 + 45) = 3.25 ns
Part D
Compute the overall CPI, including memory accesses (instructions plus data). Assume that
there is no overlap between the latencies of instruction and data accesses.
The CPI excluding memory accesses = 1
We are given that 20% of the instructions are data reads (loads), and 5% are data writes
(stores).
Also, note that 100% of the instructions require an instruction fetch.
Since one clock cycle on this system is 1 ns, we can use the AMAT values directly.
So, CPI including memory accesses
= 1 + (AMAT for instructions - 1) + 0.2 x (AMAT for data reads - 1) + 0.05 x (AMAT for data writes - 1)
= 1 + (1.9 - 1) + 0.2 x (3.25 - 1) + 0.05 x (3.25 - 1)
= 2.46
Note: We are subtracting 1 cycle (1ns) from all of the AMAT times (instruction, data
read and data write), because in the pipeline 1 cycle of memory access is already
accounted for in the CPI of 1.
Problem 3
Way prediction allows an associative cache to provide the hit time of a direct-mapped cache. The
MIPS R10000 processor uses way prediction to achieve a different goal: reduce the cost of the
chip package. The R10000 hardware includes an on-chip L1 cache, on-chip L2 tag comparison
circuitry, and an on-chip L2 way prediction table. L2 tag information is brought on chip to detect
an L2 hit or miss. The way prediction table contains 8K 1-bit entries, each corresponding to two
L2 cache blocks. L2 cache storage is built external to the processor package, is 2-way associative,
and may have one of several block sizes.
a. How can way prediction reduce the number of pins needed on the R10000 package to
read L2 tags and data, and what is the impact on performance compared to a package
with a full complement of pins to interface to the L2 cache?
Solution:
When way prediction is not used, the chip would need to access L2 tags for both associative
ways. Ideally, this would be done in parallel; thus, the R10000 and L2 chips would need
enough pins to bring both tags onto the processor for comparison. With way prediction, we
need only bring the tag for the way that was predicted; in the less likely case where the
predicted way is incorrect, we could load the other tag with minimal penalty.
b. What is the performance drawback of just using the same smaller number of pins but not
including way prediction?
Solution:
To use the smaller number of pins without way prediction, we would check the tags for the
two ways one after the other. Now, when we have a hit, on average half the time we will get
the correct tag first, and half the time we will get the correct tag second. With the way
prediction, we were getting the correct tag a high fraction of the time, so the average L2
access time will be higher.
c. Assume that the R10000 uses most-recently used way prediction. What are reasonable
design choices for the cache state update(s) to make when the desired data is in the
predicted way, the desired data is in the non-predicted way, and the desired data is not in
the L2 cache? Please fill in your answers in the following table.
Solution:
Cache Access Case   Desired data is in    Desired data is in the   Desired data is not
                    the predicted way     non-predicted way        in the L2 cache
Cache data          No change             No change                No change
d. For a 1024 KB L2 cache with 64-byte blocks and 8-way set associativity, how many way
prediction table entries are needed?
Solution:
The number of blocks in the L2 cache = 1024KB / 64B = 16K
The number of sets in the L2 cache = 16K / 8 = 2K
Thus, the number of way prediction table entries needed = 2K
e. For an 8 MB L2 cache with 128-byte blocks and 2-way set associativity, how many way
prediction table entries are needed?
Solution:
The number of blocks in the L2 cache = 8MB / 128B = 64K
The number of sets in the L2 cache = 64K / 2 = 32K
Thus, the number of way prediction table entries needed = 32K
f. What is the difference in the way that the R10000 with only 8K way prediction table
entries will support the cache in part d) versus the cache in part e)? Hint: Think about the
similarity between a way prediction table and a branch prediction table.
Solution:
Since the R10000 way prediction table has 8K entries, it can easily support the cache in part
d), which needs only 2K entries. However, this table is too small to provide all of the 32K
entries required by the cache in part e). One idea is to make each prediction entry in part e)
correspond to four different sets. However, this introduces the possibility of interference,
just as we have seen previously with branch history tables.
Problem 4
Consider the following piece of code:
register int i, j;                       /* i, j are in the processor registers */
register float sum1, sum2, a[64][64], b[64][64];

for ( i = 0; i < 64; i++ )               /* 1 */
{
    for ( j = 0; j < 64; j++ ) {         /* 2 */
        sum1 += a[i][j];                 /* 3 */
    }
    for ( j = 0; j < 32; j++ ) {         /* 4 */
        sum2 += b[i][2*j];               /* 5 */
    }
}
Assume the following:
There is a perfect instruction cache; i.e., do not worry about the time for any instruction
accesses.
Both int and float are of size 4 bytes.
Assume that only the accesses to the array locations a[i][j] and b[i][2*j] generate loads to the
data cache. The rest of the variables are all allocated in registers.
Assume a fully associative, LRU data cache with 32 lines, where each line has 16 bytes.
Initially, the data cache is empty.
The arrays a and b are stored in row major form.
To keep things simple, we will assume that statements in the above code are executed
sequentially. The time to execute lines (1), (2), and (4) is 4 cycles for each invocation. Lines
(3) and (5) take 10 cycles to execute and an additional 40 cycles to wait for the data if there is
a data cache miss.
There is a data prefetch instruction with the format prefetch(array[index]). This prefetches the
entire block containing the word array[index] into the data cache. It takes 1 cycle for the
processor to execute this instruction and send it to the data cache. The processor can then go
ahead and execute subsequent instructions. If the prefetched data is not in the cache, it takes
40 cycles for the data to get loaded into the cache.
Assume that the arrays a and b both start at cache line boundaries.
a. How many cycles does the above code fragment take to execute if we do NOT use
prefetching? Also calculate the average number of cycles per outer-loop iteration.
Solution:
Number of cycles taken by line 1 = 64 x 4 = 256
Number of cycles taken by line 2 = 64 x 64 x 4 = 16384
Number of cycles taken by line 3 = 64 x 64 x 10 + (64 x 16) x 40 = 40960 + 40960 = 81920
(each 16-byte cache line holds 4 floats, so the sequential accesses to a[i][j] miss on
every 4th access, i.e., 16 misses per row)
Number of cycles taken by line 4 = 64 x 32 x 4 = 8192
Number of cycles taken by line 5 = 64 x 32 x 10 + (64 x 16) x 40 = 20480 + 40960 = 61440
(the accesses b[i][2*j] touch two elements in each of the 16 cache lines of row i of b,
so there are again 16 misses per row)
Total number of cycles = 256 + 16384 + 81920 + 8192 + 61440 = 168192
Average number of cycles per outer-loop iteration = 168192 / 64 = 2628
Problem 5
Consider a system with the following processor components and policies:
A direct-mapped L1 data cache of size 4KB and block size of 16 bytes, indexed
and tagged using physical addresses, and using a write-allocate, write-back policy
A fully-associative data TLB with 4 entries and an LRU replacement policy
40-bit virtual addresses, 32-bit physical addresses, and a virtual page size of 1 MB
Part A
Which bits of the virtual address are used to obtain a virtual to physical translation from
the TLB? Explain exactly how these bits are used to make the translation, assuming there
is a TLB hit.
Solution
The virtual address is 40 bits long. Because the virtual page size is 1MB = 2^20 bytes,
and memory is byte addressable, the virtual page offset is 20 bits. Thus, the upper
40 - 20 = 20 bits are used for address translation at the TLB. Since the TLB is fully
associative, all of these bits are used for the tag; i.e., there are no index bits.
When a virtual address is presented for translation, the hardware first checks to see if the
20 bit tag is present in the TLB by comparing it to all other entries simultaneously. If a
valid match is found (i.e., a TLB hit) and no protection violation occurs, the page frame
number is read directly from the TLB.
Part B
Which bits of the virtual or physical address are used as the tag, index, and block offset
bits for accessing the L1 data cache? Explicitly specify which of these bits can be used
directly from the virtual address without any translation.
Solution
Since the cache is physically indexed and physically tagged, all of the bits for accessing
the cache must come from the physical address. However, since the lowest 20 bits of the
virtual address form the page offset and are therefore not translated, these 20 bits can be
used directly from the virtual address. The remaining 12 bits (of the total of 32 bits in the
physical address) must be used after translation.
Since the block size is 16 bytes = 2^4 bytes, and memory is byte addressable, the lowest
4 bits are used as block offset.
Since the cache is direct mapped, the number of sets is 4KB/16 bytes = 2^8. Therefore, 8
bits are needed for the index.
The remaining 32-8-4 = 20 bits are needed for the tag.
Tag (20 bits)    Index (8 bits)    Offset (4 bits)
31------------------12 11----------4 3-------------0
As mentioned above, the index and offset bits can be used before translation, while the
tag bits must await the translation of the 12 uppermost bits.
Part C
The following lists part of the page table entries corresponding to a few virtual addresses
(using hexadecimal notation). Protection bits of 01 imply read-only access and 11 implies
read/write access. Dirty bit of 0 implies the page is not dirty. Assume the valid bits of all
the following entries are set to 1.
Entry   Virtual page number   Protection bits   Dirty bit
1       FFFFF                 11                0
2       FFFFE                 11                0
3       FFFFD                 11                0
4       FFFFC                 11                0
5       FFFFB                 11                0
6       FFFFA                 01                0
The following table lists a stream of eight data loads and stores to virtual addresses by the
processor (all addresses are in hexadecimal). Complete the rest of the entries in the table
corresponding to these loads and stores using the above information and your solutions to
parts A and B. For the data TLB hit, data cache hit, and protection violation columns,
specify yes or no. Assume initially the data TLB and data cache are both empty.
Solution

#   Processor load/store   Corresponding      Data cache   Protection   Dirty
    to virtual address     physical address   hit?         violation?   bit
1   CFCABAC1               AC                 No           No           No
2   CBAECAB1               AB                 No           No           No
3   CFCBAAE3               AE                 Yes          No           No
4   CAACEBC3               BC                 No           No           No
5   CACAAFA1               FA                 No           No           No
6   CBAAABC9               BC                 Yes          No           No
7   CFCBAAE2               AE                 No           Yes          No
8   CCAABAC4               AC                 No           No           Yes
Problem 6
Consider a 4-way set-associative L1 data cache with a total of 64 KB of data. Assume
the cache is write-back, with a block size of 16 bytes. Further, the cache is virtually-indexed
and physically-tagged, meaning that the index field of the address comes from the
virtual address generated by the CPU, and the tag field comes from the physical address
that the virtual address is translated into. The data TLB in the system has 128 entries and
is 2-way set associative. The physical and virtual addresses are 32 bits and 50 bits,
respectively. Both the cache and the memory are byte addressable. The physical
memory page size is 4 kilobytes.
Part A
Show the bits from the physical and virtual addresses that are used as the block offset,
index, and tag bits for the cache. Similarly, show the bits from the two addresses that are
used as the page offset, index, and tag bits for the TLB.
Solution
For the TLB, we only use the virtual address. Each page has 4 KB = 2^12 bytes, and we
need 12 bits to be able to uniquely address each of these bytes, so the page offset is the
least significant (rightmost) 12 bits of the address. The TLB has 128 entries divided into
64 = 2^6 sets, and we need 6 bits to be able to uniquely address each of these 64 sets, so
the next 6 bits represent the index. As stated, the virtual address is 50 bits long, so the
remaining 32 most significant bits make up the tag field of the address. This is illustrated
below.

TLB lookup (virtual address)
Tag (32 bits)    Index (6 bits)    Page Offset (12 bits)
49------------------18 17--------12 11-------------0

For the cache, the block size is 16 = 2^4 bytes, so the lowest 4 bits of the address are the
block offset. The cache has 64 KB / (16 B x 4 ways) = 1024 = 2^10 sets, so the next 10
bits are the index; since the cache is virtually indexed, these bits come from the virtual
address. Since the cache is physically tagged, the tag bits come from the physical address,
using bits 14 to 31.

Cache lookup
Tag (18 bits)    Index (10 bits)    Block Offset (4 bits)
31------------------14 13--------------4 3-------------0
(the tag comes from the physical address; the index and block offset come from the
virtual address)
Part B
Can the cache described for this problem have synonyms? If no, explain in detail why
not. If yes, first explain in detail why and then explain how and why you could eliminate
this problem by changing the cache configuration, but not changing the total data storage
in the cache? Would you expect this cache to perform worse or better than the original
design? Please provide a justification for your answer.
Solution
Yes, the original cache can have synonyms because 2 bits of the virtual page-frame
number are used in the index. These bits could result in the synonym problem since
different values for these bits may map to the same physical address. To eliminate this
problem, the cache needs to use only virtual bits that correspond directly to physical bits
(the page offset) for the index. This can be done by reducing the number of index bits
and increasing the number of tag bits. Since we need to reduce the number of index bits
by two, we make the cache 4-way x 2^2 = 16-way set associative. A cache lookup will
then use bits 12-31 as tag, bits 4-11 as index, and bits 0-3 as block offset. Since bits 0-11
are the page offset, we effectively use only physical address bits for indexing. Since the
cache is physically tagged, the cache lookup uses only physical address bits and there can
be no synonyms.
Cache lookup with new configuration
Tag (20 bits)    Index (8 bits)    Block Offset (4 bits)
31------------------------12 11----------4 3---------------0
(bits 31-12 come from the translated page-frame address; bits 11-0 are the untranslated
page offset)
For whether the new cache will perform better, reasonable answers will be accepted. The
preferred response is that the new cache has the advantage that the cache can be indexed
completely in parallel with the TLB access since the page offset is already aligned to the
boundary of the cache's tag and index fields. Therefore, the page offset can be used in its
un-translated form to index the cache. The disadvantage of this approach is that a 16-way
associative cache is difficult to build and will require many tag comparisons to operate in
parallel. Therefore, whether the new cache is faster depends on whether the advantage
outweighs the disadvantage.
Problem 7
A graduate student comes to you with the following graph. The student is performing
experiments by varying the amount of data accessed by a certain benchmark. The only
thing the student tells you of the experiments is that their system uses virtual memory, a
data TLB, only one level of data cache, and the data TLB maps a much smaller amount
of data than can be contained in the data cache. You may assume that there are no
conflict misses in the caches and TLB. Further assume that instructions always fit in the
instruction TLB and an L1 instruction cache.
[Figure: execution time (y axis) versus data size in KB (x axis). The curve is divided into
regions numbered 1 through 7, with data-size points a, b, and c marked on the x axis.]
Part A
Give an explanation for the shape of the curve in each of the regions numbered 1 through
7.
Solution
1: Execution time slowly increases (performance decreases) due to increasing data size
but remains at a roughly similar level.
2: At this point, the TLB overflows and execution time sharply increases to handle the
increased TLB misses.
3: Execution time again slowly increases due to increasing data size and plateaus at a
higher level than before due to overhead from TLB misses.
4: At this point, the data cache overflows, causing a high frequency of cache misses and
execution time again sharply increases.
5: Execution time again slowly increases due to increasing data size and plateaus at a
high level due to overhead from retrieving data directly from main memory due to cache
misses.
6: Execution time again sharply increases due to physical memory filling up and
thrashing occurring between disk and physical memory.
7: Execution time is very high due to overhead from TLB misses, cache misses and
virtual memory thrashing. It is slowly increasing due to increasing data size.
Part B
From the graph, can you make a reasonable guess at any of the following system
properties? If so, what are they? If not, why not? Explain your answers. (Note: your
answers can be in terms of a, b, and c).
(i)
Number of TLB entries
(ii)
Page size
(iii) Physical memory size
(iv)
Virtual memory size
(v)
Cache size
Solution
There is no reasonable guess for page size and virtual memory size. There is also no
reasonable guess for the number of TLB entries since it depends on the page size.
It is acceptable if you guess that the cache size is b KB and the physical memory size is c
KB, since these are the points at which the execution time shows significant
degradations. However, these quantities are actually only upper bounds, since the actual
size of these structures depends on the temporal and spatial reuse in the access stream.
(The actual size depends on a property known as the working set of the application.)
Problem 8
Consider a memory hierarchy with the following parameters. Main memory is interleaved
on a word basis with four banks and a new bank access can be started every cycle. It
takes 8 processor clock cycles to send an address from the cache to main memory; 50
cycles for memory to access a block; and an additional 25 cycles to send a word of data
back from memory to the cache. The memory bus width is 1 word. There is a single level
of data cache with a miss rate of 2% and a block size of 4 words. Assume 25% of all
instructions are data loads and stores. Assume a perfect instruction cache; i.e., there are
no instruction cache misses. If all data loads and stores hit in the cache, the CPI for the
processor is 1.5.
Part A
Suppose the above memory hierarchy is used with a simple in-order processor and the
cache blocks on all loads and stores until they complete. Compute the miss penalty and
resulting CPI for such a system.
Solution
Miss penalty = 8 + 50 + 25*4 = 158 cycles.
CPI = 1.5 + (0.25* 0.02 * 158) = 2.29
Part B
Suppose we now replace the processor with an out-of-order processor and the cache with
a non-blocking cache that can have multiple load and store misses outstanding. Such a
configuration can overlap some part of the miss penalty, resulting in a lower effective
penalty as seen by the processor. Assume that this configuration effectively reduces the
miss penalty (as seen by the processor) by 20%. What is the CPI of this new system and
what is the speedup over the system in Part A?
Solution
Effective miss penalty = 0.80 * 158 = 126.4, approximately 126 cycles.
CPI = 1.5 + (0.25 * .02 * 126) = 2.13
Speedup over the system in part A is 2.29/2.13 = 1.08.
Part C
Start with the system in Part A for this part. Suppose now we double the bus width and
the width of each memory bank. That is, it now takes 50 cycles for memory to access the
block as before, and an additional 25 cycles to send each double word of data back from
memory to the cache. What is the miss penalty now? What is the CPI? Is this system
faster or slower than that in Part B?
Solution
Miss penalty = 8 + 50 + 25*2 = 108 cycles.
CPI = 1.5 + (0.25 * .02 * 108) = 2.04.
This system is slightly faster than that in part B.