Topics:
Multi-cycle Datapath
Pipeline
Hazards
Memory Hierarchy
Cache
Virtual Memory
I/O
Part I. Modified True or False: Write T if true and modify the underlined word/phrase if false.
_______ 1. If the main memory size is M and the cache size is N, M/N memory blocks map to
one cache block. (Direct-mapped cache)
_______ 3. The valid bit in the page table identifies whether the data is in the main memory or
in the secondary storage.
_______ 4. For a multi-cycle datapath, R-type instructions take more clock cycles to finish than
Load.
_______ 5. If Processor A has a higher clock frequency than Processor B, Processor A has
higher performance than Processor B.
_______ 6. There are 3 bits allocated for the index field in a fully associative mapping for a
32-byte memory and an 8-byte cache.
_______ 8. Memory-mapped I/O is usually faster than direct memory access interfacing.
_______ 9. Control hazards arise when operands of the next instruction depend on the previous
one.
Part II.
2. Explain the differences between using interrupts and polling in memory-mapped I/O.
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
3. What are the advantages and disadvantages of pipelined processors over single-cycle and
multi-cycle processors?
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
5. Explain why the miss rate, for any cache associativity, will never reach 0%.
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
Part III. Problem Solving
Problem 1. A single-cycle datapath was divided into 5 stages using intermediate registers. The
resulting worst-case delays for each stage are indicated in the table below. Given an instruction
mix of 5 branch, 20 load, 35 store and 40 arithmetic instructions, determine:
IF Stage 10 ns
ID Stage 5 ns
EX Stage 20 ns
MEM Stage 15 ns
WB Stage 10 ns
a) The maximum clock frequency of the original single-cycle datapath before it was divided
into stages.
c) The total execution time of the 5-stage pipelined processor. Assume that no hazards will
occur during the execution.
d) Compare the performance of the multi-cycle datapath against the single-cycle datapath.
e) Compare the performance of the 5-stage pipelined datapath against the single-cycle
datapath.
Problem 3. Consider a 16-way set-associative cache. Data words are 64 bits long and are
half-word addressable. The cache holds 2 MB data. Each block holds 16 data words. Physical
addresses are 64 bits long. Determine the number of bits for tag, index, and offset.
Part IV. Cache
Given an 8-word, byte-addressable cache in different associativity organizations, indicate for
each of the following memory reads whether it hits or misses:
Address fields: Tag | Index | 00 (byte offset)

Address Accessed | Hit/Miss?
                 | Direct-mapped (LRU) | 2-way (LRU) | 4-way (FIFO) | Full associativity (FIFO)
0x10
0xA0
0x30
0xA8
0x30
0xA0
0xB8
0x10
0xA0
0xA8
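The hit/miss behavior asked for above can be checked with a small simulator. This is a minimal sketch, assuming 4-byte words and one-word blocks (the "00" in the address breakdown suggests 2 byte-offset bits); `simulate` and its parameters are illustrative names, not part of the original exam.

```python
from collections import OrderedDict

def simulate(trace, num_blocks=8, ways=1, policy="LRU", block_bytes=4):
    """Return 'hit'/'miss' for each address, for an 8-block cache sketch."""
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]   # tags kept in age order
    results = []
    for addr in trace:
        block = addr // block_bytes          # strip the byte offset
        s = sets[block % num_sets]           # index bits select the set
        tag = block // num_sets              # remaining bits are the tag
        if tag in s:
            results.append("hit")
            if policy == "LRU":
                s.move_to_end(tag)           # refresh age on a hit (FIFO does not)
        else:
            results.append("miss")
            if len(s) == ways:
                s.popitem(last=False)        # evict oldest entry (LRU or first-in)
            s[tag] = None
    return results

trace = [0x10, 0xA0, 0x30, 0xA8, 0x30, 0xA0, 0xB8, 0x10, 0xA0, 0xA8]
print(simulate(trace, ways=1))               # direct-mapped
print(simulate(trace, ways=2, policy="LRU")) # 2-way set associative, LRU
print(simulate(trace, ways=4, policy="FIFO"))# 4-way set associative, FIFO
print(simulate(trace, ways=8, policy="FIFO"))# fully associative, FIFO
```

Changing `ways` and `policy` fills in the different columns of the table.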
Answers:
Part I. Modified True or False:
1. True.
2. Volatile. Primary/Main memories are volatile (lose data when off) while secondary
memories (HDDs, SSDs, SD cards, etc.) are non-volatile.
3. True.
4. Less. R-type instructions take 4 cycles to finish (Inst Fetch -> Operand Fetch -> Execute
-> Write Back) while Load instructions take 5 cycles to finish (Inst Fetch -> Operand
Fetch -> Addr Generate -> Memory Access -> Write Back).
5. Does not necessarily have higher performance (or any similar answer). If Processors A
and B are both single-cycle processors, the statement can indeed be true, but since this
was not stated, it cannot be assumed. If Processor A is a single-cycle processor with a
clock frequency of 1 GHz while Processor B is a multi-cycle processor with a clock
frequency of 4 GHz, their performance will still depend on their individual CPIs and the
instructions they execute. (See Part III, Problem 1.d.)
6. 0. In a fully associative cache the index bits disappear, leaving only the tag (and offset)
bits. Think of a parking-lot analogy: there are 100 cars but only 10 parking spaces. In a
fully associative cache, any car may park in any free space. In an N-way set-associative
cache, you assign 2 parking spaces to every 20 cars (for example, 2 spaces for cars whose
plate numbers end in 0 or 1, and so on).
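The index-bit count for each organization can also be computed directly. A quick sketch; the helper name `index_bits` is illustrative, and the 1-byte block size is an assumption since item 6 does not state one:

```python
import math

def index_bits(cache_bytes, block_bytes, ways):
    """Number of index (set-selection) bits.

    ways = 1          -> direct-mapped
    ways = num_blocks -> fully associative (0 index bits)
    """
    num_blocks = cache_bytes // block_bytes
    num_sets = num_blocks // ways
    return int(math.log2(num_sets))

# The 8-byte cache from item 6, assuming 1-byte blocks:
print(index_bits(8, 1, 1))  # direct-mapped: 3 index bits
print(index_bits(8, 1, 8))  # fully associative: 0 index bits
```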
In real life, the mouse pointer runs on interrupts. Imagine if it were polled continuously
every time the user tried to move it: that would be inefficient. Polling, however, is
easier to implement in software; a good example is a simple while loop that runs until a
certain condition is met.
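The contrast above can be sketched in software. In this hedged analogy a `queue.Queue` stands in for a device's data register; the function names are illustrative, not a real driver API:

```python
import queue
import threading
import time

events = queue.Queue()   # stands in for the device's data register

# Polling: the CPU repeatedly checks for data, wasting cycles between checks.
def poll(timeout_s=1.0):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:   # loop until the condition is met
        try:
            return events.get_nowait()
        except queue.Empty:
            time.sleep(0.01)             # every empty check is wasted work
    return None

# Interrupt-style: the consumer blocks and is woken only when data arrives,
# much like a device raising an interrupt to the CPU.
def wait_for_interrupt():
    return events.get()                  # no busy-checking at all

threading.Thread(target=lambda: (time.sleep(0.05), events.put("mouse moved"))).start()
print(wait_for_interrupt())              # prints "mouse moved"
```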
3. Pipeline vs single-cycle and multi-cycle:
The simplest is the single-cycle processor. In terms of cycles per instruction (CPI), it
needs only one cycle to finish one instruction. You can already perform basic
computations with a single-cycle processor; however, a CPI of 1 does not guarantee that it
is the fastest. Remember non-idealities: every path the data takes through the circuit has
a certain delay, and the path with the largest delay is called the critical path. In any
processor, the clock period cannot be shorter than the delay of the critical path, or the
output will be erroneous. This means that single-cycle processors have a CPI of 1 but also
a slower clock than the other types of processors.
To enable the use of faster clocks, a multi-cycle processor can be implemented. This is
done by dividing the datapath into small chunks using flip-flops: instead of a single
large path with a large delay, the path is broken into smaller pieces, so the critical
path is shorter than that of a single-cycle processor. Again, a very high clock speed does
not guarantee higher processing speed than other processors; high processing speed
requires a balance between CPI and clock frequency. In a multi-cycle processor the clock
frequency is high, but the CPI is no longer just 1 (in the lectures, it is typical for
each instruction class to have a different CPI) and will vary depending on the conventions
followed.
Lastly, we have the pipelined processor. It is much like the multi-cycle processor in that
it also divides the datapath into smaller stages. The difference is that a multi-cycle
processor waits until one instruction finishes before starting the next, whereas a
pipelined processor does not: after the first clock cycle, the next instruction starts
immediately. This allows more instructions to finish in a shorter amount of time, but it
also allows hazards to arise. Pipelining does not guarantee higher performance than the
other processors; performance ultimately depends on the CPI, the clock frequency, and the
instructions to be executed.
4. The CPU accesses the TLB for address translation, and the TLB misses. The page table
entry is fetched from main memory, but the data is not in main memory (valid bit = 0; the
data is in secondary memory), so a page fault is raised to the CPU. The data is
transferred from secondary memory to main memory, and the page table is updated (valid bit
= 1). The updated page table entry is then brought into the TLB. The CPU then accesses the
cache using the translated address. The cache misses, since the data was not there before,
so the data is transferred from main memory to the cache, and finally from the cache to
the CPU.
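The walkthrough above can be modeled as a toy lookup function. This is a sketch with illustrative names, not a real OS or hardware API; the valid bit decides whether the access is a fast translation or a page fault:

```python
# valid bit = 1 means the page is in main memory; 0 means it is on disk.
page_table = {0x2: {"valid": 0, "frame": None}}   # page 0x2 starts on disk
tlb = {}                                          # cached translations
main_memory = {}                                  # frame number -> page data
disk = {0x2: "page contents"}

def translate(page):
    if page in tlb:                       # 1. TLB hit: fast path
        return tlb[page]
    entry = page_table[page]              # 2. TLB miss: walk the page table
    if not entry["valid"]:                # 3. valid bit = 0: page fault
        frame = len(main_memory)          # pick a free frame (no eviction here)
        main_memory[frame] = disk[page]   # 4. bring the page in from disk
        entry["valid"], entry["frame"] = 1, frame   # 5. update the page table
    tlb[page] = entry["frame"]            # 6. refill the TLB
    return entry["frame"]

frame = translate(0x2)                    # faults on first use, then succeeds
print(main_memory[frame])                 # prints "page contents"
```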
5. The first access to a block can never hit, because the block has not yet been brought
into the cache. This is known as a compulsory/cold-start/first-reference miss. Even an
infinite cache will always miss on these first accesses, so the miss rate will always be
greater than zero.
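Even an unbounded cache exhibits compulsory misses. A minimal sketch (the unbounded set stands in for an "infinite" cache):

```python
def access_trace(trace):
    """Simulate an unbounded cache: only first touches of a block miss."""
    cached = set()          # no capacity limit, so no capacity/conflict misses
    results = []
    for block in trace:
        results.append("hit" if block in cached else "miss")
        cached.add(block)   # block is resident from now on
    return results

# First touch of each block misses even though nothing is ever evicted:
print(access_trace([0x10, 0xA0, 0x10, 0xA0]))  # ['miss', 'miss', 'hit', 'hit']
```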
a) The minimum clock period of the single-cycle datapath is the sum of all the stage delays:
10 ns + 5 ns + 20 ns + 15 ns + 10 ns = 50 ns
To get the clock frequency, simply take the inverse of the clock period:
f_clk = 1 / 50 ns = 20 MHz
c) The total execution time of the 5-stage pipelined processor. Assume that no hazards will
occur during the execution.
The clock period is the same as the multi-cycle processor's: 20 ns, the delay of the
slowest stage. The first instruction finishes after 5 cycles, and one more instruction
finishes every cycle thereafter:
ET_pipeline = (100 + 5 - 1) * 20 ns = 104 * 20 ns = 2080 ns
d) Compare the performance of the multi-cycle datapath against the single-cycle datapath.
The single-cycle execution time, using the values computed before, is:
ET_single-cycle = # instructions * T_clk = 100 * 50 ns = 5000 ns
Speedup = ET_multi-cycle / ET_single-cycle = 8300 ns / 5000 ns = 1.66
> The single-cycle datapath is 1.66 times faster than the multi-cycle datapath.
Note:
Despite the multi-cycle datapath's faster clock period (20 ns versus 50 ns), the single-cycle
datapath still has the better performance. This shows that performance also depends on the
instructions and the CPI, not just the clock frequency/period.
e) Compare the performance of the 5-stage pipelined datapath against the single-cycle
datapath.
From the previous items:
ET_single-cycle = 5000 ns
ET_pipeline = 2080 ns
Speedup = ET_single-cycle / ET_pipeline = 5000 ns / 2080 ns = 2.40
> The pipelined datapath is 2.40 times faster than the single-cycle datapath.
Problem 3. Consider a 16-way set-associative cache. Data words are 64 bits long and are
half-word addressable. The cache holds 2 MB data. Each block holds 16 data words. Physical
addresses are 64 bits long. Determine the number of bits for tag, index, and offset.
There are 16 data words per block, so 4 offset bits select the word (16 = 2^4). Because
the memory is half-word addressable, one additional offset bit is needed.
> The offset is 5 bits long.
Given that the cache holds 2 MB (2^21 bytes) of data, we can determine the number of blocks:
2^21 bytes * (8 bits / 1 byte) * (1 word / 64 bits) * (1 block / 16 words) = 2^14 blocks
Since the cache is 16-way set associative, there are 2^14 blocks / 16 blocks per set
= 2^10 sets, so the index is 10 bits. The tag takes the remaining address bits:
Tag = 64 - 10 - 5 = 49 bits
> Tag = 49 bits, index = 10 bits, offset = 5 bits.
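The field widths can be verified programmatically. A sketch following the method above; the variable names are illustrative:

```python
import math

addr_bits = 64            # physical address width (half-word granularity)
word_bits = 64
halfwords_per_word = 2    # half-word addressable memory
words_per_block = 16
cache_bytes = 2**21       # 2 MB of data
ways = 16                 # 16-way set associative

block_bytes = words_per_block * word_bits // 8                 # 128 bytes per block
offset = int(math.log2(words_per_block * halfwords_per_word))  # 5 offset bits
blocks = cache_bytes // block_bytes                            # 2**14 blocks
sets = blocks // ways                                          # 2**10 sets
index = int(math.log2(sets))                                   # 10 index bits
tag = addr_bits - index - offset                               # 49 tag bits
print(tag, index, offset)                                      # prints "49 10 5"
```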
Address fields: Tag | Index | 00 (byte offset)

Address Accessed | Hit/Miss?
                 | Direct-mapped (LRU) | 2-way (LRU) | 4-way (FIFO) | Full associativity (FIFO)