
EEE 105 Samplex 2nd Exam

Topics:
Multi-cycle Datapath
Pipeline
Hazards
Memory Hierarchy
Cache
Virtual Memory
I/O

Part I. Modified True or False: Write T if true and modify the underlined word/phrase (marked _like this_) if false.

_______ 1. If the main memory size is M and the cache size is N, _M/N_ memory blocks map to
one cache block. (Direct-mapped cache)

_______ 2. Primary storage or main memory is _non-volatile_.

_______ 3. The _valid bit_ in the page table identifies whether the data is in the main memory or
in the secondary storage.

_______ 4. For a multi-cycle datapath, R-type instructions take _more_ clock cycles to finish than
Load.

_______ 5. If Processor A has a higher clock frequency than Processor B, Processor A _has
higher performance_ than Processor B.

_______ 6. There are _3_ bits allocated for the index field in a fully associative mapping for a
32-byte memory and 8-byte cache.

_______ 7. In a multi-cycle datapath, instruction fetching is done in the _second_ cycle.

_______ 8. Memory-mapped I/O is usually _faster_ than direct memory access interfacing.

_______ 9. _Control_ hazards arise when operands of the next instruction depend on the previous
one.

_______ 10. Data hazards _mostly_ occur in single-cycle datapaths.


Part II. Essay.

1. Give ways to resolve data hazards.


________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

2. Explain the differences between using interrupts and polling in memory-mapped I/O.
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

3. What are the advantages and disadvantages of pipelined processors over single-cycle and
multi-cycle processors?
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

4. Describe what happens during a TLB miss and page fault.


________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

5. Explain why the miss rate, regardless of cache associativity, will never reach 0%.
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
Part III. Problem Solving

Problem 1. A single-cycle datapath was divided into 5 stages using intermediate registers. The
resulting worst-case delays for each stage are indicated in the table below. Given an instruction
mix of 5 branch, 20 load, 35 store, and 40 arithmetic instructions, determine:

IF Stage 10ns

ID Stage 5ns

EXE Stage 20ns

MEM Stage 15ns

WB Stage 10ns

a) The minimum clock frequency of the original single-cycle datapath before it was divided
into stages.

b) The total execution time of the multi-cycle processor.

c) The total execution time of the 5-stage pipelined processor. Assume that no hazards will
occur during the execution.
d) Compare the performance of the multi-cycle datapath against the single-cycle datapath.

e) Compare the performance of the 5-stage pipelined datapath against the single-cycle
datapath.

Problem 2. Compute the average memory access time.

Given: TLB access = 1cc, Cache access = 1cc, MM access = 10cc, SM access = 100cc
a) 100% TLB Hit, 100% cache hit, 0% page fault

b) 100% TLB Hit, 80% cache hit, 0% page fault

c) 90% TLB Hit, 70% cache hit, 0% page fault


d) 80% TLB Hit, 60% cache hit, 10% page fault

Problem 3. Consider a 16-way set-associative cache. Data words are 64 bits long and are
half-word addressable. The cache holds 2 MB data. Each block holds 16 data words. Physical
addresses are 64 bits long. Determine the number of bits for tag, index, and offset.
Part IV. Cache

Given an 8-word, byte-addressable cache in different associativity organizations, describe what
happens if the following memory reads were performed:

Note that the memory address is divided and arranged as follows:

Tag | Index | 00

Address     Hit/Miss?
accessed    Direct-mapped    2-way (LRU)    4-way (FIFO)    Fully associative (FIFO)
0x10
0xA0
0x30
0xA8
0x30
0xA0
0xB8
0x10
0xA0
0xA8
Answers:
Part I. Modified True or False:
1. True.
2. Volatile. Primary/main memories are volatile (they lose data when powered off) while
secondary memories (HDDs, SSDs, SD cards, etc.) are non-volatile.
3. True.
4. Less. R-type instructions take 4 cycles to finish (Inst Fetch -> Operand Fetch -> Execute
-> Write Back) while Load instructions take 5 cycles to finish (Inst Fetch -> Operand
Fetch -> Addr Generate -> Memory Access -> Write Back).
5. Does not necessarily have higher performance, or has a lower clock period (or any
similar answer). If Processors A and B are both single-cycle processors, the statement can
indeed be true, but since that was not stated, it cannot be assumed. If Processor A is a
single-cycle processor with a clock frequency of 1 GHz while Processor B is a multi-cycle
processor with a clock frequency of 4 GHz, their relative performance will still depend on
their individual CPIs and the instructions they execute. (See Part III, Problem 1.d.)
6. 0. In a fully associative cache, the index bits disappear, leaving only the tag bits. Think
of a parking-lot analogy: suppose a lot has only 10 parking spaces for 100 cars.

In a direct-mapped cache, you assign 1 parking space to every 10 cars
(example: cars with plate numbers ending in 0 must park in space 1, and so on).

In an N-way set-associative cache (say, 2-way), you assign 2 parking spaces to every 20 cars
(example: cars whose plate numbers end in 0 or 1 share two assigned spaces).

In a fully associative cache, it is first come, first served: if a slot is available,
any car can take it (example: typical parking).
7. First. In the multi-cycle datapath, instruction fetch is done in the first cycle (see the
cycle breakdown in item 4).
8. Slower. In memory-mapped I/O interfacing, the CPU directly handles I/O tasks. This
makes it slower because the CPU has to wait for slower I/O interfaces. In direct memory
access interfacing, simple processors/state machines are built to handle I/O tasks and run
in parallel with the main processor, so the main processor does not have to waste time
handling I/O tasks.
9. Data. Control hazards arise from the need to make a decision based on the results of a
previous instruction (e.g., branch instructions).
10. Never. Data hazards, and hazards in general, only occur in pipelined processors.
Part II. Essay.
1. Software approach: arrange the instructions so that data hazards will not occur, or insert
NOPs between instructions that would cause data hazards. Hardware approach: forward
(bypass) results between pipeline stages as soon as they are available, or stall the
pipeline until the hazard is resolved. (A NOP-insertion sketch appears after item 5.)
2. Interrupts vs. polling: We can explain polling and interrupts in terms of "readiness".
Suppose you are an operating system running multiple tasks/processes. With polling, you
continuously ask a process whether it is done (ready). With interrupts, you simply wait
for the process to send a signal telling you that it is done (ready). From this example it
is clear that polling requires the OS to stop everything else it is doing just to check on
a certain process, whereas interrupts let the OS run normally until the process sends a
signal alerting it that the process is ready.

As a real-life example, the mouse pointer runs on interrupts. Imagine polling it
continuously to detect every movement the user makes; that would be inefficient. Polling,
however, is easier to implement in software: a good example is a simple while loop running
until a certain condition is met. (A toy polling-vs-interrupt sketch appears after item 5.)
3. Pipelined vs. single-cycle and multi-cycle:

The simplest is the single-cycle processor. In terms of cycles per instruction (CPI), it
needs only one cycle to finish one instruction. You can already perform basic
computations using a single-cycle processor; however, having a CPI of 1 does not
guarantee that it is the fastest. Remember non-idealities: every path that the data has to
go through in the circuit has a certain delay, and the path with the largest delay is
called the critical path. In any processor, the clock period cannot be shorter than the
delay of the critical path, or you will get erroneous output. This means that single-cycle
processors have a CPI of 1 but also a slower clock compared to the other types of
processors.

To enable the use of faster clocks, multi-cycle processors can be implemented. This is
done by dividing the processor into small chunks using flip-flops. Instead of a single
large path with a large delay, the path is broken down into smaller chunks, so the
critical path is shorter than the critical path of a single-cycle processor. Again, having
a very high clock speed does not guarantee higher processing speed compared to other
processors; high processing speed requires a balance between CPI and clock frequency.
In a multi-cycle processor, the clock frequency is high but the CPI is no longer just 1
(in the lectures, each instruction type typically has its own CPI), and it will vary
depending on the conventions followed.

Lastly, we have the pipelined processor. It is much like the multi-cycle processor in that
it also separates the processor into smaller stages. The difference: a multi-cycle
processor waits until one instruction is finished before it starts the next, while a
pipelined processor does not wait for the current instruction to finish. After the first
clock cycle, the next instruction immediately starts. This allows more instructions to
finish in a shorter amount of time, but it also allows hazards to arise. Pipelining does
not guarantee higher performance than the other processors; performance ultimately
depends on the CPI, the clock frequency, and the instructions to be executed. (See
Part III, Problem 1 for a worked comparison.)
4. The CPU accesses the TLB for address translation, and the TLB misses. The TLB gets the
page table entry from the main memory, but the data is not there (valid bit = 0; the data
is in secondary memory). A page fault is raised to the CPU. The data is transferred from
the secondary memory to the main memory, and the main memory then updates the page
table (valid bit = 1). The updated page table entry is transferred to the TLB. The TLB is
then used to get the data from the cache using the translated address. The cache will
miss, since the data was not there before. The data is transferred to the cache from the
main memory, and finally from the cache to the CPU. (A code sketch of this flow appears
after item 5.)
5. The first access to a block always misses because the block is not yet in the cache. This
is what is known as a compulsory/cold-start/first-reference miss. Even if the cache were
infinite, it would still miss at the beginning, ensuring that the miss rate will always be
greater than zero.
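
Sketch for essay answer 1 (not part of the original key): a minimal Python illustration of the
NOP-insertion approach. The tuple instruction format and the two-NOP window (a classic 5-stage
pipeline with no forwarding, where the register file writes in the first half-cycle and reads in
the second) are illustrative assumptions.

```python
# A minimal sketch (not from the original key) of the NOP-insertion fix.
# Assumptions: a 5-stage pipeline with no forwarding, where a dependent
# instruction issued immediately after the producer needs two NOPs.
NOP = ("nop", None, ())

def insert_nops(program, window=2):
    """program: list of (opcode, dest_reg, (src_regs, ...)) tuples.
    Insert NOPs so that no instruction reads a register written by one
    of the previous `window` issued slots (a RAW data hazard)."""
    out = []
    for instr in program:
        _, _, srcs = instr
        needed = 0
        # Look back at the last `window` issued slots for a producer.
        for back, prev in enumerate(reversed(out[-window:]), start=1):
            if prev[1] is not None and prev[1] in srcs:
                needed = max(needed, window - back + 1)
        out.extend([NOP] * needed)
        out.append(instr)
    return out

prog = [
    ("add", "r1", ("r2", "r3")),  # writes r1
    ("sub", "r4", ("r1", "r5")),  # reads r1 right away -> RAW hazard
]
print(insert_nops(prog))  # add, nop, nop, sub
```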
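Sketch for essay answer 2: a toy Python contrast between busy-wait polling and an
interrupt-style callback. FakeDevice and its on_ready hook are invented for illustration; real
interrupts are raised by hardware, not by a Python thread.

```python
# Toy contrast between busy-wait polling and an interrupt-style callback.
import threading, time

class FakeDevice:
    def __init__(self):
        self.ready = False
        self.on_ready = None              # "interrupt handler", if any
    def produce(self):
        time.sleep(0.1)                   # device works in the background
        self.ready = True
        if self.on_ready:
            self.on_ready()               # device signals that it is done

# Polling: the CPU spins and does nothing useful until the device is ready.
dev = FakeDevice()
threading.Thread(target=dev.produce).start()
while not dev.ready:
    pass                                  # busy-wait: wasted cycles
print("polling: data ready")

# Interrupts: register a handler, then keep doing other work.
dev2 = FakeDevice()
dev2.on_ready = lambda: print("interrupt: data ready")
threading.Thread(target=dev2.produce).start()
print("interrupt: CPU does other work in the meantime")
time.sleep(0.2)                           # keep the program alive briefly
```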
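Sketch for essay answer 4: the same TLB-miss/page-fault sequence as straight-line Python. The
dict-based TLB, page table, and cache, the page-granularity data, and the free-frame choice are
simplifying assumptions (no evictions, no protection bits).

```python
# A minimal sketch of the flow in essay answer 4 (illustrative, not a real MMU).
disk = {"swap0": "page contents"}       # secondary memory (SM)
main_memory = {}                        # frame -> page contents (MM)
page_table = {0: (0, "swap0")}          # vpage -> (valid bit, frame or SM location)
tlb = {}                                # vpage -> frame (subset of the page table)
cache = {}                              # frame -> page contents (subset of MM)

def load(vpage):
    if vpage not in tlb:                      # 1. TLB miss
        valid, loc = page_table[vpage]        # 2. walk the page table in MM
        if not valid:                         # 3. page fault (valid bit = 0)
            frame = len(main_memory)          #    pick a free frame (no eviction here)
            main_memory[frame] = disk[loc]    #    copy the page SM -> MM
            page_table[vpage] = (1, frame)    # 4. update the page table (valid bit = 1)
        tlb[vpage] = page_table[vpage][1]     # 5. refill the TLB
    frame = tlb[vpage]
    if frame not in cache:                    # 6. cache miss (first touch)
        cache[frame] = main_memory[frame]     # 7. fill the cache from MM
    return cache[frame]                       # 8. data goes to the CPU

print(load(0))  # TLB miss + page fault + cache miss
print(load(0))  # TLB hit + cache hit
```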

Part III. Problem Solving


Problem 1.
a) The minimum clock frequency of the original single-cycle datapath before it was divided
into stages.
Adding the worst-case delays, we get the minimum clock period:

10 ns + 5 ns + 20 ns + 15 ns + 10 ns = 50 ns

To get the clock frequency, simply take the inverse of the clock period:

f_clk = 1 / T_clk = 1 / 50 ns = 20 MHz

b) The total execution time of the multi-cycle processor.


The clock period will be the worst case of all the stages (EXE stage = 20 ns):

[3(5) for branch + 5(20) for load + 4(35) for store + 4(40) for arithmetic] × 20 ns
= [15 + 100 + 140 + 160] × 20 ns
= 415 × 20 ns
= 8300 ns

c) The total execution time of the 5-stage pipelined processor. Assume that no hazards will
occur during the execution.
The clock period is the same as the multi-cycle processor's: 20 ns

Pipeline execution time formula:

[# instructions + # pipeline stages - 1] × T_clk

= [(5 + 20 + 35 + 40) + 5 - 1] × 20 ns
= [100 + 5 - 1] × 20 ns
= 104 × 20 ns
= 2080 ns

d) Compare the performance of the multi-cycle datapath against the single-cycle datapath.
The single-cycle execution time, using the values computed before, is:

ET_single-cycle = # instructions × T_clk = 100 × 50 ns = 5000 ns

Comparing the execution times of the two processors:

Speedup = ET_multi-cycle / ET_single-cycle = 8300 ns / 5000 ns = 1.66

> The single-cycle datapath is 1.66 times faster than the multi-cycle datapath.

Note:
Despite the multi-cycle datapath's shorter clock period (20 ns vs. the single-cycle datapath's
50 ns), the single-cycle datapath still has the better performance. This shows how performance
also depends on the instructions and the CPI, not just the clock frequency/period.
e) Compare the performance of the 5-stage pipelined datapath against the single-cycle
datapath.
From the previous items:

ET_single-cycle = 5000 ns
ET_pipeline = 2080 ns

Comparing the two:

Speedup = ET_single-cycle / ET_pipeline = 5000 ns / 2080 ns ≈ 2.40

> The pipelined datapath is 2.40 times faster than the single-cycle datapath.
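
As a quick cross-check (not part of the original key), here is a short Python script reproducing
(a) through (e); the stage delays, instruction mix, and per-instruction multi-cycle CPIs are
taken from the solution above.

```python
# Quick numeric check of Problem 1 (a)-(e), using the values from the key.
stage_ns = {"IF": 10, "ID": 5, "EXE": 20, "MEM": 15, "WB": 10}
mix = {"branch": 5, "load": 20, "store": 35, "arith": 40}
cpi = {"branch": 3, "load": 5, "store": 4, "arith": 4}

t_single = sum(stage_ns.values())          # 50 ns -> f = 20 MHz
t_stage = max(stage_ns.values())           # 20 ns (slowest stage)
n = sum(mix.values())                      # 100 instructions

et_single = n * t_single                                # 5000 ns
et_multi = sum(mix[k] * cpi[k] for k in mix) * t_stage  # 8300 ns
et_pipe = (n + len(stage_ns) - 1) * t_stage             # 2080 ns

print(et_single, et_multi, et_pipe)   # 5000 8300 2080
print(et_multi / et_single)           # 1.66: single-cycle beats multi-cycle
print(et_single / et_pipe)            # ~2.40: pipeline beats single-cycle
```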

Problem 2. Compute the average memory access time.

Given: TLB access = 1cc, Cache access = 1cc, MM access = 10cc, SM access = 100cc
a) 100% TLB Hit, 100% cache hit, 0% page fault

   TLB access   Cache access
   1*1cc      + 1*1cc        = 2cc

b) 100% TLB Hit, 80% cache hit, 0% page fault

   TLB access   Cache access   MM access
   1*1cc      + 1*1cc        + 0.2*10cc = 4cc

c) 90% TLB Hit, 70% cache hit, 0% page fault

   TLB access   MM access   Cache access   MM access
   1*1cc      + 0.1*10cc  + 1*1cc        + 0.3*10cc = 6cc

d) 80% TLB Hit, 60% cache hit, 10% page fault

   TLB access   MM access   SM access        Cache access   MM access
   1*1cc      + 0.2*10cc  + 0.1*0.2*100cc  + 1*1cc        + 0.4*10cc = 10cc
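
A quick Python cross-check of the four cases, assuming the same access model used above (the
page table is walked in main memory only on a TLB miss, and a page fault can occur only during
that walk):

```python
# Quick check of Problem 2 under the model used in the key.
TLB, CACHE, MM, SM = 1, 1, 10, 100   # latencies in clock cycles

def amat(tlb_hit, cache_hit, page_fault):
    tlb_miss = 1 - tlb_hit
    return (TLB                              # always probe the TLB
            + tlb_miss * MM                  # page-table walk on a TLB miss
            + tlb_miss * page_fault * SM     # page fault during the walk
            + CACHE                          # always probe the cache
            + (1 - cache_hit) * MM)          # cache miss -> main memory

for case in [(1.0, 1.0, 0.0), (1.0, 0.8, 0.0), (0.9, 0.7, 0.0), (0.8, 0.6, 0.1)]:
    print(case, "->", amat(*case), "cc")     # 2, 4, 6, 10 cc
```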

Problem 3. Consider a 16-way set-associative cache. Data words are 64 bits long and are
half-word addressable. The cache holds 2 MB data. Each block holds 16 data words. Physical
addresses are 64 bits long. Determine the number of bits for tag, index, and offset.

There are 16 data words per block, so the word-offset field needs 4 bits (16 = 2^4). Because
the data is half-word addressable, one additional offset bit is needed.
> The offset should be 5 bits long.

Given that the cache is 2 MB (2^21 bytes), we can determine the number of blocks in the cache:

2^21 bytes * (1 block / 16 words) * (1 word / 64 bits) * (8 bits / 1 byte) = 2^14 blocks

Since there are 16 blocks per set:

2^14 blocks * (1 set / 2^4 blocks) = 2^10 sets
> The index should be 10 bits long.

The remaining bits form the tag:

64 - 5 - 10 = 49
> The tag should be 49 bits long.

Answers: Offset = 5 bits, Index = 10 bits, Tag = 49 bits
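
A quick Python cross-check of the bit-field computation, following the same reasoning as above
(64-bit words, 16-word blocks, and half-word addressing adding one offset bit):

```python
# Quick check of Problem 3 using the values from the problem statement.
from math import log2

word_bits, block_words, ways, addr_bits = 64, 16, 16, 64
cache_bytes = 2 * 2**20                        # 2 MB

block_bytes = block_words * word_bits // 8     # 128 bytes per block
offset = int(log2(block_words)) + 1            # +1 bit: half-word addressable
blocks = cache_bytes // block_bytes            # 2**14 blocks
sets = blocks // ways                          # 2**10 sets
index = int(log2(sets))
tag = addr_bits - index - offset

print(tag, index, offset)                      # 49 10 5
```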

Part IV. Cache


Given an 8-word, byte-addressable cache in different associativity organizations, describe what
happens if the following memory reads were performed:

Note that the memory address is divided and arranged as follows:

Tag | Index | 00

Address     Hit/Miss?
accessed    Direct-mapped      2-way (LRU)        4-way (FIFO)       Fully associative (FIFO)
0x10        Miss (100)         Miss (00) way 1    Miss (0) way 1     Miss (slot 1)
0xA0        Miss (000)         Miss (00) way 2    Miss (0) way 2     Miss (slot 2)
0x30        Miss (100),        Miss (00) way 1,   Miss (0) way 3     Miss (slot 3)
            replaces 0x10      replaces 0x10
0xA8        Miss (010)         Miss (10) way 1    Miss (0) way 4     Miss (slot 4)
0x30        Hit (100)          Hit (00) way 1     Hit (0) way 3      Hit (slot 3)
0xA0        Hit (000)          Hit (00) way 2     Hit (0) way 2      Hit (slot 2)
0xB8        Miss (110)         Miss (10) way 2    Miss (0) way 1,    Miss (slot 5)
                                                  replaces 0x10
0x10        Miss (100),        Miss (00) way 1,   Miss (0) way 2,    Hit (slot 1)
            replaces 0x30      replaces 0x30      replaces 0xA0
0xA0        Hit (000)          Hit (00) way 2     Miss (0) way 3,    Hit (slot 2)
                                                  replaces 0x30
0xA8        Hit (010)          Hit (10) way 1     Hit (0) way 4      Hit (slot 4)

(The parenthesized binary value is the set index; "way n" or "slot n" gives the block's
position within the set.)
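
A small Python simulator (not part of the original key) that reproduces the hit/miss columns
above. The 2-bit byte offset, index = block mod number-of-sets, and the list-based LRU/FIFO
bookkeeping (oldest or least-recently-used tag kept at position 0) are the assumptions.

```python
# Reproduce the table: an 8-block cache with 4-byte blocks (2 offset bits)
# in four associativity/replacement configurations.
def simulate(addresses, num_blocks=8, ways=1, policy="LRU"):
    sets = num_blocks // ways
    cache = [[] for _ in range(sets)]     # each set: list of tags, oldest first
    results = []
    for addr in addresses:
        block = addr >> 2                 # strip the 2-bit byte offset
        idx, tag = block % sets, block // sets
        s = cache[idx]
        if tag in s:
            results.append("Hit")
            if policy == "LRU":           # LRU: refresh on every access
                s.remove(tag)
                s.append(tag)
        else:
            results.append("Miss")
            if len(s) == ways:            # set full: evict position 0
                s.pop(0)                  # FIFO: oldest; LRU: least recent
            s.append(tag)
    return results

addrs = [0x10, 0xA0, 0x30, 0xA8, 0x30, 0xA0, 0xB8, 0x10, 0xA0, 0xA8]
for label, ways, policy in [("direct-mapped", 1, "FIFO"), ("2-way LRU", 2, "LRU"),
                            ("4-way FIFO", 4, "FIFO"), ("full FIFO", 8, "FIFO")]:
    print(f"{label:14}", simulate(addrs, ways=ways, policy=policy))
```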
