
DESIGN ISSUES: SMT and CMP Architectures

Why are design issues important?

They determine the performance of each processor in a precise manner.
Limitations on issue-slot usage also determine performance.

Why Multithreading Today?

ILP is exhausted; TLP is in.
Large performance gap between memory and processor.
Too many transistors on chip.
More multithreaded applications exist today.
Multiprocessors on a single chip.
Long network latency, too.

DESIGN CHALLENGES OF SMT

What is the impact of fine-grained scheduling on single-thread performance?
A preferred-thread approach is meant to preserve single-thread performance,
but unfortunately, with a preferred thread, the processor is likely to
sacrifice some throughput.

Reason for loss of throughput?

With a preferred thread, the pipeline is less likely to have a mix of
instructions from several threads, resulting in a greater probability that
either empty issue slots or a stall will occur.

Design Challenges

Larger register file needed to hold multiple contexts.
Not affecting clock cycle time, especially in:
Instruction issue - more candidate instructions need to be considered.
Instruction completion - choosing which instructions to commit may be challenging.
Ensuring that cache and TLB conflicts generated by SMT do not degrade performance.

Observation

There are two main observations:
The potential performance overhead due to multithreading is small.
The efficiency of current superscalar processors is low, with room for
significant improvement.

An SMT processor works well if

The number of compute-intensive threads does not exceed the number of threads
supported by the SMT processor.
Threads have highly different characteristics.
For example, one thread doing mostly integer operations and another doing
mostly floating-point operations.

It does not work well if

Threads try to utilize the same functional units.
Assignment problems arise.
E.g., in a dual-core system where each core runs 2 threads simultaneously,
2 compute-intensive application processes might end up on the same core
instead of on different cores.

The problem here is that the operating system does not see the difference
between SMT logical processors and real (physical) processors!
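
As a rough illustration of this assignment problem (not from the slides): on
Linux the scheduler can be steered by pinning compute-intensive processes to
logical CPUs that belong to different physical cores. The sketch below is a
minimal Python example; the CPU numbering (logical CPUs 0 and 2 sitting on
different physical cores) is an assumption about the machine's topology, and
os.sched_setaffinity is Linux-only.

    # Hypothetical sketch: keep two compute-bound workers on different
    # physical cores so they do not compete for one SMT core's functional
    # units. Check /proc/cpuinfo or lscpu for the real sibling layout.
    import os
    import multiprocessing as mp

    def busy_work(n):
        # purely compute-intensive integer loop
        total = 0
        for i in range(n):
            total += i * i
        return total

    def pinned_worker(cpus, n):
        os.sched_setaffinity(0, cpus)  # restrict this process to the given logical CPUs
        busy_work(n)

    if __name__ == "__main__":
        # assumption: logical CPU 0 and logical CPU 2 are on different physical cores
        p1 = mp.Process(target=pinned_worker, args=({0}, 10_000_000))
        p2 = mp.Process(target=pinned_worker, args=({2}, 10_000_000))
        p1.start(); p2.start()
        p1.join(); p2.join()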

Transient Faults

Faults that persist for a short duration.
Cause: cosmic rays (e.g., neutrons).
Effect: knock off electrons, discharging a capacitor.
Solution: there is no practical absorbent for cosmic rays.
Estimated fault rate: 1 fault per 1000 computers per year.

The future is worse:
smaller feature sizes, reduced voltage, higher transistor counts, reduced
noise margins.

Processor Utilization vs. Latency

R = the run length to a long-latency event
L = the amount of latency
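
The slide only defines R and L; as a hedged sketch (assuming the usual model
in which a thread computes for R cycles, then stalls for L cycles, and the
processor can switch to other ready threads during the stall):

    # Illustrative sketch of the standard utilization model, not from the slide.
    # One thread alone: utilization = R / (R + L).
    # With N threads to interleave, utilization saturates at 1 once
    # N >= (R + L) / R, i.e. the stall is fully hidden.
    def utilization(R, L, n_threads=1):
        return min(1.0, n_threads * R / (R + L))

    print(utilization(R=10, L=90))                # 0.1 -> one thread, mostly stalled
    print(utilization(R=10, L=90, n_threads=10))  # 1.0 -> latency fully hidden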

Fault Detection via SMT


[Figure: two redundant copies of a thread; inputs are replicated to both
copies (input replication) and their outputs are compared (output comparison)]

Memory covered by ECC
RAID array covered by parity
ServerNet covered by CRC
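
A hedged, software-level sketch of the input-replication idea (the real
mechanism in an SRT-style design is hardware; the class and names below are
illustrative): the leading copy's load values are recorded and replayed to
the trailing copy, so both copies see identical inputs even if memory changes
in between.

    # Illustrative sketch only: input replication via a load value queue.
    from collections import deque

    class LoadValueQueue:
        def __init__(self):
            self.q = deque()

        def leading_load(self, memory, addr):
            value = memory[addr]          # the one real memory access
            self.q.append((addr, value))  # replicate the input for the trailing copy
            return value

        def trailing_load(self, addr):
            rec_addr, value = self.q.popleft()
            assert rec_addr == addr       # both copies must follow the same path
            return value

    memory = {0x100: 42}
    lvq = LoadValueQueue()
    a = lvq.leading_load(memory, 0x100)
    memory[0x100] = 99                    # an intervening update must not be seen
    b = lvq.trailing_load(0x100)
    print(a == b)                         # True: both copies saw the same input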

Simultaneous Multithreading (SMT)


[Figure: instructions from thread1 and thread2 flow through a shared
instruction scheduler to shared functional units]

Simultaneous & Redundantly Threaded Processor (SRT)

SRT = SMT + Fault Detection


+ Less hardware compared to replicated
microprocessors
SMT needs ~5% more hardware over uniprocessor
SRT adds very little hardware overhead to existing SMT

+ Better performance than complete replication


better use of resources

+ Lower cost
avoids complete replication

SRT Design Challenges

Lockstepping doesn't work:
SMT may issue the same instruction from redundant threads in different cycles.

Must carefully fetch/schedule instructions from redundant threads:
branch mispredictions
cache misses

Transient Fault Detection in CMPs

CRT borrows the detection scheme from the SMT-based Simultaneously and
Redundantly Threaded (SRT) processors and applies the scheme to CMPs:
execution is replicated as two communicating threads (leading & trailing
threads), and the results of the two are compared.

CRT executes the leading and trailing threads on different processors to
achieve load balancing and to reduce the probability of a fault corrupting
both threads.

Transient Fault Detection in CMPs

Detection is based on replication, but to what extent?
CRT replicates register values (in the register file of each core),
but not memory values.

CRT's leading thread commits stores only after checking, so that memory is
guaranteed to be correct.

CRT compares only stores and uncached loads, but not register values, of the
two threads.
Because an incorrect value caused by a fault propagates through computations
and is eventually consumed by a store, checking only stores suffices for
detection; other instructions commit without checking.

CRT uses a store buffer (StB) in which the leading thread places its committed
store values and addresses. The store values and addresses of the trailing
thread are compared against the StB entries to determine whether a fault has
occurred. (Only checked stores reach the cache hierarchy.)
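
A hedged sketch of the store-checking idea just described (the StB is a
hardware structure; this is only a software analogy with illustrative names):
the leading thread deposits committed store address/value pairs, the trailing
thread's stores are compared against them, and only a checked store is
released to the cache hierarchy.

    # Illustrative sketch of CRT-style store checking via a store buffer (StB).
    from collections import deque

    class StoreBuffer:
        def __init__(self, cache):
            self.entries = deque()
            self.cache = cache

        def leading_store(self, addr, value):
            # leading thread commits its store into the StB, not into the cache
            self.entries.append((addr, value))

        def trailing_store(self, addr, value):
            # trailing thread's store is checked against the oldest StB entry
            lead_addr, lead_value = self.entries.popleft()
            if (addr, value) != (lead_addr, lead_value):
                raise RuntimeError("transient fault detected")  # mismatch
            self.cache[addr] = value  # only the checked store reaches the cache

    cache = {}
    stb = StoreBuffer(cache)
    stb.leading_store(0x200, 7)
    stb.trailing_store(0x200, 7)  # values match, so the store is written back
    print(cache)                  # {512: 7}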

Transient Fault Recovery for CMPs

Unlike CRT, CRTR must not allow any trailing instruction to commit before it
is checked for faults, so that the register state of the trailing thread may
be used for recovery.
However, the leading thread in CRTR may commit register state before checking,
as in CRT.

This asymmetric commit strategy allows CRTR to employ a long slack to absorb
inter-processor latencies.
As in CRT, CRTR commits stores only after checking.
In addition to communicating branch outcomes, load addresses, load values,
store addresses, and store values like CRT, CRTR also communicates register
values.
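
A hedged sketch of why holding back trailing-thread commit enables recovery
(illustrative only; the class and field names are assumptions, not the
paper's hardware): the trailing copy's architectural registers are updated
only for checked results, so they always hold a fault-free state that the
leading copy can be restored from when a mismatch is detected.

    # Illustrative sketch of CRTR-style recovery from the trailing thread's
    # checked register state.
    class CRTRCore:
        def __init__(self):
            self.leading_regs = {}   # may commit before checking, as in CRT
            self.trailing_regs = {}  # updated only after the result is checked

        def execute(self, reg, lead_value, trail_value):
            self.leading_regs[reg] = lead_value    # leading commits unchecked
            if lead_value != trail_value:          # check before trailing commit
                self.recover()                     # mismatch: roll back
                return False
            self.trailing_regs[reg] = trail_value  # checked, safe to commit
            return True

        def recover(self):
            # restore the leading copy from the trailing copy's checked state
            self.leading_regs = dict(self.trailing_regs)

    core = CRTRCore()
    core.execute("r1", 5, 5)       # fault-free: both copies now hold r1 = 5
    ok = core.execute("r2", 9, 8)  # transient fault in the leading copy
    print(ok, core.leading_regs)   # False {'r1': 5} -> rolled back to checked state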

Performance Evaluation

Workloads:
Forwarding: IP Forward
Authentication: MD5
Encryption: 3DES
Architectures compared: SS (superscalar), FGMT (fine-grained multithreading),
CMP, SMT

These workloads have little ILP.
Need to exploit packet-level parallelism.
CMP and SMT do just that.

Systems must support some form of concurrent packet-level parallelism.
SMT and CMP are nearly equivalent, with SMT always coming out ahead.
We can see that SS and FGMT have similar performance, and CMP and SMT have
similar performance.
The latter two are scalable with increased parallelism.

Challenges with this approach

I-Cache:
Instruction bandwidth.
I-cache misses: since instructions are being fetched from many different
contexts, instruction locality is degraded and the I-cache miss rate rises.

Register file access time:
Register file access time increases because the register file has to grow
significantly to accommodate many separate contexts.
In fact, the HEP and Tera use SRAM to implement the register file, which
means longer access times.

Single-thread performance:
Single-thread performance is significantly degraded, since the processor is
forced to switch to a new thread even if no other threads are available.

Very high-bandwidth network, which is fast and wide.
Retries on load-empty or store-full.

To maximize SMT performance:
Issue slots
Functional units
Renaming registers
