Computer Engineering
Academic Year 2006/2007
Author:
Rafael Ubal Tena
Advisors:
Julio Sahuquillo Borrás
Pedro López Rodríguez
Contents

Abstract
1 Introduction
  1.1 Out-of-Order Retirement in Monothreaded Processors
  1.2 Out-of-Order Retirement with Support for Multiple Threads
2 The Validation Buffer Microarchitecture
3 Multithreaded Validation Buffer (VB-MT) Microarchitecture
4 The Simulation Framework Multi2Sim
5 Experimental Results
  5.1 Evaluation of the VB Microarchitecture
    5.1.1 Exploring the Potential of the VB Microarchitecture
    5.1.2 Exploring the Behavior in a Modern Microprocessor
  5.2 Impact on Performance of Supporting Precise Floating-Point Exceptions
  5.3 Evaluation of the VB-MT Microarchitecture
    5.3.1 Multithreading Granularity
    5.3.2 Multithreading Scalability: Performance vs. Complexity
    5.3.3 Fetch Policies
    5.3.4 Resource Occupancy
6 Related Work
7 Conclusions
  7.1 Contributions and Future Work
  7.2 Publications Related with this Work
Abstract
Current superscalar processors commit instructions in program order by using
a reorder buffer (ROB). The ROB provides support for speculation, precise exceptions, and register reclamation. However, committing instructions in program
order may lead to significant performance degradation if a long latency operation
blocks the ROB head.
Several proposals have been published to deal with this problem. Most of them
retire instructions speculatively. However, as speculation may fail, checkpoints
are required in order to rollback the processor to a precise state, which requires
both extra hardware to manage checkpoints and the enlargement of other major
processor structures, which in turn might impact the processor cycle.
This work focuses on out-of-order commit in a nonspeculative way, thus avoiding checkpoints. To this end, we replace the ROB with a structure called Validation Buffer (VB). This structure keeps dispatched instructions only until they are known to be either nonspeculative or mispeculated, which allows early retirement. By doing so, the
performance bottleneck is largely alleviated. An aggressive register reclamation
mechanism targeted to this microarchitecture is also devised. As experimental results show, the VB structure is much more efficient than a typical ROB since, with
only 32 entries, it achieves a performance close to an in-order commit microprocessor using a 256-entry ROB.
The present work also makes an exhaustive analysis of out-of-order retirement
of instructions on multithreaded processors. Superscalar processors exploit instruction level parallelism by issuing multiple instructions per cycle. However,
issue width is usually wasted because of instruction dependencies. On the other
hand, multithreaded processors reduce this waste by providing support for the concurrent execution of instructions from multiple threads, thus exploiting both instruction and thread level parallelism. Additionally, out-of-order commit (OOC) processors overlap the execution of long latency instructions with potentially many subsequent ones, so these microarchitectures also help to reduce the issue waste.
We analyze the impact on performance of unifying the multithreading and OOC techniques, by combining the three main multithreading paradigms, fine grain (FGMT), coarse grain (CGMT), and simultaneous (SMT), with the
Validation Buffer microarchitecture, which retires instructions out of order. From
the experimental results, we conclude that: (i) an OOC-SMT processor achieves the same performance as a conventional SMT processor with half as many hardware threads; (ii) an OOC-FGMT processor outperforms a conventional SMT processor, which requires a more complex issue logic; and (iii) the use of OOC allows optimized fetch policies for SMT processors (e.g., DCRA) to do their job even better, almost completely removing the issue width waste.
Chapter 1
Introduction
1.1 Out-of-Order Retirement in Monothreaded
Processors
Current high-performance microprocessors execute instructions out-of-order to
exploit instruction level parallelism (ILP). To support speculative execution, precise exceptions, and register reclamation, a reorder buffer (ROB) structure
is used [1]. After being decoded, instructions are inserted in program order in the
ROB, where they are kept while being executed and until retired in the commit
stage. The key to support speculation and precise exceptions is that instructions
leave the ROB also in program order, that is, when they are the oldest ones in
the pipeline. Consequently, if a branch is mispredicted or an instruction raises
an exception there is a guarantee that, when the offending instruction reaches the
commit stage, all the previous instructions have already been retired and none of the subsequent ones has. Therefore, to recover from that situation, all the processor has to do is abort the subsequent instructions.
This behavior is conservative. For instance, when the ROB head is blocked by
a long latency instruction (e.g., a load that misses in the L2), subsequent instructions cannot release their ROB entries. This happens even if these instructions are independent of the long latency one and have already completed. In such a case, since the ROB has a finite size, as long as instruction decoding continues, the ROB may become full, thus stalling the processor for a significant number of cycles. Register reclamation is also handled in a conservative way because physical
registers are mapped for longer than their useful lifetime. In summary, both the
advantages and the shortcomings of the ROB come from the fact that instructions
are committed in program order.
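The in-order constraint described above can be illustrated with a toy model. The sketch below is ours, not from the thesis; names and sizes are illustrative. Entries may complete in any order, but they are released only from the head, so a single uncompleted head entry eventually fills the buffer and stalls dispatch:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of in-order commit: entries complete in any order, but an
   entry is released only when every older entry has been released. */
#define ROB_SIZE 8

static bool completed[ROB_SIZE];
static int head = 0, tail = 0, count = 0;

static bool rob_insert(void) {           /* dispatch one instruction */
    if (count == ROB_SIZE) return false; /* ROB full: front end stalls */
    completed[tail] = false;
    tail = (tail + 1) % ROB_SIZE;
    count++;
    return true;
}

static int rob_commit(void) {            /* retire from the head only */
    int retired = 0;
    while (count > 0 && completed[head]) {
        head = (head + 1) % ROB_SIZE;
        count--;
        retired++;
    }
    return retired;
}
```

Even with every entry but the head completed, `rob_commit` retires nothing until the head (the long latency instruction) completes, at which point the whole buffer drains at once.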
A naive solution to address this problem is to enlarge the ROB size to accommodate more instructions in flight. However, as ROB-based microarchitectures
serialize the release of some critical resources at the commit stage (e.g., physical
registers or store queue entries), these resources must also be enlarged. This
resizing increases the cost in terms of area and power, and it might also impact the
processor cycle [2].
To overcome this drawback, some solutions that commit instructions out of
order have been published. These proposals can be classified into two approaches
depending on whether instructions are speculatively retired or not. Some proposals falling into the first approach, like [3], allow the retirement of the instruction
obstructing the ROB head by providing a speculative value. Others, like [4] or [5],
replace the normal ROB with alternative structures to speculatively retire instructions out of order. As speculation may fail, these proposals need to provide a
mechanism to recover the processor to the correct state. To this end, the architectural state of the machine is checkpointed. Again, this implies the enlargement
of some major microprocessor structures, for instance, the register file [5] or the
load/store queue [4], because completed instructions cannot free some critical resources until their associated checkpoint is released.
Regarding the nonspeculative approach, Bell and Lipasti [6] propose to scan
a few entries of the ROB, as many as the commit width, and those instructions
satisfying certain conditions are allowed to retire. None of these conditions requires an instruction to be the oldest one in the pipeline in order to be retired. Hence,
instructions can be retired out of program order. However, in this scenario, the
ROB head may become fragmented after the commit stage, and thus the ROB must
be collapsed for the next cycle. Collapsing a large structure is costly in time and
could adversely impact the microprocessor cycle. As a consequence, this proposal
is unsuitable for large ROB sizes, which is the current trend. Moreover, the small
number of instructions scanned at the commit stage significantly constrains the
potential that this proposal could achieve.
In this work we propose the Validation Buffer (VB) microarchitecture, which
is also based on the nonspeculative approach. This microarchitecture uses a FIFO-like table structure analogous to the ROB. The aim of this structure is to provide
support for speculative execution, exceptions, and register reclamation. While in
the VB, instructions are speculatively executed. Once all the previous branches
and supported exceptions are resolved, the execution mode of the instructions
changes either to nonspeculative or mispeculated. At that point, instructions are
allowed to leave the VB. Therefore, instructions leave the VB in program order
but, unlike the ROB, they do not remain in the VB until retirement. Instead, they remain only until their execution mode is resolved, either nonspeculative or mispeculated. Consequently, instructions leave the VB at different stages of their execution: completed, issued, or just decoded and not issued. For instance, instructions following a long latency memory reference instruction could leave the VB as soon as its memory address is successfully calculated and no page fault has been raised. This work discusses how the
VB microarchitecture works, focusing on how it deals with register reclamation,
speculative execution, and exceptions.
The first main contribution (Chapter 2) of this work is the proposal of an aggressive out-of-order retirement microarchitecture without checkpointing. This
microarchitecture decouples instruction tracking for execution purposes and for
resource reclamation purposes. The proposal outperforms the existing proposal
that does not perform checkpoints, since it achieves more than twice its performance using smaller VB/ROB sizes. On the other hand, register reclamation cannot be handled as done in current microprocessors [7, 8, 9] because no ROB is
used. Therefore, we devise an aggressive register reclamation method targeted
to this architecture. Experimental results show that the VB microarchitecture increases the ILP while requiring less complexity in some major critical resources
like the register file and the load/store queue.
Out-of-order commit (OOC) processors, unlike in-order commit (IOC) ones, do not force an instruction to be the oldest one in the pipeline in order to be retired.
In this way, long latency operations (e.g., a L2 miss) do not block the ROB when
they reach the ROB head; instead, long memory latencies are overlapped with
the retirement of subsequent instructions which do not depend on the memory
operation. Thus, these architectures mainly attack the vertical waste, although
horizontal waste is indirectly also improved. In addition, we will show that OOC
processors can make better use of resources and are more performance-cost effective than IOC processors. Quite recently, out-of-order retirement has been investigated on superscalar processors and chip multiprocessors (CMPs), but, to the best
of our knowledge, no research has focused on multithreaded processors.
As the second main contribution of this work (Chapter 3), we analyze the impact of retiring instructions in an out-of-order fashion in the three main models of multithreading: FGMT, CGMT, and SMT. To this end, we selected our own proposal, the Validation Buffer microarchitecture, as the OOC base architecture,
and extended it to support the execution of multiple threads. Experimental results
provide three main conclusions:
First, a VB-based SMT processor requires in most cases half as many hardware threads as a ROB-based SMT processor to achieve similar performance. In other words, performance can be maintained in a VB-based SMT processor when reducing the number of hardware threads, thus saving all the hardware resources needed to track their status.
Second, a VB-based FGMT processor outperforms an ROB-based SMT
processor. In this case, performance can be sustained while simplifying the
issue logic, which can be translated into shorter issue delays or lower power
consumption of instruction schedulers.
Third, existing fetch policies for SMT processors provide complementary
advantages to the out-of-order retirement benefits. A high-performance
SMT design could implement both techniques if area, power consumption
and hardware constraints allow it.
The rest of this work is structured as follows. Chapter 2 gives a detailed view
of the Validation Buffer microarchitecture. Chapter 3 deals with out-of-order retirement in multithreaded processors, and explains a set of existing instruction
fetch policies that will be used for evaluation purposes. Chapter 4 describes the
simulation framework used to model all proposed techniques. Chapter 5 performs
an exhaustive evaluation of both VB and VB-MT architectures, and Chapters 6
and 7 provide citations to related works and some concluding remarks, respectively.
Chapter 2
The Validation Buffer
Microarchitecture
The commit stage is typically the last one of the microarchitecture pipeline. At
this stage, a completed instruction updates the architectural machine state, frees
the used resources and exits the ROB. The mechanism proposed in this work allows instructions to be retired early, as soon as it is known that they are nonspeculative. Notice that these instructions may not be completed. Once they are completed, they will update the machine state and free the used resources. Therefore,
instructions will exit the pipeline in an out-of-order fashion.
The necessary conditions to allow an instruction to be committed out of order
are [6]: i) the instruction is completed; ii) WAR hazards are solved (i.e., a write
to a particular register cannot be permitted to commit before all prior reads of
that architected register have completed); iii) previous branches are successfully
predicted; iv) none of the previous instructions is going to raise an exception,
and v) the instruction is not involved in memory replay traps. The first condition
is straightforwardly met by any proposal at the writeback stage. The last three
conditions are handled by the Validation Buffer (VB) structure, which replaces
the ROB and contains the instructions whose conditions are not known yet. The
second condition is fulfilled by the devised register reclamation method (see Section 2.1).
The VB deals with the speculation-related conditions (iii, iv and v) by decomposing code into fragments or epochs. The epoch boundaries are defined by instructions that may initiate a speculative execution, referred to as epoch initiators (e.g., branches or potentially exception-raising instructions). Only those
instructions whose previous epoch initiators have completed and confirmed their
prediction are allowed to modify the machine state. We refer to these instructions
as validated instructions.
Instructions reserve an entry in the VB when they are dispatched, that is, they enter the VB in program order. Epoch initiator instructions are marked as such
in the VB. When an epoch initiator detects a mispeculation, all the following instructions are cancelled. When an instruction reaches the VB head, if it is an
epoch initiator and it has not completed execution yet, it waits. When it completes, it leaves the VB and performs its machine state update, if any. Non-epoch-initiator
instructions that reach the VB head can leave it regardless of their execution state.
That is, they can be either dispatched, issued or completed. However, only non-cancelled (i.e., validated) instructions will update the machine state. On the
other hand, cancelled instructions are drained to free the resources they occupy
(see Section 2.2).
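The VB head rule just described can be sketched in a few lines. The struct and field names below are ours, not from the thesis; this is a minimal sketch of the retirement condition, not an implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the VB head rule: an epoch initiator must complete before
   leaving the VB; any other instruction may leave as soon as it reaches
   the head, whatever its execution state. Cancelled instructions leave
   too, but only to be drained, never to update machine state. */
typedef struct {
    bool epoch_initiator;
    bool completed;
    bool cancelled;
} VBEntry;

/* True if the entry at the VB head may leave this cycle. */
static bool vb_head_can_leave(const VBEntry *e) {
    if (e->epoch_initiator && !e->completed)
        return false;  /* initiator waits until its prediction resolves */
    return true;
}

/* True if the leaving instruction updates architectural state. */
static bool vb_updates_state(const VBEntry *e) {
    return !e->cancelled;  /* only validated instructions do */
}
```

Note how the two decisions are decoupled: leaving the VB is about the head entry's own epoch-initiator status, while updating machine state is about whether the instruction was validated or cancelled.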
Notice that when an instruction leaves the VB, if it is already completed, it is
not consuming execution resources in the pipeline. Thus, it is analogous to a normal retirement when using the ROB. Otherwise, unlike the ROB, the instruction
is retired from the VB but it remains in the pipeline until it is completed.
The proposed microarchitecture can support a wide range of epoch initiators. At least, the epoch initiators corresponding to the three speculation-related conditions are supported. Therefore, branches and memory reference instructions (i.e., the address calculation part) act as epoch initiators. In other words, branch speculation, memory replay traps (see Section 2.4) and exceptions related to address calculation (e.g., page faults, invalid addresses) are supported by design.
It is possible to include more instructions in the set of epoch initiators. For
instance, in order to support precise floating-point exceptions, floating-point instructions should be included in this set. As instructions are able to leave the VB
only when their epoch initiators validate their epoch, a high percentage of epoch
initiators might reduce the performance benefits of the VB. We can use user-definable flags to enable or disable support for precise exceptions. If the corresponding flag is enabled, any instruction that may generate the given type of exception will force a new epoch when it is decoded. In fact, a program could dynamically enable or disable these flags during its execution. For instance, a flag can be enabled when the compiler suspects that an arithmetic exception may be raised.
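The decode-time classification above can be sketched as follows. The enum and parameter names are illustrative, not from the thesis; branches and address calculations are initiators by design, while floating-point instructions become initiators only under the precise-FP-exception flag:

```c
#include <assert.h>
#include <stdbool.h>

/* Decode-time epoch-initiator classification sketch. */
typedef enum { OP_BRANCH, OP_MEM_ADDR, OP_FP, OP_INT_ALU } OpClass;

static bool is_epoch_initiator(OpClass op, bool fp_precise_flag) {
    switch (op) {
    case OP_BRANCH:
    case OP_MEM_ADDR:
        return true;             /* supported by design */
    case OP_FP:
        return fp_precise_flag;  /* user-definable flag */
    default:
        return false;
    }
}
```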
(Table: actions performed on the Register Status Table, e.g. RST[p].pending readers++.)
RATret contains a delayed copy of a validated RATfront. That is, it matches the RATfront table at the time the exiting (as valid) instruction was renamed. So, a simple method to implement the recovery mechanism (restoring the mapping to a precise state) is to wait until the offending instruction reaches the VB head, and then copy RATret into RATfront. Alternative implementations can be found
in [4].
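The copy-based recovery is simple enough to sketch directly. Array names and the table size below are illustrative, not from the thesis:

```c
#include <assert.h>
#include <string.h>

/* Recovery sketch: RATret holds the validated logical->physical
   mapping; when the offending instruction reaches the VB head,
   copying RATret over RATfront restores a precise mapping. */
#define NUM_LREGS 32

static int rat_front[NUM_LREGS];  /* speculative front-end mapping */
static int rat_ret[NUM_LREGS];    /* validated (delayed) mapping  */

static void rat_recover(void) {
    memcpy(rat_front, rat_ret, sizeof rat_front);
}
```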
Register Status Table Recovery. The recovery mechanism must also undo
the modifications performed by the cancelled instructions in any of the three fields of the RST.
Concerning the valid remapping field, we describe two possible techniques
to restore its values. The first technique squashes from the VB those entries corresponding to instructions younger than the offending instruction when this one
reaches the VB head. At that point, the RATret contains the physical register identifiers that we use to restore the correct mapping. The remaining physical
registers must be freed. To this end, all valid remapping entries are initially set
to 1 (necessary condition to be freed). Then, the RATret is scanned looking for
physical registers whose valid remapping entry must be reset.
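The first technique can be sketched as two passes over small arrays. Names and sizes below are ours; `true` plays the role of the "1" that marks a register as a candidate to be freed:

```c
#include <assert.h>
#include <stdbool.h>

/* Technique 1 sketch: after squashing the younger VB entries, every
   physical register is first marked freeable, and then the registers
   still referenced by RATret get their mark reset. */
#define NUM_PREGS 8
#define NUM_LREGS 4

static bool valid_remapping[NUM_PREGS];
static int rat_ret[NUM_LREGS];

static void restore_valid_remapping(void) {
    for (int p = 0; p < NUM_PREGS; p++)
        valid_remapping[p] = true;           /* candidate to be freed */
    for (int l = 0; l < NUM_LREGS; l++)
        valid_remapping[rat_ret[l]] = false; /* still mapped: keep it */
}
```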
The second technique relies on the following observation. Only the physical
registers that were allocated (i.e., mapped to a logical register) by instructions
younger than the offending one must be freed. Therefore, instead of squashing
the VB contents when the offending instruction reaches the VB head, as instructions in the VB are cancelled, they are drained. These instructions must set to
1 the valid remapping entry of their current mapping. Notice that in this case,
the valid remapping flag is used to free the registers allocated by the current
mapping, instead of the previous mapping like in normal operation. While the
cancelled instructions are being drained, new instructions can enter the renaming
stage, provided that the RATf ront has been already recovered. Therefore, the VB
draining can be overlapped with subsequent new processor operations.
Regarding the pending readers field, it cannot simply be reset, as there can
already be valid pending readers in the issue queue. Thus, each pending readers
entry must be decremented by the number of cancelled pending readers for the corresponding physical register. To this end, the issue logic must be able to detect those instructions younger than the offending instruction, that is, the cancelled pending readers. This can be implemented by using a bitmap mask in the issue queue to identify which instructions are younger than a given branch [16]. The
cancelled instructions must be drained from the issue queue to correctly handle
(i.e., decrement) their pending readers entries. Notice that this logic can also be used to handle the completed field, by enabling a cancelled instruction to set the entry of its destination physical register. Alternatively, it is also possible to simply let the cancelled instructions execute to correctly handle the pending readers
and completed fields.
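The pending-reader repair on a drain can be sketched as a simple decrement pass. The structure and names below are ours, not from the thesis:

```c
#include <assert.h>

/* Drain sketch: each cancelled instruction removed from the issue
   queue decrements the pending_readers counter of every source
   physical register it had not yet read. */
#define NUM_PREGS 8

static int pending_readers[NUM_PREGS];

static void drain_cancelled(const int *src_pregs, int nsrc) {
    for (int i = 0; i < nsrc; i++)
        pending_readers[src_pregs[i]]--;
}
```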
Chapter 3
Multithreaded Validation Buffer
(VB-MT) Microarchitecture
Superscalar processors effectively exploit instruction level parallelism of a single
thread. To this end, multiple instructions can be issued in an out-of-order fashion
in the same cycle. Nevertheless, issue ports are usually wasted because of instruction dependencies (i.e., limited available parallelism), thus adversely impacting the
performance. Two kinds of waste are distinguishable [10]: vertical waste, when
no instruction is issued, and horizontal waste, when some instruction is issued
without completely filling the issue width.
Resource utilization can be improved by providing support to the execution
of multiple threads, that is, by exploiting both instruction and thread level parallelism. There are three main multithreading models implemented in current
processors: fine grain (FGMT), coarse grain (CGMT), and simultaneous multithreading (SMT). All of them reduce vertical waste, but only SMT reduces horizontal waste [10] by issuing instructions from multiple threads in the same cycle, achieving the best performance gains. Nevertheless, this is done at the expense of adding
complexity to the issue logic, which is a critical point in current microprocessors.
Multithreaded architectures represent an important segment in the industry. For
instance, the Alpha 21464, the Intel Pentium 4 [9], the IBM Power 5 [11], the
Sun Niagara [12], and the Intel Montecito [13] are commercial microprocessors
included in this group.
On the other hand, out-of-order commit prevents long latency operations (e.g., an L2 cache miss) from blocking the ROB when they reach its head; instead, long
memory latencies are overlapped with the retirement of subsequent instructions
which do not depend on the memory operation. Thus, these architectures mainly
attack the vertical waste, although horizontal waste is indirectly also improved.
In this chapter, we analyze in depth the impact of retiring instructions in an out-of-order fashion in the three main models of multithreading: FGMT, CGMT, and SMT.
Figure 3.1: VB-MT architecture diagram with all storage resources shared among
threads
(Bar chart: IPC of the VB and ROB baselines when storage resources (IFQ, IQ, physical register file, ROB/VB, caches) are shared among threads.)
VB
portions, and ROB portions can only be shifted when the associated head and tail
pointers are properly aligned. In the latter case, instructions from different threads
are intermingled across the ROB, so non-completed instructions from one thread may prevent completed instructions from another from exiting the ROB. Moreover, the
recovery process should cancel instructions selectively, forcing interleaved gaps
to remain in the ROB until they are retired. A deeper study of the effects of the ROB/VB sharing strategies is planned as future work, so we assume private ROBs/VBs for all experiments in this work.
of the load instruction, and the prediction is given by the most significant bit of
the saturating counter. Whenever a load misses the data cache, the corresponding
counter is reset, while it is incremented when the associated load hits the cache.
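The saturating-counter predictor described above can be sketched concretely. Table size, counter width, and indexing below are illustrative assumptions, not values from the thesis:

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* Load hit/miss predictor sketch: a table of 2-bit saturating counters
   indexed by the load PC. The most significant bit of the counter gives
   the prediction; a cache miss resets the counter, a hit increments it. */
#define PRED_ENTRIES 1024
#define CTR_MAX 3                     /* 2-bit saturating counter */

static uint8_t ctr[PRED_ENTRIES];

static int pred_index(uint32_t pc) { return (pc >> 2) % PRED_ENTRIES; }

static bool predict_hit(uint32_t pc) {
    return ctr[pred_index(pc)] >= 2;  /* most significant bit set */
}

static void pred_update(uint32_t pc, bool hit) {
    uint8_t *c = &ctr[pred_index(pc)];
    if (hit) { if (*c < CTR_MAX) (*c)++; }
    else *c = 0;                      /* miss: reset the counter */
}
```

The reset-on-miss asymmetry makes the predictor conservative: one miss is enough to predict further misses until the load hits twice in a row.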
Chapter 4
The Simulation Framework
Multi2Sim
This section describes Multi2Sim [23], the simulation framework that has been
used to model the architectural designs proposed in this work. Multi2Sim integrates a model of processor cores, memory hierarchy and interconnection network in a tool that enables their evaluation. The simulator has been extended to
model the VB microarchitecture, both for monothreaded and multithreaded environments, including all instruction fetch policies cited in Chapter 3.
these actions, but an application-only tool should manage program loading during
its initialization.
Executable File Loading. The executable files output by gcc follow the ELF
(Executable and Linkable Format) specification. An ELF file is made up of a
header and a set of sections. Some Linux distributions include the library libbfd,
which provides types and functions to list the sections of an ELF file and track
their main attributes (starting address, size, flags and content). When the flags of an ELF section indicate that it is loadable, its contents are copied into memory starting at the corresponding address.
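The header check at the start of this process can be illustrated without libbfd. The sketch below parses only the magic number and the entry point of a 32-bit little-endian ELF image held in memory; the function name is ours, and real loaders (or libbfd) of course handle far more than this:

```c
#include <stdint.h>
#include <string.h>

/* Minimal ELF32 sketch: verify the magic number and extract the program
   entry point (e_entry lives at offset 24 of the ELF32 header). Bytes
   are assembled manually so the code is endianness-independent. */
static uint32_t elf32_entry_point(const uint8_t *image, int *ok)
{
    static const uint8_t magic[4] = { 0x7f, 'E', 'L', 'F' };
    if (memcmp(image, magic, 4) != 0) {
        *ok = 0;
        return 0;
    }
    *ok = 1;
    return (uint32_t)image[24]
         | ((uint32_t)image[25] << 8)
         | ((uint32_t)image[26] << 16)
         | ((uint32_t)image[27] << 24);   /* little-endian e_entry */
}
```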
Program Stack. The next step of the program loading process is to initialize
the process stack. The aim of the program stack is to store function local variables
and parameters. During the program execution, the stack pointer ($sp register) is managed by the program code itself. However, when the program starts, it expects
some data in it, namely the program arguments and environment variables, which
must be placed by the program loader.
Register File. The last step is the register file initialization. This includes the
$sp register, which has been progressively updated during the stack initialization,
and the PC and NPC registers. The initial value of the PC register is specified in
the ELF header of the executable file as the program entry point. The NPC register
is not explicitly defined in the MIPS32 architecture, but it is used internally by the
simulator to handle the branch delay slot.
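The PC/NPC pair used to model the delay slot can be sketched as follows. The struct and function are ours, not from the simulator's source; the key point is that the instruction at PC+4 (the delay slot) executes before control reaches the branch target:

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* PC/NPC sketch of the MIPS32 branch delay slot: after executing the
   instruction at pc, the next instruction is always the one at npc
   (the delay slot for a branch); the branch target only takes effect
   one instruction later. */
typedef struct { uint32_t pc, npc; } CpuState;

static void step(CpuState *s, bool taken_branch, uint32_t target) {
    uint32_t next = taken_branch ? target : s->npc + 4;
    s->pc = s->npc;      /* delay-slot instruction executes next */
    s->npc = next;
}
```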
duces user code which handles parallelism by means of the described subset of
machine instructions and system calls. However, the fact of having thread management code mingled with application code must be taken into account, as it
constitutes a certain overhead which could affect final results. Further details on
this consideration can be found in [23].
(Table: per-stage multithreading configuration of the simulated pipeline. The fetch stage can operate in timeslice or switch-on-event mode, or fetch from multiple threads per cycle with an equal or icount fetch priority; the decode, issue, and retire stages can each be timeslice, shared, or replicated, depending on the multithreading paradigm.)
a multicore processor starts with the memory hierarchy. When caches are shared
among cores, some contention can exist when they are accessed simultaneously.
In contrast, when they are private per core, a coherence protocol (e.g. MOESI
[29]) is implemented to guarantee memory consistency. Multi2Sim implements
in its current version a split-transaction bus as interconnection network, extensible
to any other topology of on-chip networks.
The number of interconnects and their location vary depending on the sharing
strategy of data and instruction caches. Figure 4.2 shows three possible schemes
of sharing L1 and L2 caches (t = private per thread, c = private per core, s =
shared), and the resulting interconnects for a dual-core dual-thread processor.
Chapter 5
Experimental Results
In this chapter, a detailed evaluation of the Validation Buffer microarchitecture is presented, split into two main sections. In the first section, the
VB architecture is evaluated in monothreaded environments, comparing it with
a baseline processor and other existing out-of-order retirement proposals. In the
second section, an exhaustive study is performed over the VB-MT architecture,
using a different baseline processor and investigating the evolution of different
performance metrics on various multithreaded scenarios.
(Figure panels: a) SpecInt benchmarks; b) SpecFP benchmarks.)
the IOC and Scan processors to match the IPC achieved by the VB proposal.
Figure 5.2 presents the IPC achieved by each benchmark for a 32-entry
ROB/VB. Results present minor differences across the integer benchmarks, thus,
hereafter, performance analysis will focus on floating-point workloads. Loads
and floating-point instructions are the main sources of IPC differences, as these
instructions could potentially block the ROB for long. To provide insight into
this fact, Table 5.2 shows the percentage of these instructions, the L1 miss rate,
and the percentage of time that the retirement of instructions from the ROB/VB
is blocked. Results demonstrate that the VB microarchitecture effectively reduces the blocking time, thus improving performance. This can be appreciated by observing that those applications showing large blocked time differences also show large IPC differences (see Figure 5.2). Of course, those applications with both a low percentage of floating-point instructions and a low cache miss rate benefit only slightly, or not at all, from our proposal.
For example, one of the highest differences and speedups is obtained by the
swim workload, which has a high percentage (i.e., 43%) of floating-point instructions as well as a relatively high miss rate (9%). In this case, a 32-entry ROB is blocked 84% of the time, while a 32-entry VB is blocked only half of that time. On
the other hand, applications having a small percentage of floating-point instructions in the executed interval and small miss rates (i.e., mesa, equake, and fma3d)
Figure 5.2: IPC for SpecInt and SpecFP benchmarks assuming a 32-entry
ROB/VB.
(Table 5.2: percentage of floating-point and load instructions, and L1 miss rate, for each benchmark.)
Regarding the VB occupancy, differences are substantial, as the VB occupancy is, on average, lower than one third of the ROB occupancy. Notice that
the highest IPC benefits appear in those applications whose VB requirements are
smaller than the instruction queue ones (e.g., swim or mgrid, see Figures 5.4 and
5.5). Therefore, in these cases, the VB microarchitecture effectively speeds up the retirement of instructions from the pipeline, allowing more instructions to be decoded and increasing the ILP. On the contrary, when the VB requirements are larger
than those of the instruction queue, as happens when using a ROB (e.g., mesa or equake), the benefits are smaller, since the ROB is not acting as the main performance bottleneck. Results also show the effectiveness of the proposed register
reclamation mechanism (see Figure 5.6). Finally, the LSQ occupancy is lower
in the VB microarchitecture (see Figure 5.7). This is because in ROB-based machines a LSQ entry cannot be released until all previous instructions have been
committed. In contrast, in the VB microarchitecture a LSQ entry only needs to
wait until all the previous instructions have been validated, which is a weaker
condition.
As the proposed architecture implements both out-of-order retirement and an
aggressive register reclamation method, one might think that performance benefits may come from both sides. To isolate which part comes from the VB and
which one from the register reclamation method, we ran simulations assuming an
unbounded register file. Figure 5.8 shows the results. As observed, the register
mechanism itself only slightly affects the overall performance, thus one can conclude
that almost all the benefits come from the fact that instructions are retired out of order. In contrast, IOC and Scan improve their performance with an unbounded amount of physical registers, but even in these cases the performance of a 16-entry VB is still better.
Configuration:
  Width: 8-wide fetch, 8-wide issue, 8-wide commit
  Queues: 32-entry private IQs, 24-entry private LSQs, 32-entry private ROBs
  Functional units: 8 Int Add (2/1), 2 Int Mult (1/1), 2 Int Div (20/19), 4 Ld/St (2/1), 8 FP Add (4/2), 2 FP Mult (8/1), 2 FP Div (40/20)
  Register files: 128-entry private files
  L1 caches: 32KB, 2-way, 64-byte line, per-thread private, 1- and 2-cycle hit times
  L2 cache: 1MB, 8-way, 64-byte line, shared among threads, 10-cycle hit time
  BTB: 1024-entry, 2-way
  Branch predictor: McFarling, private per thread, 4K-entry gShare, 4K-entry bimodal
  Memory latency: 200 cycles
  16K entries, 4-way, shared among threads
Mix Name   Benchmarks
Mix 0      wupwise, eon
Mix 1      apsi, eon, fma3d, gcc
Mix 2      art, gzip
Mix 3      art, gzip, wupwise, twolf
Mix 4      applu, ammp
Mix 5      applu, ammp, art, mcf
Mix 6      gcc, gzip
Mix 7      wupwise, mgrid
[Figure: bar chart; Y-axis: IPC; X-axis: benchmark mixes 0-7 plus the average (H.Mean); bars: ROB-CGMT, ROB-FGMT, ROB-SMT, VB-CGMT, VB-FGMT, VB-SMT.]
Figure 5.10: Performance for different multithread designs in the ROB/VB architectures for different benchmark mixes.
The results corresponding to this experiment are shown in Figure 5.10. The last group of bars represents the average values for each design. Comparing the behaviour of the VB-MT architecture in its different variants with the ROB-SMT processor, we can observe that VB-CGMT is about 5% slower than ROB-SMT, while VB-FGMT and VB-SMT outperform ROB-SMT by about 16.4% and 19.7%, respectively.
Mixes 2 and 6 show a flat behaviour both when substituting the ROB with the VB and when improving the multithreading paradigm. This corroborates the results of Section 5.1.1, which showed that specific benchmarks benefit neither from the VB nor from enlarging the ROB. This situation is caused by a lack of instruction-level parallelism, aggravated by a high L1 miss rate as well as a high branch misprediction rate. Thread-level parallelism is also affected, preventing SMT from outperforming CGMT or FGMT.
An interesting observation is the average performance improvement of VB-FGMT, a simple multithreading design, over ROB-SMT, which introduces more complex hardware in the issue stage to schedule instructions from different threads in the same cycle. The reason is that the benefit ROB-SMT obtains by filling empty issue slots with instructions from various threads is offset in VB-FGMT by the extra performance gained from the efficient management of the VB structure, which prevents the pipeline from stalling so often.
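The contrast between the two multithreading paradigms can be sketched in a toy issue-stage model (illustrative Python, not the simulated pipeline; the 4-wide issue width and thread queues are made up for the example):

```python
# Toy issue-stage model: FGMT fills all slots of a cycle from one
# thread, while SMT may mix ready instructions from several threads.

def issue_fgmt(ready, turn, width=4):
    """Fine-grain MT: all slots this cycle come from thread `turn`."""
    return ready[turn][:width]

def issue_smt(ready, width=4):
    """SMT: fill slots round-robin across all threads' ready insns."""
    slots, i = [], 0
    while len(slots) < width and any(ready):
        t = i % len(ready)
        if ready[t]:
            slots.append(ready[t].pop(0))
        i += 1
    return slots

ready = [["t0.a", "t0.b"], ["t1.a", "t1.b", "t1.c"]]
print(issue_fgmt([q[:] for q in ready], turn=0))  # ['t0.a', 't0.b']: 2 slots stay empty
print(issue_smt([q[:] for q in ready]))  # ['t0.a', 't1.a', 't0.b', 't1.b']: all 4 filled
```

The toy model makes the trade-off visible: SMT wastes fewer slots per cycle, but requires the issue logic to consider several threads at once, which is the extra hardware complexity mentioned above.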
[Figure: four throughput (IPC) vs. number of threads plots, one per replicated benchmark: gcc, mgrid, wupwise and art.]
Figure 5.11: Scalability of different multithread designs for ROB/VB architectures with a single replicated benchmark.
The other three graphs correspond to the floating-point benchmarks mgrid, wupwise and art, and show sharply contrasting results. Looking at the top-right graph (mgrid), one can observe significant effects as n increases. On the one hand, CGMT provides neither gain nor loss of performance up to 3 threads. In this case, the benefits of multithreading come from avoiding software context switches, which are not necessary in a system with n logical processors executing n software contexts. However, a further increase of n has negative effects on the global IPC, shown more clearly in the case of VB-CGMT.
The FGMT and SMT curves belonging to the ROB architecture (in all graphs) show well-known effects of multithreading. While a fine-grain design reaches better performance with ascending values of n up to a maximum of 4, an SMT design is capable of exploiting the thread-level parallelism in a more scalable manner.
The FGMT and SMT curves belonging to the VB architecture show a similar evolution when n <= 4, where SMT immediately outperforms FGMT. Nevertheless, when n > 4, they differ from the ROB curves in that the SMT scalability is no longer noticeable. The reason is that, for floating-point benchmarks, the VB reduces the probability of the processor pipeline stalling due to lack of space in the ROB, which strongly alleviates this bottleneck, so the supply of instructions rises early and sharply as n increases. Since functional unit utilization with 4 threads is already high in VB-SMT, an indiscriminate choice of instructions to enter the pipeline only results in worse instruction scheduling, and no performance improvement is achieved. In this case, sophisticated resource allocation policies such as DCRA would be necessary to maintain SMT scalability.
Finally, it is important to compare the VB-FGMT and the ROB-SMT curves (mgrid and wupwise). As the results of Section 5.3.1 already showed, VB-FGMT provides, on average, better IPC for the evaluated benchmark mixes, which aim to be representative of current multithreaded processors [36] ranging from 2 to 4 threads. The simpler VB-FGMT implementation can thus still be used, reaching higher performance up to approximately 4 threads. With a higher number of threads, ROB-SMT's scalability prevails, and a VB-SMT design is needed to keep the advantages of the VB architecture.
[Figure: bar chart; Y-axis: IPC; X-axis: benchmark mixes 0-7 plus the average; bars: ROB-RR, ROB-ICOUNT, ROB-PDG, ROB-DCRA, VB-RR, VB-ICOUNT, VB-PDG, VB-DCRA.]
Figure 5.12: Evaluation of fetch policies for the ROB and VB architectures.
implementing DCRA, which shares the instruction fetch queue, instruction queue and load-store queue among hardware threads. The reason is that DCRA not only assigns different and variable fetch slots to threads, but also obtains benefits by dynamically assigning different numbers of entries of the shared resources to threads.
On one hand, Figure 5.12 shows the pronounced advantage of a sophisticated instruction fetch policy in SMT processors. As the average values suggest, any fetch policy other than RR provides greater benefits than the replacement of a ROB by a VB. On the other hand, Figure 5.12 illustrates that the advantages of these fetch policies also apply to the VB architecture. If we compare the advanced fetch policies (the three rightmost bars of the Average group) against the naive VB-RR policy, we obtain on average 28.4%, 32.9% and 40.2% benefits for VB-ICOUNT, VB-PDG and VB-DCRA, respectively. Comparing these three policies against the ROB-DCRA policy (the best ROB-based policy), the performance speedup reaches 12%, 15.6% and 21.9%, respectively.
Although some improved fetch policy is needed in the VB microarchitecture to outperform a ROB-DCRA architecture, there is no need to implement the most effective one (VB-DCRA), which might require more complex hardware. Instead, the instruction counters added by ICOUNT are sufficient to make the VB-based approach rise above ROB-DCRA. However, we can also see that the ability to retire instructions out of order, combined with optimized fetch policies, allows the greatest performance improvement, as both techniques contribute their orthogonal potential.
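The thread-selection rule behind ICOUNT (from Tullsen et al. [21]) is simple enough to state in a few lines. The sketch below is an illustrative simplification, assuming a per-thread counter of fetched-but-not-yet-issued instructions:

```python
# ICOUNT fetch policy sketch: each cycle, fetch from the thread that
# has the fewest instructions in the pre-issue stages of the pipeline,
# so that no single (possibly stalled) thread monopolizes the front end.

def icount_pick(inflight):
    """Return the id of the thread with the fewest in-flight instructions.

    `inflight` maps thread id -> number of fetched-but-not-issued
    instructions currently attributed to that thread.
    """
    return min(inflight, key=inflight.get)

inflight = {0: 12, 1: 3, 2: 7}
print(icount_pick(inflight))  # 1: thread 1 has the fewest in-flight insns
```

This is the counter-based mechanism referred to above; DCRA and PDG additionally track resource allocation and load behaviour, which is why they need more hardware than these simple counters.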
[Figure: stacked bars; Y-axis: percentage of cycles; X-axis: fetch policy (ROB and VB variants of RR, ICOUNT, PDG and DCRA); stack segments: cycles with 8/7, 6/5, ..., 0 filled issue slots.]
Figure 5.13: Filled issue slots for different SMT architectures and fetch policies.
[Figure 5.14: probability curves for ROB-DCRA and VB-ICOUNT, plotted against the occupied fraction of (a) the instruction queue and (b) the load-store queue.]
cost-effective solution, reaching higher performance than the most complex fetch policy for ROB-SMT while implementing a simple fetch policy in VB-SMT.
In addition to the issue slots, we have investigated the occupancy of storage resources, focusing on the most efficient ROB design (ROB-DCRA) and the design with the simplest efficient fetch policy on the VB (VB-ICOUNT). Figure 5.14 shows the occupancy of the instruction queue (IQ) and the load-store queue (LSQ) for these designs. The curves in the graphs are to be interpreted as the probability (Y-axis) of a resource having an occupancy equal to or greater than a specific fraction of its entries (X-axis).
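Curves of this kind are complementary cumulative distributions; given per-cycle occupancy samples, they could be computed with a sketch like the following (the sample data is hypothetical, not taken from the simulations):

```python
# Build an occupancy curve of the kind plotted in Figure 5.14: for each
# occupancy fraction x, the probability that the queue held at least
# x of its entries during a cycle.

def occupancy_ccdf(samples, capacity, points=(0.2, 0.4, 0.6, 0.8)):
    """samples: per-cycle occupancy counts; capacity: queue size."""
    n = len(samples)
    return {x: sum(1 for s in samples if s >= x * capacity) / n
            for x in points}

# Hypothetical per-cycle occupancies of a 32-entry instruction queue.
samples = [4, 8, 8, 12, 16, 16, 20, 24, 28, 32]
print(occupancy_ccdf(samples, 32))  # {0.2: 0.9, 0.4: 0.6, 0.6: 0.4, 0.8: 0.2}
```

Reading such a curve at a given fraction directly answers sizing questions like the ones discussed next, e.g., how often more than half of the queue is actually in use.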
As one can observe, VB-ICOUNT causes a lower occupancy in both the IQ and the LSQ. In the case of the IQ (Figure 5.14a), one can appreciate that only 50% of the IQ entries are being used in ROB-DCRA, meaning that the IQ is over-dimensioned. Moreover, the unused fraction of the IQ grows to almost 70% in the case of VB-ICOUNT. Something similar happens with the LSQ (Figure 5.14b), which could be implemented 20% smaller without practically affecting performance. As a consequence, VB-ICOUNT not only outperforms ROB-DCRA with a simpler fetch policy, but also permits reducing the size of the main storage resources while maintaining the performance gains.
Chapter 6
Related Work
Long-latency operations constrain the output rate of the ROB and, thus, microprocessor performance. Several mechanisms have recently been proposed to deal with this problem [3, 5, 4]. In essence, these proposals retire instructions speculatively when a long-latency operation blocks the ROB head. They introduce specific hardware to checkpoint the architectural state at specific times and guarantee correct execution. When a misprediction occurs, the processor rolls back to the checkpoint, discarding all subsequent computation. Some of these proposals have been extended for use in multiprocessor systems [37, 38].
In [3], Kirman et al. propose the checkpointed early load retirement mechanism, which shares similarities with the proposals above. To unclog the ROB when a long-latency load blocks the ROB head, a predicted value is provided to the dependent instructions to allow them to continue. When the value of the load is fetched from memory, it is compared against the predicted one. On a misprediction, the processor must roll back to the checkpoint.
In [5], Cristal et al. propose replacing the ROB structure with a mechanism that performs checkpoints at specific instructions. This mechanism uses a CAM structure for register mapping purposes, which is also in charge of freeing physical registers. Stores must wait in the commit stage to modify the machine state until the closest previous checkpoint has committed. In addition, instructions taking a long time to issue (e.g., those dependent on a load) are moved from the instruction queue to a secondary buffer, thus freeing resources that can be used by other instructions. These instructions must be re-inserted into the instruction queue when the instruction they depend on has completed (e.g., the load data has been fetched). This problem has also been tackled by Akkary et al. in [4].
In [39], Martinez et al. propose an in-order retirement mechanism that identifies irreversible instructions in order to free resources early. Unlike the VB microarchitecture, this proposal retires instructions in order. This proposal, as well as the works discussed above, needs checkpointing to roll the processor back to a correct state.
In [6], a checkpoint-free approach is presented. However, this proposal still uses a ROB, and scans the n oldest entries of the ROB to select instructions to be retired. This constrains the proposal, making it unsuitable for large ROB sizes. In addition, resources are handled as in a typical ROB-based processor, without any focus on improving resource usage.
Finally, the performance degradation caused by ROB blocking could also be alleviated by enlarging the major microprocessor structures or by managing them efficiently [40, 41, 42].
Chapter 7
Conclusions
As a first contribution of this work, we have proposed the VB microarchitecture, which aims at retiring instructions out of order while still providing support for speculation and precise exception handling. Unlike most previous proposals, the out-of-order commit mechanism proposed in this work does not require checkpointing hardware, because out-of-order instruction retirement is correct by design.
Performance has been compared against two ROB-based proposals, one retiring instructions in order and the other out of order. Results are encouraging: with only a 32-entry validation buffer, and assuming the remaining major processor structures are unbounded, our proposal achieves performance similar to that of the other evaluated architectures with a 256-entry ROB. Moreover, when sizing the major microprocessor structures close to those implemented in a modern processor, an 8-entry VB microarchitecture outperforms the compared architectures with a 128-entry ROB.
Concerning major processor resource requirements, results show that, besides achieving better performance, resource usage does not increase in the VB microarchitecture. For instance, register file and load/store queue usage is reduced. This feature makes our proposal an interesting alternative for power-aware implementations. It is also shown that the validation buffer has a lower occupancy than the instruction queue. Therefore, in the VB microarchitecture the hardware dealing with instruction retirement is, in general, no longer the main microprocessor structure constraining performance.
The second main contribution of this work is the combination of the out-of-order retirement VB microarchitecture with different models of multithreading. This has led to the observation that both techniques contribute orthogonally to increasing processor performance. We also explored the behaviour of different thread selection policies at the fetch stage (i.e., fetch policies) on the resulting multithreaded VB architecture.
threaded Processors, The 16th International Conference on Parallel Architectures and Compilation Techniques, Brasov (Romania), September 2007.

R. Ubal, J. Sahuquillo, S. Petit and P. Lopez, A Simulation Framework to Evaluate Multicore-Multithreaded Processors, 19th International Symposium on Computer Architecture and High Performance Computing, Gramado (Brazil), October 2007.

R. Ubal, J. Sahuquillo, S. Petit, P. Lopez and J. Duato, The Impact of Out-of-Order Commit in Coarse-Grain, Fine-Grain and Simultaneous Multithreaded Architectures, to appear in the 22nd IEEE International Parallel and Distributed Processing Symposium, Miami (Florida, USA), April 2008.

An additional paper has been submitted and is currently under review:

S. Petit, J. Sahuquillo, P. Lopez, R. Ubal and J. Duato, A Complexity-Effective Out-of-Order Retirement Microarchitecture, IEEE Transactions on Computers.
Bibliography
[1] J.E. Smith and A.R. Pleszkun. Implementation of precise interrupts in pipelined processors. In Proceedings of the 12th Annual International Symposium on Computer Architecture, pages 36-44, June 1985.
[2] S. Palacharla, N.P. Jouppi, and J.E. Smith. Complexity-effective superscalar processors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, June 1997.
[3] N. Kirman, M. Kirman, M. Chaudhuri, and J. Martínez. Checkpointed early load retirement. In Proceedings of the International Symposium on High-Performance Computer Architecture, February 2005.
[4] H. Akkary, R. Rajwar, and S. T. Srinivasan. Checkpoint processing and recovery: Towards scalable large instruction window processors. In Proceedings of the 36th International Symposium on Microarchitecture, December
2003.
[5] A. Cristal, D. Ortega, J. Llosa, and M. Valero. Out-of-order commit processors. In Proceedings of the International Symposium on High-Performance Computer Architecture, February 2004.
[6] G.B. Bell and M.H. Lipasti. Deconstructing Commit. In Proceedings of the International Symposium on Performance Analysis of Systems and Software, March 2004.
[7] R. E. Kessler. The Alpha 21264 microprocessor. IEEE Micro, 19(2):24-36, March 1999.
[8] J. M. Tendler, S. Dodson, S. Fields, H. Le, and B. Sinharoy. Power4 system
microarchitecture, technical white paper. IBM Server Group, October 2001.
[9] G. Hinton, D. Sager, M. Upton, et al. The microarchitecture of the Pentium 4 processor. Intel Technology Journal, Q1 2001.
[10] D.M. Tullsen, S.J. Eggers, and H.M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. 22nd Annual International Symposium on Computer Architecture, June 1995.
[11] R. Kalla, B. Sinharoy, and J.M. Tendler. IBM Power5 Chip: a Dual-Core Multithreaded Processor. IEEE Micro, March-April 2004.
[12] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: a 32-way Multithreaded Sparc Processor. IEEE Micro, March-April 2005.
[13] C. McNairy and R. Bhatia. Montecito: a Dual-Core, Dual-Thread Itanium
Processor. IEEE Micro, March-April 2005.
[14] J.E. Smith and G. Sohi. The microarchitecture of superscalar processors. Proceedings of the IEEE, 83(12), December 1995.
[15] M. Moudgill, K. Pingali, and S. Vassiliadis. Register renaming and dynamic speculation: an alternative approach. In Proceedings of the 26th International Symposium on Microarchitecture, pages 202-213, December 1993.
[16] K.C. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro, pages 28-40, April 1996.
[17] J.P. Shen and M.H. Lipasti. Modern Processor Design. McGraw-Hill, 2005.
[18] S. E. Raasch and S. K. Reinhardt. The Impact of Resource Partitioning on
SMT Processors. 12th International Conference on Parallel Architectures
and Compilation Techniques, 2003.
[19] F. J. Cazorla, A. Ramirez, M. Valero, and E. Fernandez. Dynamically Controlled Resource Allocation in SMT Processors. In Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture, pages 171-182, 2004.
[20] J. Sharkey, D. Balkan, and D. Ponomarev. Adaptive Reorder Buffers for SMT Processors. In Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, pages 244-253, 2006.
[21] D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, and R.L. Stamm. Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor. 23rd Annual International Symposium on Computer Architecture, May 1996.
[22] A. El-Moursy and D.H. Albonesi. Front-End Policies for Improved Issue
Efficiency in SMT Processors. Proceedings of the 9th International Conference on High Performance Computer Architecture, Feb 2003.
[23] R. Ubal. Multi2Sim. [Online]. Available: www.gap.upv.es/raurte/tools/multi2sim.html.
[24] MIPS Technologies, Inc. MIPS32TM Architecture For Programmers, volume
I: Introduction to the MIPS32TM Architecture. 2001.
[25] MIPS Technologies, Inc. MIPS32TM Architecture For Programmers, volume
II: The MIPS32TM Instruction Set. 2001.
[26] D.C. Burger and T.M. Austin. The SimpleScalar tool set, version 2.0. Computer Architecture News, 25(3), 1997.
[27] M. R. Marty, B. Beckmann, L. Yen, A. R. Alameldeen, M. Xu, and K. Moore. GEMS: Multifacet's General Execution-driven Multiprocessor Simulator. International Symposium on Computer Architecture, 2006.
[28] D. R. Butenhof. Programming with POSIX Threads. Addison Wesley Professional, 1997.
[29] P. Sweazey and A.J. Smith. A Class of Compatible Cache Consistency Protocols and Their Support by the IEEE Futurebus. 13th International Symposium on Computer Architecture, pages 414-423, June 1986.
[30] Standard Performance Evaluation Corporation. http://www.spec.org/cpu2000/.
[31] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages
and Operating Systems (ASPLOS-X), October 2002.
[32] Free Software Foundation. GCC online documentation, 2006. [Online]. Available: http://www.gnu.org/software/gcc/onlinedocs/.
[33] R. Ubal, J. Sahuquillo, S. Petit, and P. Lopez. Multi2Sim: A Simulation
Framework to Evaluate Multicore-Multithreaded Processors. 19th International Symposium on Computer Architecture and High Performance Computing, October 2007.
[34] M. Pericas, A. Cristal, R. Gonzalez, D.A. Jimenez, and M. Valero. A Decoupled KILO-Instruction Processor. 12th International Conference on High Performance Computer Architecture, February 2006.
[35] G. Bell and M. Lipasti. Deconstructing commit. International Symposium
on Performance Analysis of Systems and Software, March 2004.
[36] S. Choi and D. Yeung. Learning-Based SMT Processor Resource Distribution via Hill-Climbing. 33rd International Symposium on Computer Architecture, June 2006.
[37] M. Kirman, N. Kirman, and J.F. Martínez. Cherry-MP: Correctly integrating checkpointed early resource recycling in chip multiprocessors. In Proceedings of the International Symposium on Microarchitecture, November 2005.
[38] E. Vallejo, M. Galluzzi, A. Cristal, F. Vallejo, R. Beivide, P. Stenstrom, J. E.
Smith, and M. Valero. Implementing kilo-instruction multiprocessors. In
IEEE Conference on Pervasive Services, Invited lecture, July 2005.
[39] J.F. Martínez, J. Renau, M.C. Huang, M. Prvulovic, and J. Torrellas. Cherry: checkpointed early resource recycling in out-of-order processors. In Proceedings of the 35th International Symposium on Microarchitecture, November 2002.
[40] S. E. Raasch, N. L. Binkert, and S. K. Reinhardt. A scalable instruction
queue design using dependence chains. In Proceedings of the 29th Annual
International Symposium on Computer Architecture, May 2002.
[41] R. Balasubramonian, S. Dwarkadas, and D.H. Albonesi. Reducing the complexity of the register file in dynamic superscalar processors. In Proceedings
of the 34th Int. Symp. on Microarchitecture, December 2001.
[42] I. Park, C.L. Ooi, and T.N. Vijaykumar. Reducing design complexity of the
load/store queue. In Proceedings of the 36th International Symposium on
Microarchitecture, December 2003.