
Branch Prediction: Look-Ahead Pre-fetching
[A detailed look into branch prediction leading to look-ahead prediction techniques, with software and hardware compilation and an overall analytical contrast]
By: SALEEM, Muhammad Umair [25279] under the supervision of Prof. Dr. Andreas Siggelkow
Hochschule Ravensburg-Weingarten, Department of Electrical Engineering (Master of Engineering)
Computer Architecture 4872
1.0 - Abstract:
The discussion that follows is based on observation, reading, assessment, and conclusions regarding branch-prediction styles and methodologies. For a comprehensive understanding of this writing, it is assumed that the reader is familiar with the concept of branches and with how they can be useful in decreasing the overall instruction cycle count of a given super-scalar system. The basic idea of the pre-fetching scheme is to keep track of data access patterns in a prediction table organized like an instruction cache.
In the current computing age, the concept of branch prediction and clock-cycle reduction is nothing new. In this paper we discuss the technique of pre-fetching based on branch prediction: how pre-fetching is realized in hardware, its software counterpart, and a little analytical theory to complement their usage as practical and powerful means of reducing memory access latency and improving performance.
1.1 - Introduction:
Instruction pre-fetching is an important technique for closing the gap between the speed of the microprocessor and
its memory system. As current microprocessors become ever faster, this gap continues to increase and becomes a
bottleneck, resulting in the loss of overall system performance. To close this gap, instruction prefetching speculatively brings the needed instructions close to the microprocessor ahead of time and, hence, reduces the transfer delay caused by the relatively slow memory system. If instruction prefetching can predict future instructions accurately and bring them in advance, most of the delay due to the memory system can be eliminated. Branch predictors are built into current microprocessors to reduce the stall time due to instruction fetching and, in general, can achieve prediction accuracy as high as 95% for SPEC benchmarks. Prefetching based on branch prediction (BP-based prefetching) can
achieve higher performance than a cache, by speculatively running ahead of the execution unit at a rate close to one
basic block per cycle. With the aid of advanced branch predictors and a small autonomous fetching unit, this type of
prefetching can accurately select the most likely path and fetch the instructions on the path in advance. Therefore,
most of the pre-fetches are useful and can fetch instructions before they are needed by the execution unit.
The paper follows a descriptive pattern: it gives an overview of hardware and software pre-fetch schemes, along with a comparison of different pre-fetching algorithms. The different techniques described here are compiled from academic research projects and are presented with proper mention of, and credit to, the original work. These results eventually form an outline of the preferred design schemes and implementations for the overall purpose of reducing memory access latency, and of the super-scalar pipeline structures used for pre-fetch logic.
2.0 - A guide to pre-fetch schema:
Microprocessor performance has increased at a dramatic rate over the past decade. This trend has been sustained by
continued architectural innovations and advances in microprocessor fabrication technology. In contrast, main memory
(dynamic RAM) performance has increased at a much more leisurely rate. This expanding gap between microprocessor
and DRAM performance has necessitated the use of increasingly aggressive techniques designed to reduce or hide the
large latency of memory accesses.
Chief among the latency reducing techniques is the use of cache memory hierarchies [1]. The static RAM (SRAM)
memories used in caches have managed to keep pace with processor memory request rates but continue to be too
expensive for a main store technology. Although the use of large cache hierarchies has proven to be effective in
reducing the average memory access penalty for programs that show a high degree of locality in their addressing
patterns, it is still not uncommon for data intensive programs to spend more than half their run times stalled on
memory requests [2]. The large, dense matrix operations that form the basis of many such applications typically
exhibit little locality and therefore can defeat caching strategies.
2.1 The On-Demand Fetch
This policy fetches data into the cache from main memory only after the processor has requested a word and found
it absent from the cache. The situation is illustrated in Figure (a), where computation, including memory references satisfied within the cache hierarchy, is represented by the upper time line, while main memory access time is represented by the lower time line. In this figure, the data blocks associated with memory references r1, r2, and r3 are not
found in the cache hierarchy and must therefore be fetched from main memory. Assuming the referenced data word
is needed immediately, the processor will be stalled while it waits for the corresponding cache block to be fetched.
Once the data returns from main memory it is cached and forwarded to the processor where computation may again
proceed.
Note that this fetch policy will always result in a cache miss for the first access to a cache block since only previously
accessed data are stored in the cache. Such cache misses are known as cold start or compulsory misses. Also, if the
referenced data is part of a large array operation, it is likely that the data will be replaced after its use to make room
for new array elements being streamed into the cache. When the same data block is needed later, the processor must
again bring it in from main memory incurring the full main memory access latency. This is called a capacity miss.
Many of these cache misses can be avoided if we augment the demand fetch policy of the cache with the addition
of a data pre-fetch operation. Rather than waiting for a cache miss to perform a memory fetch, data prefetching
anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference. This
pre-fetch proceeds in parallel with processor computation, allowing the memory system time to transfer the desired
data from main memory to the cache. Ideally, the pre-fetch will complete just in time for the processor to access the
needed data in the cache without stalling the processor.
2.2 The Explicit Fetch
At a minimum, this fetch specifies the address of a data word to be brought into cache space. When the fetch instruction is executed, this address is simply passed on to the memory system without forcing the processor to wait for a response. The cache responds to the fetch in a manner similar to an ordinary load instruction, with the exception that the referenced word is not forwarded to the processor after it has been cached. Figure (b) shows how pre-fetching can be used to improve the execution time of the demand fetch case given in Figure (a). Here, the latency of main memory accesses is hidden by overlapping computation with memory accesses, resulting in a reduction in overall run time. This figure represents the ideal case, when pre-fetched data arrives just as it is requested by the processor.
Figure 1: An example of explicit fetching
A less optimistic situation is depicted in Figure (c). In this figure, the pre-fetches for references r1 and r2 are issued too late to avoid processor stalls, although the data for r2 is fetched early enough to realize some benefit. Note that the data for r3 arrives early enough to hide all of the memory latency but must be held in the processor cache for some period of time before it is used by the processor. During this time, the pre-fetched data are exposed to the cache replacement policy and may be evicted from the cache before use. When this occurs, the pre-fetch is said to be useless because no performance benefit is derived from fetching the block early.
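For concreteness, the sketch below shows how such an explicit fetch can be issued from C source code, assuming a GCC- or Clang-compatible compiler that provides the __builtin_prefetch hint; the function, the loop and the look-ahead distance are illustrative assumptions and not part of the scheme described above.

    /* A minimal sketch of explicit fetching, assuming GCC/Clang's
     * __builtin_prefetch.  PREFETCH_AHEAD is an invented tuning knob. */
    #include <stddef.h>

    #define PREFETCH_AHEAD 8            /* assumed distance, in elements */

    double sum_with_explicit_fetch(const double *x, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_AHEAD < n)
                /* non-binding hint: read access, moderate locality;
                 * the processor is not stalled by this request */
                __builtin_prefetch(&x[i + PREFETCH_AHEAD], 0, 1);
            sum += x[i];                /* ideally a cache hit by now */
        }
        return sum;
    }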
2.2.1 - Hazards associated with pre-fetching:
A prematurely pre-fetched block may also displace data in the cache that is currently in use by the processor, resulting in what is known as cache pollution. This is not the same as a normal cache replacement miss: a pre-fetch that causes a miss in the cache that would not have occurred had prefetching not been in use is defined as cache pollution. If, however, a pre-fetched block displaces a cache block which is referenced after the pre-fetched block has been used, this is an ordinary replacement miss, since the resulting cache miss would have occurred with or without prefetching.
Figure 2: Overhead of fetching statements in a processor
A more subtle side effect of prefetching occurs in the memory system. Note that in Figure (a) the three memory requests occur within the first 31 time units of program startup, whereas in Figure (b) these requests are compressed into a period of 19 time units. By removing processor stall cycles, prefetching effectively increases the frequency of memory requests issued by the processor. Memory systems must be designed to match this higher bandwidth to avoid becoming saturated and nullifying the benefits of prefetching. This is particularly true for multiprocessors, where bus utilization is typically higher than in single-processor systems.
3.0 - Software pre-fetch:
Software prefetching [3] can achieve a reduction in run time despite adding instructions into the execution stream. In the figure shown here, the memory effects from Figures (a), (b), and (c) in Section 2 are ignored and only the computational components of the run time are shown. Here, it can be seen that the three pre-fetch instructions actually increase the amount of work done by the processor.
Although hardware prefetching incurs no instruction overhead, it often generates more unnecessary pre-fetches than software prefetching. Unnecessary pre-fetches are more common in hardware schemes because they speculate on future memory accesses without the benefit of compile-time information. Although unnecessary pre-fetches do not affect correct program behavior, they can result in cache pollution and will consume memory bandwidth. To be effective, data prefetching must be implemented in such a way that pre-fetches are timely, useful, and introduce little overhead.
3.1 - Software Pre-fetch Methodologies:
The pre-fetch overhead can be reduced to a minimum if we can selectively pre-fetch only those references that will be misses. Various algorithms have been suggested to deal with this, but both Chen and Baer [1] and Tullsen and Eggers [4] consider the algorithm described by Mowry and Gupta [5] to be the best of these. Using Mowry and Gupta's algorithm, once a potential cache miss has been identified, the software scheme inserts a pre-fetch instruction. If accesses have spatial or group locality in the same cache line, only the first access to the line will result in a cache miss, and only one pre-fetch instruction should be issued.
Testing for this condition, however, can be expensive, and the compiler will generally perform loop splitting and loop unrolling. One consequence of this is that the code may expand significantly. An example of this can be seen in the example provided in the figure.
However, Mowry et al. [5] report that, for more than half of the thirteen benchmarks they used (not detailed here), the instruction overhead caused less than a 15% increase in instruction count, and that in the other cases the number of instructions increased by 25% to 50%.
Since the compiler does not have complete information about the dynamic behavior of the program, it will be unable
to successfully cover all misses, and a miss covered by a pre-fetch may still stall the processor if the pre-fetch arrives
late or if it is cancelled by some other activity. Furthermore, the compiler may also insert unnecessary pre-fetches for
those variables that generate hits in the original execution [6].
It is assumed that a cache line holds two array elements (so that prefetching &X[i] also gets &X[i+1]) and that the
memory latency requires the pre-fetch to be scheduled four iterations ahead. After the original split, the loops are
unrolled by a factor of two.
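The sketch below illustrates the kind of code that could result under the assumptions just stated (two elements per cache line, pre-fetches scheduled four iterations ahead, unrolling by two). It is an illustrative reconstruction in C using the GCC/Clang __builtin_prefetch hint, not the output of Mowry and Gupta's compiler algorithm.

    /* Selective software prefetching after loop splitting and unrolling.
     * Following the text, a cache line is assumed to hold two elements of
     * x (whatever the real line size is), so only every second access
     * needs a pre-fetch, issued four iterations ahead of its use. */
    #include <stddef.h>

    void scale(double *x, size_t n, double a)
    {
        size_t i = 0;

        /* prologue: pre-fetch the lines used by the first four iterations */
        for (size_t j = 0; j < 4 && j < n; j += 2)
            __builtin_prefetch(&x[j], 1, 1);        /* write access */

        /* steady state: unrolled by two, one pre-fetch per cache line */
        for (; i + 5 < n; i += 2) {
            __builtin_prefetch(&x[i + 4], 1, 1);    /* four iterations ahead */
            x[i]     *= a;
            x[i + 1] *= a;
        }

        /* epilogue: remaining elements, no further pre-fetching needed */
        for (; i < n; i++)
            x[i] *= a;
    }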
4.0 - Description of Hardware Prefetching Schemes
4.1 - Sequential prefetching (general scheme of one-block look-ahead, OBL):
Many prefetching schemes are designed to fetch data from main memory into the processor cache in units of cache blocks. By grouping consecutive memory words into single units, caches exploit the principle of spatial locality to implicitly pre-fetch data. The degree to which large cache blocks can be effective in prefetching data is limited by the ensuing cache pollution effects (mentioned before). Sequential prefetching can take advantage of spatial locality without introducing some of the problems associated with large cache blocks. The simplest sequential prefetching schemes are variations upon the one-block look-ahead (OBL) approach, which initiates a pre-fetch for block b+1 when block b is accessed.
Figure 3: Code selective pre-fetching example
Figure 4: Three forms of sequential prefetching: a) pre-fetch on miss, b) tagged pre-fetch and c) sequential pre-fetch with K = 2.
4.1.1 Types of OBL
OBL implementations differ depending on what type of access to block b initiates the pre-fetch of b+1. Smith [7] summarizes several of these approaches, among them the pre-fetch-on-miss and tagged pre-fetch algorithms. The pre-fetch-on-miss algorithm simply initiates a pre-fetch for block b+1 whenever an access for block b results in a cache miss. If b+1 is already cached, no memory access is initiated. The tagged pre-fetch algorithm associates a tag bit with every memory block. This bit is used to detect when a block is demand-fetched or when a pre-fetched block is referenced for the first time. In either of these cases, the next sequential block is fetched.
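To make the two variants concrete, the following C sketch models them on a simple direct-mapped cache; the cache organization, its size and the helper names are assumptions made purely for illustration and are not part of Smith's original description.

    /* Sketch of one-block look-ahead (OBL) prefetching on an assumed
     * direct-mapped cache of NBLOCKS lines.  'tag' is the per-block tag
     * bit of the tagged pre-fetch scheme. */
    #include <stdbool.h>
    #include <stdint.h>

    #define NBLOCKS 256                  /* assumed cache size */

    typedef struct { bool valid, tag; uint64_t addr; } line_t;
    static line_t cache[NBLOCKS];

    static bool present(uint64_t b) {
        line_t *l = &cache[b % NBLOCKS];
        return l->valid && l->addr == b;
    }
    static void fetch(uint64_t b) {      /* demand- or pre-fetch block b */
        line_t *l = &cache[b % NBLOCKS];
        l->valid = true; l->addr = b;
        l->tag = true;                   /* block not yet referenced */
    }

    /* pre-fetch-on-miss: initiate a fetch of b+1 only when b misses */
    void access_prefetch_on_miss(uint64_t b) {
        if (!present(b)) {
            fetch(b);
            if (!present(b + 1)) fetch(b + 1);
        }
    }

    /* tagged pre-fetch: fetch b+1 on a demand miss for b, or on the
     * first reference to a block that was brought in by a pre-fetch */
    void access_tagged(uint64_t b) {
        bool first_ref = !present(b) || cache[b % NBLOCKS].tag;
        if (!present(b)) fetch(b);       /* demand fetch */
        cache[b % NBLOCKS].tag = false;  /* b has now been referenced */
        if (first_ref && !present(b + 1)) fetch(b + 1);
    }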
Smith [7] found that tagged prefetching reduced cache miss ratios in a unified (instruction and data) cache by between 50% and 90%. Pre-fetch-on-miss was less than half as effective as tagged prefetching in reducing miss ratios. The reason pre-fetch-on-miss is less effective is illustrated in the figure, where the behavior of each algorithm when accessing three contiguous blocks is shown. Here, it can be seen that a strictly sequential access pattern will result in a cache miss for every other cache block when the pre-fetch-on-miss algorithm is used, but this same access pattern results in only one cache miss when employing a tagged pre-fetch algorithm.
4.1.2 - Sequential adaptive prefetching, an improvement:
One upgrade to the pre-existing logic above was provided by Dahlgren and Stenström [8], who proposed an adaptive sequential prefetching policy that allows the value of K (the prefetching degree) to vary during program execution. To do this, a pre-fetch efficiency count is periodically calculated by the cache as an indication of the current spatial locality characteristics of the program. Pre-fetch efficiency is defined as the ratio of useful pre-fetches to total pre-fetches, where a useful pre-fetch occurs whenever a pre-fetched block results in a cache hit. The value of K is initialized to one, incremented whenever the pre-fetch efficiency exceeds a predetermined upper threshold and decremented whenever the efficiency drops below a lower threshold, as shown in the graphical figure. Note that if K is reduced to zero, prefetching is effectively disabled.
Figure 5: Adaptive stride for sequential fetching
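A minimal sketch of the degree-adjustment step, in the spirit of Dahlgren and Stenström's scheme, is given below; the threshold values, the counters and the interval at which recompute_degree() would be invoked are invented for illustration.

    /* Adaptive sequential prefetching: K is raised or lowered periodically
     * based on the measured pre-fetch efficiency.  On each reference to a
     * block b, the cache would pre-fetch blocks b+1 .. b+K. */
    static int  K = 1;                   /* current prefetching degree */
    static long prefetches_issued = 0;
    static long prefetches_useful = 0;   /* pre-fetched blocks that later hit */

    #define UPPER_THRESHOLD 0.75         /* assumed */
    #define LOWER_THRESHOLD 0.40         /* assumed */

    void recompute_degree(void)          /* called at the end of each interval */
    {
        if (prefetches_issued == 0)
            return;
        double eff = (double)prefetches_useful / (double)prefetches_issued;
        if (eff > UPPER_THRESHOLD)
            K++;                         /* high spatial locality: fetch further */
        else if (eff < LOWER_THRESHOLD && K > 0)
            K--;                         /* K == 0 disables prefetching */
        prefetches_issued = prefetches_useful = 0;   /* start a new interval */
    }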
4.2 - Prefetching with arbitrary strides:
Several techniques have been proposed which employ special logic to monitor the processor's address referencing pattern to detect constant-stride array references [1, 9, 10]. This is accomplished by comparing successive addresses used by load or store instructions. Chen and Baer's scheme [1] illustrates this design: assume a memory instruction references addresses a1, a2 and a3 during three successive loop iterations. Prefetching for this memory instruction will be initiated if (a2 - a1) = D != 0, where D is now assumed to be the stride of a series of array accesses. The first pre-fetch address will then be A3 = a2 + D, where A3 is the predicted value of the observed address a3. Prefetching continues in this way until the equality A_n = a_n no longer holds true.
Figure 6: Register-level realization of arbitrary-stride pre-fetching, from Chen et al.
Note that this approach requires the previous address used by a memory instruction to be stored along with the last detected stride, if any. Recording the reference histories of every memory instruction in the program is clearly impossible. Instead, a separate cache called the reference prediction table (RPT) holds this information for only the most recently used memory instructions. The organization of the RPT is shown in the figure above.
The first time a load instruction causes a miss, a table entry is reserved, possibly evicting the table entry for an older load instruction. The miss address is then recorded in the last-address field and the state is set to initial. The next time this instruction causes a miss, the last address is subtracted from the current miss address and the result is stored in the delta (stride) field. The last address is then updated with the new miss address. The entry is now in the training state. The third time the load instruction misses, a new delta is computed. If this delta matches the one stored in the entry, then there is a stride access pattern. The pre-fetcher then uses the delta to calculate which cache block(s) to pre-fetch.
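The C sketch below illustrates this update sequence; the entry layout, the function name and the reduced three-state handling (initial, training, steady) are simplifications assumed for illustration and do not reproduce Chen and Baer's exact state machine.

    /* Sketch of a reference prediction table (RPT) entry update. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { INITIAL, TRAINING, STEADY } rpt_state_t;

    typedef struct {
        bool        valid;
        uint64_t    pc;                  /* address of the load instruction */
        uint64_t    last_addr;           /* last data address it referenced */
        int64_t     stride;              /* last observed delta */
        rpt_state_t state;
    } rpt_entry_t;

    /* returns the address to pre-fetch, or 0 if no prediction is made */
    uint64_t rpt_update(rpt_entry_t *e, uint64_t pc, uint64_t addr)
    {
        if (!e->valid || e->pc != pc) {          /* reserve / replace entry */
            e->valid = true;  e->pc = pc;
            e->last_addr = addr;  e->stride = 0;
            e->state = INITIAL;
            return 0;
        }
        int64_t delta = (int64_t)(addr - e->last_addr);
        e->last_addr = addr;

        if (e->state == INITIAL) {               /* second access: learn stride */
            e->stride = delta;
            e->state  = TRAINING;
            return 0;
        }
        if (delta == e->stride && delta != 0) {  /* stride confirmed */
            e->state = STEADY;
            return addr + (uint64_t)e->stride;   /* predicted next address */
        }
        e->stride = delta;                       /* stride changed: retrain */
        e->state  = TRAINING;
        return 0;
    }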
4.3 - The look-ahead program counter scheme, with selective stride development
The RPT still limits the pre-fetch distance to one loop iteration. To remedy this shortcoming, a distance field may be added to the RPT which specifies the pre-fetch distance explicitly. Pre-fetched addresses would then be calculated as
effective address + (stride × distance)
Figure 7: Two-bit state machine representation
The addition of the distance field requires some method of establishing its value for a given RPT entry. To calculate
an appropriate value, Chen and Baer [1] decouple the maintenance of the RPT from its use as a pre-fetch engine.
The RPT entries are maintained under the direction of the Program Counter as described above but pre-fetches are
initiated separately by a pseudo program counter, called the look-ahead program counter (LA-PC) which is allowed
to precede the PC.
Figure 8: A realization of the LA-PC, from Chen et al.
This is basically a pseudo-program counter that runs several cycles ahead of the regular program counter (PC). The LA-PC then looks up a reference prediction table to pre-fetch data in advance. The LA-PC scheme only advances one instruction per cycle and is restricted to be, at most, a fixed number of cycles ahead of the regular PC. The studies from Chen et al. [1] focused on data prefetching rather than instruction prefetching, and did not evaluate the effects of speculative execution, multiple instruction issue, and the presence of advanced branch prediction mechanisms.
Similar data prefetching schemes can be seen in the work of Liu and Kaeli [2], as shown in the figure.
The prefetching degree is the number of cache blocks that are fetched on a single prefetching operation, while the
prefetching distance is how far ahead prefetching starts. For example, a sequential pre-fetcher with a prefetching
degree of 2, and a prefetching distance of 5, would fetch blocks X+5 and X+6 if there was a miss on block X.
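In code form the example amounts to the small sketch below, where issue_prefetch() is an assumed placeholder for handing a block address to the memory system.

    /* With distance = 5 and degree = 2, a miss on block X issues
     * pre-fetches for X+5 and X+6, as in the example above. */
    #include <stdint.h>

    extern void issue_prefetch(uint64_t block);   /* placeholder */

    void sequential_prefetch(uint64_t miss_block, int distance, int degree)
    {
        for (int i = 0; i < degree; i++)
            issue_prefetch(miss_block + (uint64_t)(distance + i));
    }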
Perez et al. [11] did a comparative survey in 2004 of many proposed prefetching heuristics and found that tagged sequential prefetching, reference prediction tables (RPT) and Program Counter/Delta Correlation prefetching (PC/DC) were the top performers.
4.4 - PC/DC Prefetching:
This approach, presented by Nesbit and Smith [12], utilizes a Global History Buffer (GHB). The structure of the GHB is shown in the figure.
Each cache miss or cache hit to a tagged (pre-fetched) cache block is inserted into the GHB in FIFO order. The index
table stores the address of the load instruction and a pointer into the GHB for the last miss issued by that instruction.
Each entry in the GHB has a similar pointer, which points to the next miss issued by the same instruction.
Figure 9: A Global History Buffer, based on FIFO ordering, for pre-fetching
PC/DC prefetching calculates the deltas between successive cache misses and stores them in a delta buffer. The history in the GHB (Figure 9) yields the address stream and the corresponding delta stream buffer shown in Figure 10. The last pair of deltas is (1, 9). By searching the delta stream (correlating), we find this same pair at the beginning. A pattern is found, and prefetching can begin. The deltas after the pair are then added to the current miss address, and pre-fetches are issued for the calculated addresses.
Figure 10: Delta stream buffer structure
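The correlation step can be sketched in C as follows, operating on one load's miss-address history after it has been extracted from the GHB by following the linked pointers; the buffer size and the forward search order are simplifying assumptions.

    /* PC/DC delta correlation over one load's miss-address history. */
    #include <stddef.h>
    #include <stdint.h>

    /* writes up to max_out pre-fetch addresses to out[]; returns the count */
    size_t pcdc_correlate(const uint64_t *miss_addr, size_t n,
                          uint64_t *out, size_t max_out)
    {
        if (n < 3) return 0;

        int64_t delta[64];                       /* delta buffer (assumed size) */
        size_t nd = (n - 1 < 64) ? n - 1 : 64;
        for (size_t i = 0; i < nd; i++)          /* deltas of the most recent misses */
            delta[i] = (int64_t)(miss_addr[n - nd + i] - miss_addr[n - nd + i - 1]);

        int64_t d1 = delta[nd - 2], d2 = delta[nd - 1];   /* most recent pair */

        for (size_t i = 0; i + 1 < nd - 1; i++) {         /* search older pairs */
            if (delta[i] == d1 && delta[i + 1] == d2) {
                /* pattern found: replay the deltas that followed the match,
                 * starting from the current miss address */
                uint64_t addr = miss_addr[n - 1];
                size_t k = 0;
                for (size_t j = i + 2; j < nd && k < max_out; j++) {
                    addr += (uint64_t)delta[j];
                    out[k++] = addr;
                }
                return k;
            }
        }
        return 0;
    }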
4.5 - Branch prediction-based prefetching
Conceptually, the instruction prefetching scheme proposed here [13] is similar to the look-ahead program counter,
yet with much more aggressive prefetching policies. The pre-fetching unit is an autonomous state machine, which
speculatively runs down the instruction stream as fast as possible and brings all the instructions encountered along
the path. When a branch is encountered, the prefetching unit predicts the likely execution path using the branch
predictor, records the prediction in a log, and continues. In the meantime, the execution unit of the microprocessor
routinely checks the log as branches are resolved and resets the program counter of the prefetching unit if an error is
found.
Figure 11: The organization of the BP-based prefetching scheme
Initially, the program counter (PC) of the prefetching unit is set equal to the PC of the execution unit. Then the prefetching unit
spends one cycle to fetch the desired cache line.
The prefetching unit examines an entire cache line as a unit, and quickly finds the first branch (either conditional or unconditional) in that cache line using existing pre-decoded information or a few bits from the opcode. During the same cycle, the prefetching unit also predicts and computes the potential target for the branch in one of three ways. First, for a subroutine return branch, its target is predicted with a return address stack, which has high prediction accuracy [14]. The prefetching unit has its own separate return address stack.
Second, for a conditional branch, the direction is predicted with a two-level branch predictor and the target address is computed with a dedicated adder in the same cycle. A dedicated adder is used instead of a branch target buffer, because the first time the branch is encountered it will not yet be recorded in the target buffer. Also note that the two-level branch predictor used in the prefetching unit has its own small branch history register but shares the same expensive pattern history table with the execution unit. The prefetching unit only speculatively updates its own branch history register, and does not update the pattern history table.
Third, an unconditional branch is always taken, and its target is calculated using the same adder used
for conditional branches. However, for an indirect branch, the prefetching unit stalls and waits for the execution unit
because this type of branch can have multiple targets.
Figure 12: Logic flow pre-directive for branch predictor fetching
The cache line pre-fetch depends on the predicted direction of a branch. When a branch is predicted to be taken, the
cache line containing its target is pre-fetched; otherwise, the prefetching unit examines the next branch in the cache
line. The prefetching unit continues to examine successive branches until the end of the current cache line is reached,
then the next sequential cache line is pre-fetched. The entire process is repeated again for the newly pre-fetched cache
line.
To verify the predictions made, the predicted outcome of each branch is recorded in a log. This log is organized as a first-in-first-out (FIFO) buffer. When the execution unit resolves a branch, the actual outcome is compared with the one predicted by the prefetching unit. If the actual outcome matches the prediction, the item is removed from the log. However, if the actual outcome differs from the one predicted, then the entire log is flushed and the PC of the prefetching unit is reset to the PC of the execution unit.
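A minimal sketch of this verification log is shown below; the log capacity and the interface are assumptions for illustration, and resetting the prefetch PC is left to the caller.

    /* FIFO log of branch predictions made by the prefetching unit. */
    #include <stdbool.h>
    #include <stdint.h>

    #define LOG_SIZE 16                        /* assumed capacity */

    typedef struct { bool taken; uint64_t target; } pred_t;

    static pred_t log_buf[LOG_SIZE];
    static int head = 0, tail = 0, count = 0;

    /* called by the prefetching unit whenever it predicts a branch */
    bool log_push(bool taken, uint64_t target)
    {
        if (count == LOG_SIZE) return false;   /* prefetcher must stall */
        log_buf[tail] = (pred_t){ taken, target };
        tail = (tail + 1) % LOG_SIZE;  count++;
        return true;
    }

    /* called by the execution unit when it resolves the oldest branch;
     * returns true if the prefetching unit must be reset to the execution PC */
    bool log_check(bool actual_taken, uint64_t actual_target)
    {
        if (count == 0) return false;
        pred_t p = log_buf[head];
        head = (head + 1) % LOG_SIZE;  count--;
        if (p.taken == actual_taken &&
            (!actual_taken || p.target == actual_target))
            return false;                      /* prediction was correct */
        head = tail = count = 0;               /* flush the entire log */
        return true;                           /* caller resets the prefetch PC */
    }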
4.6 - Delta-Correlating Prediction Tables (DCPT) [15]
A combined approach built from Reference Prediction Tables and the Delta Correlation scheme
Figure 13: Structure of a DCPT instruction fetch
In DCPT we use a large table indexed by the address (PC) of the load. Each entry has the format shown in the figure below. The last-address field works in a similar manner as in RPT prefetching. The n delta fields act as a circular buffer, holding the last n deltas observed by this load instruction, and the delta pointer points to the head of this circular buffer.
To provide further insight into the operation of this scheme, a pseudo code is presented, courtesy of the development team mentioned before [15].
In this pseudo code, one symbol is used as the assignment operator and another as the insert-into-circular-buffer operator. For ease of description and display, the delta buffer and the in-flight buffer are presented as arrays; in reality, however, they are circular and wrap around, overwriting the oldest entries, once full.
Initially, the PC is used to look up an entry in the table. In our implementation we have used a fully-associative table, but it is possible to use other organizations as well. If an entry with the corresponding PC is not found, then a replacement entry is initialized. This is shown in lines 4-8. If an entry is found, the delta between the current address and the previous address is computed. The buffer is only updated if the delta is non-zero. The new delta is inserted into the delta buffer and the last-address field is updated. Each delta is stored as an n-bit value. If the value cannot be represented with only n bits, a 0 is stored in the delta buffer as an indicator of overflow.
Figure 14:
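Since the original pseudo code is only available here as a figure, the following C sketch is an illustrative reconstruction of the entry lookup and update step described above; the table parameters (NDELTA, DELTA_BITS) and the field names are assumptions.

    /* DCPT entry and its update on a load with program counter pc
     * accessing address addr. */
    #include <stdbool.h>
    #include <stdint.h>

    #define NDELTA     16         /* deltas kept per entry (assumed) */
    #define DELTA_BITS 12         /* storage width of each delta (assumed) */

    typedef struct {
        bool     valid;
        uint64_t pc;
        uint64_t last_addr;
        uint64_t last_prefetch;
        int32_t  delta[NDELTA];   /* circular buffer of recent deltas */
        int      head;            /* delta pointer (next slot to write) */
    } dcpt_entry_t;

    void dcpt_update(dcpt_entry_t *e, uint64_t pc, uint64_t addr)
    {
        if (!e->valid || e->pc != pc) {            /* replacement entry */
            *e = (dcpt_entry_t){ .valid = true, .pc = pc, .last_addr = addr };
            return;
        }
        int64_t delta = (int64_t)(addr - e->last_addr);
        if (delta == 0)
            return;                                /* only non-zero deltas are kept */
        /* deltas that do not fit in DELTA_BITS are stored as 0 (overflow marker) */
        int64_t limit  = 1LL << (DELTA_BITS - 1);
        int32_t stored = (delta >= limit || delta < -limit) ? 0 : (int32_t)delta;
        e->delta[e->head] = stored;                /* insert into circular buffer */
        e->head = (e->head + 1) % NDELTA;
        e->last_addr = addr;
    }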
Figure 15:
Delta correlation begins after updating the entry. The pseudo code for delta correlation is shown in the algorithm below. The deltas are traversed in reverse order, looking for a match with the two most recently inserted deltas. If a match is found, the next stage begins. The first pre-fetch candidate is generated by adding the delta after the match to the value found in the last-address field. The next pre-fetch candidate is generated by adding the next delta to the previous pre-fetch candidate. This process is repeated for each of the deltas after the matched pair, including the newly inserted deltas.
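Continuing the sketch above, the delta correlation step might look as follows; again, this is an illustrative reconstruction rather than the authors' pseudo code.

    /* Delta correlation over a DCPT entry (dcpt_entry_t as sketched above).
     * Writes up to max_cand pre-fetch candidates to cand[]; returns the count. */
    #include <stddef.h>
    #include <stdint.h>

    size_t dcpt_correlate(const dcpt_entry_t *e, uint64_t *cand, size_t max_cand)
    {
        int newest = (e->head + NDELTA - 1) % NDELTA;  /* most recent delta */
        int prev   = (e->head + NDELTA - 2) % NDELTA;
        int32_t d2 = e->delta[newest], d1 = e->delta[prev];
        if (d1 == 0 || d2 == 0)
            return 0;                        /* too little history, or overflow */

        for (int back = 2; back < NDELTA - 1; back++) {   /* search backwards */
            int j = (e->head + NDELTA - 1 - back) % NDELTA;
            int i = (e->head + NDELTA - 2 - back) % NDELTA;
            if (e->delta[i] == d1 && e->delta[j] == d2) {
                /* match: replay the deltas that followed it, starting from
                 * the last address, including the newly inserted deltas */
                uint64_t addr = e->last_addr;
                size_t   k = 0;
                for (int step = 1; step <= back && k < max_cand; step++) {
                    int idx = (j + step) % NDELTA;
                    if (e->delta[idx] == 0) break;   /* stop at overflow marker */
                    addr += (uint64_t)e->delta[idx];
                    cand[k++] = addr;
                }
                return k;
            }
        }
        return 0;
    }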
Figure 16:
The next step in the DCPT flow is pre-fetch filtering. The pseudo code for this step is shown in the algorithm below. If a pre-fetch candidate matches the value stored in the last-prefetch field, the contents of the pre-fetch candidate buffer up to this point are discarded. Every pre-fetch candidate is looked up in the cache to see if it is already present. If it is not present, it is checked against the miss status holding registers to see if a demand request for the same block has already been issued.
The pre-fetch buffer can only hold 32 pre-fetches. If it is full, pre-fetches are discarded in FIFO order. Finally, the last-prefetch field is updated with the address of the issued pre-fetch.
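The filtering step, continuing the same sketch, can be outlined as below; in_cache(), in_mshr() and issue_prefetch() are assumed placeholders for the cache lookup, the miss status holding registers and the 32-entry pre-fetch queue, respectively.

    /* Pre-fetch filtering for the candidates produced by dcpt_correlate(). */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    extern bool in_cache(uint64_t addr);          /* placeholder */
    extern bool in_mshr(uint64_t addr);           /* placeholder */
    extern void issue_prefetch(uint64_t addr);    /* 32-entry FIFO queue */

    void dcpt_filter_and_issue(dcpt_entry_t *e, const uint64_t *cand, size_t n)
    {
        size_t start = 0;
        /* a candidate equal to last_prefetch discards everything up to it */
        for (size_t i = 0; i < n; i++)
            if (cand[i] == e->last_prefetch)
                start = i + 1;

        for (size_t i = start; i < n; i++) {
            if (in_cache(cand[i]) || in_mshr(cand[i]))
                continue;                         /* already present or in flight */
            issue_prefetch(cand[i]);
            e->last_prefetch = cand[i];           /* remember last issued pre-fetch */
        }
    }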
5.0 - Critical evaluation
5.1 - Increased cache interference
Pre-fetching may, however, lead to increased cache interference. For uniprocessors, there are two different ways in which prefetching can increase cache interference:
5.1.1 - A pre-fetched line can displace another cache line which would have been a hit under the original execution
5.1.2 - A pre-fetched line can be removed from the cache by either an access or another pre-fetch before the
processor has time to reference it. In the former case a pre-fetch generates another miss, while in the latter it cancels
a pre-fetch.
For a multiprocessor, however, prefetching can also cause internode interference. This happens when invalidations generated by pre-fetches occurring at other nodes transform originally local hits into misses, or cancel pre-fetched data before it can be referenced by the processor.
5.2 - Increased memory traffic
There are two reasons why this might occur in prefetching schemes: one is the prefetching of unnecessary data, and the other is the early displacement, and subsequent on-demand recall, of useful data. Such increases in memory traffic can add to memory access latency and may lead to performance degradation on processors that support bus-based multiprocessing, which tolerate prefetching poorly, as shown and tested by Tullsen and Eggers [16, 17].
6.0 - Comparisons and final thoughts
Starting off with hardware-based prefetching: these schemes require some sort of hardware modification of, or tweaks to, the main processor. Their main advantage is that pre-fetches are handled dynamically at runtime without compiler intervention. The drawbacks are that extra hardware resources are needed, that memory references for complex access patterns are difficult to predict, and that hardware schemes tend to have a more negative effect on memory traffic.
In contrast, software-directed approaches require little to no hardware support. They rely on compiler technology to perform static program analysis and to selectively insert pre-fetching instructions. Because of this, they are less likely to pre-fetch unnecessary data and hence reduce cache pollution. The disadvantages are that there is some overhead due to the extra pre-fetch instructions and that some useful prefetching opportunities, visible only at run time, cannot be uncovered by compile-time analysis.
The conclusions provided above are based on the study of Chen and Baer [1]; their results and observations were taken as a baseline for the compilation of this paper.
7.0 - BIBLIOGRAPHY:
[1] Chen, T.-F. and Baer, J.-L. Effective hardware-based data prefetching for high-performance processors. IEEE Transactions on Computers, Vol. 44, No. 5, May 1995.
[2] Liu, Y. and Kaeli, D. R. Branch-directed and stride-based data cache prefetching. Proceedings of the International Conference on Computer Design, October 1996.
[3] Porterfield, A. K. Software methods for improvement of cache performance on supercomputer applications. Ph.D. Thesis, Rice University, 1989.
[4] Tullsen, D. M. and Eggers, S. J. Effective cache prefetching on bus-based multiprocessors. ACM Transactions on Computer Systems, 13, pp. 57-88, 1995.
[5] Mowry, T. C., Lam, M. S. and Gupta, A. Design and evaluation of a compiler algorithm for prefetching. 5th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, 1992.
[6] Saavedra, R. H., Mao, W. and Hwang, K. Performance and optimization of data prefetching strategies in scalable multiprocessors. Journal of Parallel and Distributed Computing, 22:3, pp. 427-448, 1994.
[7] Smith, A. J. Cache memories. Computing Surveys, Vol. 14, No. 3, September 1982, pp. 473-530.
[8] Dahlgren, F., Dubois, M. and Stenström, P. Fixed and adaptive sequential prefetching in shared-memory multiprocessors. Proc. International Conference on Parallel Processing, St. Charles, IL, August 1993, pp. I-56-63.
[9] Fu, J. W. C., Patel, J. H. and Janssens, B. L. Stride directed prefetching in scalar processors. Proc. 25th International Symposium on Microarchitecture, Portland, OR, December 1992, pp. 102-110.
[10] Sklenar, I. Prefetch unit for vector operations on scalar computers. Proc. 19th International Symposium on Computer Architecture, Gold Coast, Australia, May 1992.
[11] Perez, D. G., Mouchard, G. and Temam, O. MicroLib: a case for the quantitative comparison of micro-architecture mechanisms. MICRO 37: Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture, Washington, DC, USA, IEEE Computer Society, 2004.
[12] Nesbit, K. J. and Smith, J. E. Data cache prefetching using a global history buffer. Proceedings of the International Symposium on High-Performance Computer Architecture, 2004.
[13] Chen, I-C. K., Lee, C.-C. and Mudge, T. N. Instruction prefetching using branch prediction information. EECS Department, University of Michigan, Ann Arbor, MI.
[14] Kaeli, D. and Emma, P. G. Branch history table prediction of moving target branches due to subroutine returns. Proceedings of the 18th International Symposium on Computer Architecture, May 1991.
[15] Grannaes, M., Jahre, M. and Natvig, L. Storage efficient hardware prefetching using Delta-Correlating Prediction Tables. Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway.
[16] Tullsen, D. M. and Eggers, S. J. Limitations of cache prefetching on a bus-based multiprocessor. ACM Transactions on Computer Systems, 13, 1995.
[17] Tullsen, D. M. and Eggers, S. J. Effective cache prefetching on bus-based multiprocessors. ACM Transactions on Computer Systems, 13, 1995.
