IJIRET Archana KV Increasing Memory Performance Using Cache Optimizations in Chip Multiprocessors

ISSN: XXXX-XXXX Volume X, Issue X, Month Year
Increasing Memory Performance Using Cache Optimizations in Chip Multiprocessors

Archana.K.V
Dept of Computer Science and Engineering

BTL Institute of Technology
Bangalore, India
archanaglows@gmail.com

Abstract:
Processor-memory bandwidth is a major bottleneck in modern
processors because several processor cores share the same
bus or processor-memory interface. Caches also consume a
significant amount of energy in current microprocessors. To
design an energy-efficient microprocessor, it is important
to optimize cache energy consumption. Effective utilization
of this resource is consequently an important aspect of
memory hierarchy design for multi core processors. This is
presently an active field of research, and a large number of
studies have suggested techniques to address the problem.
The main contribution of this paper is an assessment of the
effectiveness of some of the techniques that were
implemented in recent chip multiprocessors. Cache
optimization techniques that were proposed for single core
processors but have not been implemented in multi core
processors are also examined to predict their effectiveness.

Keywords: On-chip cache hierarchy; cache optimizations

1. INTRODUCTION
The on-chip memory of multi core processors and its
efficient use is the prime focus of this paper. With the
increasing number of cores on a single chip, the on-chip
memory design determines the overall memory performance and
therefore the performance of the applications running on
these systems. The workload running on these systems is a
mix of multiple programs. Overall performance is therefore
observed not only from the throughput of multiple programs
but also from the performance of programs made up of
multiple parallel processes running on multiple cores of the
same chip. The on-chip cache hierarchy needs to be designed
with the best feasible configuration and optimizations to
serve this purpose.
Portable computing applications have changed from
conventional low performance products such as wristwatches
and calculators to high throughput, computation intensive
products such as notebook computers and cellular phones.
These newer portable computing applications require high
speed yet low energy consumption, because for such products
longer battery life translates to extended use and better
marketability. This paper presents a case study of
performance and power trade-offs in designing on-chip caches
for the microprocessors used in portable computing
applications.
Early cache studies have primarily focused on improving
performance. Studies of cache access times and miss rates
for different cache parameters (e.g. cache size, block size,
and degree of set associativity) of single level caches can
be found in [5,8]. Corresponding studies focusing on
multi-level cache organizations can be found in [6,7].
Studies of instruction set design and its effect on cache
performance and power consumption can be found in [1,3].
This paper consists of five sections. Section 2 briefly
describes the cache performance and energy models used in
this study. Section 3 presents several experimental cache
organizations which are designed for either improving
performance or saving energy. Section 4 shows the
experimental results of this study. Finally, concluding
remarks are offered in Section 5.

Fig.1 Block diagram of on-chip memory hierarchy in CMPs

International Journal of Innovatory research in Engineering and Technology - IJIRET

2. ANALYTICAL MODELS FOR ON-CHIP CACHES
A conventional cache can be divided into three distinct
components: the address decoding path, the cell arrays, and
the I/O path. The address decoding path includes the address
buses and address decoding logic. The cell arrays include
the read/write circuitry, tag arrays, and data arrays. The
I/O path includes the I/O pads and the buses that link to
the address and data buses.
The on-chip cache cycle time is computed from an analytical
model presented in [6,14] (which was based on the access
time model of Wada et al. in [13]). This timing model, based
on 0.8 µm CMOS technology, gives both the cache cycle time
(i.e. the minimum time required between the start of two
accesses) and the cache access time (i.e. the minimum time
between the start and end of a single access) in terms of
cache size, block size, and associativity. A characteristic
of this timing model is that it uses SPICE parameters to
predict the delays due to the address decoder, word-line
driver, pre-charged bit lines, sense amplifiers, data bus
driver, and data output drivers.
The average time for an off-chip cache access is computed
from the average off-chip access and transfer times, which
are rounded up to the next higher multiple of the on-chip
cycle time.
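The rounding step described above can be illustrated with a minimal sketch. The function name and the example timings are assumptions made for illustration and do not come from the timing model in [6,14]; the sketch also assumes that a time which is already an exact multiple of the cycle time is left unchanged.

```python
import math

def rounded_off_chip_time(off_chip_ns, on_chip_cycle_ns):
    """Round an off-chip access-plus-transfer time up to a whole
    number of on-chip cache cycles (hypothetical helper)."""
    # Number of whole on-chip cycles needed to cover the off-chip time.
    cycles = math.ceil(off_chip_ns / on_chip_cycle_ns)
    return cycles * on_chip_cycle_ns

# Illustrative values only: a 23 ns off-chip access with a 5 ns on-chip cycle
# occupies 5 full cycles.
example = rounded_off_chip_time(23, 5)
```

With these illustrative numbers the off-chip access is charged as 25 ns rather than 23 ns, since the processor can only resume on an on-chip cycle boundary.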
The on-chip cache energy consumption is based on an abstract
model that considers only those cache components that
dominate overall cache power consumption. In the address
decoding path, the capacitance of the decoding logic is
generally less than that of the address bus, so the energy
consumption of the address buses dominates the total energy
consumption of the address decoding path. In the cell
arrays, the read/write circuitry generally does not consume
much power. Most of the energy consumed in the cell arrays
is due to the tag and data arrays. The tag and data arrays
in conventional cache designs can be implemented in dynamic
or static logic. In a dynamic circuit design, word/bit lines
are generally pre-charged before they are accessed. The
energy consumed by the pre-charged word/bit lines normally
dominates the overall energy consumption in the cell arrays.
In a static circuit design, there is no pre-charging of the
word/bit lines, and the energy consumption of the tag and
data arrays depends directly on the bit switching activity
of the bit lines. In the I/O path, most energy is consumed
during bit switches of the I/O pads.
2.1 OPTIMIZATIONS IMPLEMENTED SUCCESSFULLY
A number of cache optimization techniques that were
implemented in single core processors have been carried over
successfully to multi core processors. A multi-level cache,
currently organized as two levels, has been implemented
since the very first multi core processor, as shown in
Fig.1. In this organization, the first-level cache is
private to each core and coherence is maintained among the
private caches with the MESI or MOESI protocols (Villa,
F.J., et al., 2005). The second-level cache has been
implemented with different design choices in different
architectures. In general, the second-level cache is shared
between all cores, with a number of optimizations that are
discussed in this section. One of the major innovations in
the design of the second-level cache is the NUCA (Non
Uniform Cache Architecture) cache (Kim, C., et al., 2003).
The reason for the NUCA organization is that the
second-level cache is made much larger than the first-level
cache to meet the design requirements of a multi-level
cache. The result is a slower access time as the cache size
increases. This problem is addressed by dividing the cache
into banks. The context of a particular core is kept in a
bank physically closer to it, which improves the access
speed. A number of NUCA variants have been developed over
the last few years, with many innovations implemented in
modern generation processors.
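The idea of non-uniform bank latency can be sketched as follows. This is a simplified illustration, not the design of Kim et al.: it assumes that cores and banks sit at the same positions along a single row, and the base latency and per-hop cost are hypothetical values.

```python
def bank_latency(core, bank, base_cycles=4, hop_cycles=2):
    """Access latency grows with the distance (in hops) between
    a core and a cache bank. All constants are illustrative."""
    return base_cycles + hop_cycles * abs(core - bank)

def nearest_bank(core, num_banks):
    """Pick the bank with the lowest latency for the given core."""
    return min(range(num_banks), key=lambda b: bank_latency(core, b))

# Core 2 on a 4-bank cache: its own bank is the fastest,
# while core 0 reaching the farthest bank pays the most.
local = bank_latency(2, nearest_bank(2, 4))
remote = bank_latency(0, 3)
```

Under these assumptions a local bank access costs 4 cycles and the farthest access costs 10 cycles, which is why keeping the context of a core in a nearby bank improves the average access time.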

3. EXPERIMENTAL CACHE ORGANIZATIONS
3.1 Conventional Designs
Conventional cache designs include direct-mapped and set
associative caches. A set associative cache generally has a
better hit rate than a direct-mapped cache of the same size,
although the access time of the set associative cache is
commonly higher than that of the direct-mapped cache. The
number of bit line switches in the set associative cache is
normally larger than in the direct-mapped cache, but the
energy consumption of each bit line in a set associative
cache is generally less than in a direct-mapped cache of the
same size.
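The hit rate difference between the two designs can be seen in a small simulation. The following sketch is an illustration only: it assumes LRU replacement, block-granularity addresses, and a hand-picked trace in which two blocks conflict in the direct-mapped cache. None of these choices come from the experiments in this paper.

```python
from collections import OrderedDict

def hit_rate(trace, num_blocks, ways):
    """Simulate a cache holding num_blocks blocks with the given
    associativity and LRU replacement; return the hit rate."""
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for block_addr in trace:
        cache_set = sets[block_addr % num_sets]
        if block_addr in cache_set:
            hits += 1
            cache_set.move_to_end(block_addr)   # mark as most recently used
        else:
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)   # evict least recently used
            cache_set[block_addr] = True
    return hits / len(trace)

# Blocks 0 and 8 map to the same line of an 8-block direct-mapped cache.
trace = [0, 8, 0, 8, 0, 8, 0, 8]
direct_mapped = hit_rate(trace, num_blocks=8, ways=1)
two_way = hit_rate(trace, num_blocks=8, ways=2)
```

On this trace the direct-mapped cache never hits, because the two blocks keep evicting each other, while the 2-way set associative cache of the same size holds both blocks and hits on every access after the first two.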
3.2 Cache Designs for Low Power
This paper investigates three cache design approaches to
achieve low power: vertical cache partitioning, horizontal
cache partitioning, and Gray code addressing.
Vertical Cache Partitioning
The fundamental idea of vertical cache partitioning is to
reduce the capacitance switched on each cache access by
adding levels to the on-chip cache hierarchy (e.g. two-level
caches). Accessing a smaller cache consumes less power since
a smaller cache has a lower load capacitance. We use block
buffering as an example of this approach. A basic structure
of a block buffered cache [1] is presented in Figure 1. The
block buffer itself is, in effect, another cache which is
closer to the processor than the on-chip caches. The
processor checks whether there is a block hit (i.e. the
currently accessed data is located in the same block as the
most recently accessed data). If it is a hit, the data is
read directly from the block buffer and the cache is not
operated. The cache is operated only if there is a block
miss. A block buffered cache saves power by reducing the
capacitance switched on each cache access.
The effectiveness of block buffering depends strongly on the
spatial locality of applications and on the block size. The
higher the spatial locality of the access patterns (e.g. an
instruction sequence), the larger the amount of energy that
can be saved by block buffering. The block size is also very
important in block buffering. Apart from the effect of the
block size on the cache hit rate, a small block may limit
the amount of energy saved by the block buffered cache, and
a large block may increase unnecessary energy consumption
due to the unused data in the block.
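The block hit check can be sketched with a short simulation that counts how many accesses are served by the block buffer and how many operate the cache. The sketch assumes a single-entry block buffer and word addresses; the function name and the traces are illustrative assumptions.

```python
def block_buffer_stats(addresses, block_size):
    """Count block buffer hits and cache accesses for a single-entry
    block buffer that holds the most recently accessed block."""
    last_block = None
    buffer_hits = 0
    cache_accesses = 0
    for addr in addresses:
        block = addr // block_size
        if block == last_block:
            buffer_hits += 1          # served by the block buffer
        else:
            cache_accesses += 1       # block miss: the cache is operated
            last_block = block
    return buffer_hits, cache_accesses

# A sequential trace has high spatial locality; a strided one has none.
sequential = block_buffer_stats(range(16), block_size=4)
strided = block_buffer_stats([0, 16, 32, 48], block_size=4)
```

For the sequential trace only 4 of 16 accesses reach the cache, while for the strided trace every access is a block miss, which matches the dependence on spatial locality described above.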

Horizontal Cache Partitioning
The basic idea of the horizontal cache partitioning approach
is to partition the cache data memory into several segments.
Each segment can be powered individually. Cache sub-banking,
proposed in [11], is one horizontal cache partitioning
technique which partitions the data array of a cache into
several banks (called cache sub-banks). Each cache sub-bank
can be accessed (powered up) separately. Only the cache
sub-bank where the requested data is located consumes power
in each cache access. A basic structure for cache
sub-banking is shown in the corresponding figure. Cache
sub-banking saves power by eliminating unnecessary accesses.
The amount of power saved depends on the number of cache
sub-banks: more cache sub-banks save more power.
One advantage of cache sub-banking over block buffering is
that the effective cache hit time of a sub-banked cache can
be as fast as that of a conventional performance-driven
cache, since the sub-bank selection logic is generally very
simple and can be easily hidden in the cache index decoding
logic. Because it maintains cache performance, cache
sub-banking is very attractive to computer architects
designing energy-efficient high-performance microprocessors.
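A minimal sketch of sub-bank selection is given below. It assumes that each block of the data array is split into equal slices, one per sub-bank, selected by the offset within the block, and that each slice consumes an equal share of the data array energy. These are simplifying assumptions for illustration and are not taken from [11].

```python
def active_sub_bank(addr, block_size=32, num_sub_banks=4):
    """Return the index of the only sub-bank that is powered up
    for this byte address (hypothetical selection rule)."""
    slice_size = block_size // num_sub_banks
    return (addr % block_size) // slice_size

def data_array_energy(accesses, num_sub_banks, energy_full=1.0):
    """Relative data array energy when only one of num_sub_banks
    equal slices is activated per access."""
    return len(accesses) * energy_full / num_sub_banks

accesses = [0, 9, 31, 40]
banks_used = [active_sub_bank(a) for a in accesses]
banked = data_array_energy(accesses, num_sub_banks=4)
unbanked = data_array_energy(accesses, num_sub_banks=1)
```

In this simplified model the four accesses cost as much data array energy as a single access to the unpartitioned array, and the selection logic is only a shift and a mask on the address offset, which is consistent with the claim that it can be hidden in the index decoding.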
Gray Code Addressing
Memory addressing in a traditional processor design usually
uses a 2's complement representation. With this
representation, the bit switching of the address buses when
accessing consecutive memory locations is not optimal. Since
a significant amount of energy is consumed on the address
buses, and sequential memory accesses are frequently seen in
applications with high spatial locality, it is important to
minimize the bit switching activity of the address buses for
low power caches. In a Gray code sequence, consecutive
values differ in exactly one bit, so Gray code addressing
switches only one address bit per sequential access.
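The difference in address bus switching can be checked with a short sketch that counts the bit switches for a run of sequential addresses in plain binary and in Gray code. The conversion used is the standard binary-reflected Gray code; the 16-address trace is an illustrative assumption.

```python
def to_gray(n):
    """Convert a non-negative integer to its binary-reflected Gray code."""
    return n ^ (n >> 1)

def bit_switches(values):
    """Total number of bits that change between consecutive values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))

addresses = list(range(16))
binary_switches = bit_switches(addresses)
gray_switches = bit_switches([to_gray(a) for a in addresses])
```

For these 16 sequential addresses the binary encoding causes 26 bit switches on the address bus, while the Gray code encoding causes 15, one per address step.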
4. RESULTS AND DISCUSSION
4.1. Proposed Cache Optimizations
A number of cache optimization techniques were successfully
implemented in single core processors or single core
multiprocessors but have not yet been attempted in multi
core processors. Some of these techniques are discussed in
this section with a prediction of their effectiveness in
multi core processors.

4.2 Ineffective Cache Optimizations
The optimization techniques introduced in Section 4.1 need
to be implemented to determine their effectiveness. A few
optimizations were tried on multi core processors and were
found to be ineffective. As more optimizations are tested,
more such techniques may turn out not to be effective for
multi core processors. The following paragraphs give a brief
account of the tested techniques that were not successful in
CMPs.
Cache affinity is a policy decision taken by the operating
system to schedule processes on particular cores. The
decision is based on the activity of a process that has its
context in a cache and is expected to reuse the contents as
a result of temporal locality. After a context switch, when
a process is rescheduled, it is allocated to the same
processor on the assumption that its context may still be
present in the cache, which reduces the compulsory or cold
start misses. This scheme has improved performance in
conventional multiprocessors (SMPs). An investigation of
this scheme in multi core processors, summarized in
(Kazempour, et al., 2008), observed that the performance
improvement in multi core uniprocessors (CMPs) is not
significant, but that the performance is good in the case of
multi core multiprocessors (SMPs based on CMPs).
5. CONCLUSION AND FUTURE DIRECTIONS
This paper can serve as a guideline for future work by
researchers interested in the optimization of the memory
hierarchy for scalable multi core processors, as it presents
a survey of such techniques proposed in recent publications.
The techniques are presented along with comments about their
effectiveness. A summary of all the optimization techniques
discussed in this paper is presented in Table 1. The effect
of operating system mechanisms and policies on the memory
hierarchy, especially the on-chip cache hierarchy, is
another direction of research that can be explored. High
coherence traffic gives rise to congestion at the first
level cache. Directory-based coherence protocols may reduce
the overall coherence traffic, but this comes at the cost of
maintaining the directory and keeping it updated. These and
other research directions will be explored in future
research.

6. REFERENCES
[1] Chang and Sohi, (2006), Cooperative Caching for Chip
Multiprocessors, Proceedings of the 33rd Annual
International Symposium on Computer Architecture, p. 264-276
[2] Chen and Kandemir, (2008), Code Restructuring for
Improving Cache Performance in MPSoCs, IEEE Transactions on
Parallel and Distributed Systems, Vol. 19, No. 9, p.
1201-1214
[3] Dybdahl, H., P. Stenström, (2007), An Adaptive
Shared/Private NUCA Cache Partitioning Scheme for Chip
Multiprocessors, Proceedings of the IEEE 13th International
Symposium on High Performance Computer Architecture, p. 2-12
[4] Dybdahl, Stenström, (2006), Enhancing Last-Level Cache
Performance by Block Bypassing and Early Miss Determination,
Asia-Pacific Computer Systems Architecture Conference
(ACSAC), LNCS 4186, p. 52-66
[5] Core Systems, Proceedings of the 42nd International
Symposium on Micro-architecture (MICRO), p.327-336
