Archana.K.V Dept of Computer Science and Engineering
BTL Institute of Technology Bangalore, India archanaglows@gmail.com
Abstract: Processor-memory bandwidth is a major bottleneck in modern processors because a number of processor cores share the same bus/processor-memory interface. Caches also consume a significant amount of energy in current microprocessors, so designing an energy-efficient microprocessor requires optimizing cache energy consumption. Effective utilization of this resource is consequently an important aspect of memory hierarchy design for multi-core processors. This is currently an active field of research, and a number of techniques have been proposed to address the problem. The main contribution of this paper is an assessment of the effectiveness of some of the techniques that have been implemented in recent chip multiprocessors. Cache optimization techniques that were proposed for single-core processors but have not yet been implemented in multi-core processors are also examined to predict their effectiveness.
1. INTRODUCTION
On-chip memory and its efficient use in multi-core processors is the prime focus of this paper. As the number of cores on a single chip increases, the on-chip memory organization determines the overall memory performance and therefore the performance of the applications running on these systems. The workload running on these systems is a mix of multiple programs. Overall performance must consequently be judged not only by the throughput of multiple programs but also by the performance of programs made up of multiple parallel processes running on multiple cores of the same chip. The on-chip cache hierarchy needs to be designed with the best feasible configuration and optimizations to serve this purpose.

Portable computing applications have shifted from conventional low-performance products such as wristwatches and calculators to high-throughput, computation-intensive products such as notebook computers and cellular phones. Such portable computing applications require high speed yet low energy consumption, because for these products longer battery life translates into extended use and better marketability. This paper presents a case study of performance and power trade-offs in designing on-chip caches for the microprocessors used in portable computing applications.

Early cache studies focused primarily on improving performance. Studies of cache access times and miss rates for different cache parameters (e.g. cache size, block size, and degree of set associativity) of single-level caches can be found in [5,8]. Corresponding studies of multi-level cache organizations can be found in [6,7]. Studies of instruction set design and how it affects cache performance and power consumption can be found in [1,3].

This paper consists of five sections. Section 2 briefly describes the cache performance and energy models used in this study. Section 3 presents several experimental cache organizations designed either for improving performance or for saving energy. Section 4 shows the experimental results of this study. Finally, concluding remarks are offered in Section 5.
Fig.1 Block diagram of on-chip memory hierarchy in CMPs
International Journal of Innovatory research in Engineering and Technology - IJIRET
ISSN: XXXX-XXXX Volume X, Issue X, Month Year

2. ANALYTICAL MODELS FOR ON-CHIP CACHES
A conventional cache can be divided into three distinct components: the address decoding path, the cell arrays, and the I/O path. The address decoding path includes the address buses and the address decoding logic. The cell arrays include the read/write circuitry, the tag arrays, and the data arrays. The I/O path includes the I/O pads and the buses that connect them to the address and data buses.

The on-chip cache cycle time is computed using an analytical model presented in [6,14] (which was based on the access time model of Wada et al. in [13]). This timing model, based on 0.8 µm CMOS technology, gives both the cache cycle time (i.e. the minimum time required between the start of two accesses) and the cache access time (i.e. the minimum time between the start and end of a single access) in terms of cache size, block size, and associativity. A characteristic of this timing model is that it uses SPICE parameters to predict the delays due to the address decoder, word-line driver, pre-charged bit lines, sense amplifiers, data bus driver, and data output drivers. The average time for an off-chip cache access is computed from the average off-chip access and transfer times, which are rounded up to the next higher multiple of the on-chip cycle time.

The on-chip cache energy consumption is based on an abstract model that considers only those cache components that dominate overall cache power consumption. In the address decoding path, the capacitance of the decoding logic is generally less than that of the address buses, so the energy consumption of the address buses dominates the total energy consumption of the address decoding path. In the cell arrays, the read/write circuitry generally does not consume much power; most of the energy consumed in the cell arrays is due to the tag and data arrays. The tag and data arrays in conventional cache designs can be implemented in dynamic or static logic.
In a dynamic circuit design, the word/bit lines are generally pre-charged before they are accessed, and the energy consumed by the pre-charged word/bit lines normally dominates the overall energy consumption of the cell arrays. In a static circuit design, there is no pre-charging of the word/bit lines, and the energy consumption of the tag and data arrays depends directly on the bit-switching activity of the bit lines. In the I/O path, most energy is consumed during bit switches of the I/O pads.

2.1 OPTIMIZATIONS IMPLEMENTED SUCCESSFULLY
A number of cache optimization techniques that were implemented in single-core processors have been successfully carried over to multi-core processors. A multi-level cache, typically with two levels, has been in use since the very first multi-core processors (Fig. 1). In this organization, the first-level cache is private to each core, and coherence among the private caches is maintained with the MESI or MOESI protocol (Villa, F.J., et al., 2005). The second-level cache has been implemented with different design choices in different architectures. In general, the second-level cache is shared among all cores, with a number of optimizations discussed in this section.

One of the major innovations in the design of the second-level cache is the NUCA (Non-Uniform Cache Architecture) cache (Kim, C., et al., 2003). The motivation for the NUCA organization is that the second-level cache is made much larger than the first-level cache to meet the design requirements of a multi-level cache, and access time grows with cache size. This problem is addressed by dividing the cache into banks. The data used by a particular core is kept in a bank physically close to it, which improves access speed. A number of NUCA variants have been developed over the last few years, with many innovations implemented in modern processors.
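A rough way to see why keeping a core's data in nearby banks helps is the toy latency model below. It is only a sketch under stated assumptions: the one-dimensional row of banks, the constants BANK_BASE_CYCLES and HOP_CYCLES, and the function names are illustrative and are not taken from (Kim, C., et al., 2003) or from any real processor.

```python
# Toy NUCA latency model. All numbers are illustrative assumptions.
BANK_BASE_CYCLES = 4   # assumed cycles to read a bank once the request arrives
HOP_CYCLES = 2         # assumed cycles per hop on the on-chip interconnect


def bank_latency(core_pos, bank_pos):
    """Access latency from a core to a bank on a 1-D row of banks."""
    hops = abs(core_pos - bank_pos)
    return BANK_BASE_CYCLES + HOP_CYCLES * hops


def average_latency(core_pos, bank_positions):
    """Mean latency when the core's data is spread evenly over the given banks."""
    total = sum(bank_latency(core_pos, b) for b in bank_positions)
    return total / len(bank_positions)


banks = list(range(8))   # eight banks in a row, positions 0..7
core = 0                 # a core sitting next to bank 0

spread = average_latency(core, banks)       # data interleaved over all banks
nearby = average_latency(core, banks[:2])   # data kept in the two closest banks

print(f"spread over all banks: {spread} cycles")
print(f"kept in nearby banks:  {nearby} cycles")
```

Under these assumed numbers, spreading the data over all eight banks gives an average of 11 cycles per access, while keeping it in the two nearest banks gives 5 cycles. Real NUCA designs differ in topology and placement policy, so only the trend is meaningful here.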
3. EXPERIMENTAL CACHE ORGANIZATIONS

3.1 Conventional Designs
Conventional cache designs include direct-mapped and set-associative caches. A set-associative cache generally has a better hit rate than a direct-mapped cache of the same size, although its access time is usually higher. The number of bit-line switches in a set-associative cache is usually greater than in a direct-mapped cache, but the energy consumption of each bit line is generally lower than in a direct-mapped cache of the same size.

3.2 Cache Designs for Low Power
This paper investigates three cache design approaches for achieving low power: vertical cache partitioning, horizontal cache partitioning, and Gray code addressing.

Vertical Cache Partitioning
The fundamental idea of vertical cache partitioning is to reduce the capacitance switched on each cache access by adding levels to the on-chip cache hierarchy (e.g. two-level caches). Accessing a smaller cache consumes less power because a smaller cache has a lower load capacitance. We use block buffering as an example of this approach. The basic structure of a block-buffered cache [1] is presented in Figure 1. The block buffer is, in effect, another cache that is closer to the processor than the on-chip caches. The processor checks whether there is a block hit (i.e. the currently accessed data lies in the same block as the most recently accessed data). On a hit, the data is read directly from the block buffer and the cache is not operated. The cache is operated only on a block miss. A block-buffered cache saves power by reducing the capacitance switched on each cache access.

The effectiveness of block buffering depends strongly on the spatial locality of the application and on the block size. The higher the spatial locality of the access pattern (e.g. an instruction sequence), the more energy block buffering can save. The block size is also very important. Setting aside its effect on the cache hit rate, a small block may limit the amount of energy saved by the block-buffered cache, while a large block may increase unnecessary energy consumption due to the unused data in the block.
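The block-hit check described above can be sketched in a few lines. This is a minimal illustration assuming a single-entry block buffer and byte addresses; the function name block_buffer_hits and the two access traces are hypothetical and are not drawn from the experiments in this paper.

```python
def block_buffer_hits(addresses, block_size):
    """Count accesses served by a single-entry block buffer.

    An access is a block hit when it falls in the same block as the
    immediately preceding access; otherwise the cache itself is operated.
    """
    last_block = None
    hits = 0
    for addr in addresses:
        block = addr // block_size
        if block == last_block:
            hits += 1            # served from the block buffer, cache stays idle
        else:
            last_block = block   # block miss: operate the cache, refill the buffer
    return hits


sequential = list(range(0, 64, 4))    # 16 accesses, 4 bytes apart (high locality)
strided = list(range(0, 1024, 64))    # 16 accesses, 64 bytes apart (low locality)

seq_hits_16 = block_buffer_hits(sequential, block_size=16)
seq_hits_32 = block_buffer_hits(sequential, block_size=32)
strided_hits = block_buffer_hits(strided, block_size=32)

print(seq_hits_16, seq_hits_32, strided_hits)
```

For the sequential trace, 12 of the 16 accesses hit the buffer with 16-byte blocks and 14 with 32-byte blocks, while the strided trace never hits. This mirrors the dependence on spatial locality and block size noted above; it does not model the energy cost of fetching unused data in larger blocks.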
Horizontal Cache Partitioning
The basic idea of horizontal cache partitioning is to divide the cache data memory into several segments, each of which can be powered individually. Cache sub-banking, proposed in [11], is one horizontal cache partitioning technique; it partitions the data array of a cache into several banks (called cache sub-banks). Each cache sub-bank can be accessed (powered up) separately, and only the sub-bank that holds the requested data consumes power on a given cache access. A basic structure for cache sub-banking is shown in the accompanying figure. Cache sub-banking saves power by eliminating unnecessary accesses. The amount of power saved depends on the number of cache sub-banks: more sub-banks save more power. One advantage of cache sub-banking over block buffering is that the effective cache hit time of a sub-banked cache can be as fast as that of a conventional performance-driven cache, since the sub-bank selection logic is generally very simple and can easily be hidden in the cache index decoding logic. Because it maintains cache performance, cache sub-banking is very attractive to computer architects designing energy-efficient high-performance microprocessors.

Gray Code Addressing
Memory addressing in a conventional processor design usually uses a 2's complement representation, for which the bit switching on the address buses when accessing consecutive memory locations is not optimal. In a Gray code, by contrast, consecutive values differ in exactly one bit. Because a significant amount of energy is consumed on the address buses, and sequential memory accesses are common in applications with high spatial locality, it is important to minimize the bit-switching activity of the address buses for low-power caches.

4. RESULTS AND DISCUSSION

4.1 Proposed Cache Optimizations
A number of cache optimization techniques have been successfully implemented in single-core processors or single-core multiprocessors but have not yet been attempted in multi-core processors. Some of these techniques are discussed in this section, with a prediction of their effectiveness in multi-core processors.
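As a small illustration of the Gray code addressing argument in Section 3.2, the sketch below counts address-bus bit switches for a run of consecutive addresses under plain binary and under Gray code. It assumes the standard reflected Gray code; the helper names to_gray and bit_switches and the 16-address trace are chosen here for illustration and are not part of the experiments reported in this paper.

```python
def to_gray(n):
    """Convert a binary-coded integer to its reflected Gray code."""
    return n ^ (n >> 1)


def bit_switches(values):
    """Total number of bit flips between consecutive values on a bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))


addresses = list(range(16))   # 16 consecutive addresses, 0..15

binary_switches = bit_switches(addresses)
gray_switches = bit_switches([to_gray(a) for a in addresses])

print(f"binary: {binary_switches} bit switches")
print(f"gray:   {gray_switches} bit switches")
```

For this trace the binary encoding causes 26 bit switches and the Gray encoding 15, one per address step. The saving applies to sequential accesses only; for irregular access patterns the two encodings can behave similarly.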
4.2 Ineffective Cache Optimizations
The optimization techniques introduced in Section 4.1 need to be implemented to determine their effectiveness. A few optimizations have been tried on multi-core processors and were found to be ineffective. As more optimizations are tested, more such techniques may turn out not to be effective for multi-core processors. The following paragraphs give a brief account of the tested techniques that were not successful in CMPs.

Cache affinity is a policy decision taken by the operating system to schedule processes on particular cores. The decision is based on the expectation that a process that has its context in a cache will reuse those contents as a result of temporal locality. After a context switch, when the process is rescheduled, it is allocated to the same processor on the assumption that its context may still be present in the cache, which reduces the compulsory (cold-start) misses. This scheme has improved performance in conventional multiprocessors (SMPs). When the scheme was investigated on multi-core processors, as summarized in (Kazempour, et al., 2008), the performance improvement on multi-core uniprocessors (CMPs) was found not to be significant, but the performance was good on multi-core multiprocessors (SMPs based on CMPs).

5. CONCLUSION AND FUTURE DIRECTIONS
This paper can serve as a guideline for future work by researchers interested in optimizing the memory hierarchy of scalable multi-core processors, as it presents a survey of such techniques proposed in recent publications, together with comments on their effectiveness. A summary of all the optimization techniques discussed in this paper is presented in Table 1. The effect of operating system mechanisms and policies on the memory hierarchy, especially the on-chip cache hierarchy, is another research direction that can be explored. High coherence traffic gives rise to congestion at the first-level cache. Directory-based coherence protocols may reduce the overall coherence traffic, but this comes at the cost of maintaining the directory and keeping it updated. These and other research directions will be explored in future work.
6. REFERENCES
[1] Chang and Sohi, (2006), Cooperative Caching for Chip Multiprocessors, Proceedings of the 33rd Annual International Symposium on Computer Architecture, p. 264-276
[2] Chen and Kandemir, (2008), Code Restructuring for Improving Cache Performance in MPSoCs, IEEE Transactions on Parallel and Distributed Systems, Vol. 19, No. 9, p. 1201-1214
[3] Dybdahl, H., Stenström, P., (2007), An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors, Proceedings of the IEEE 13th International Symposium on High Performance Computer Architecture, p. 2-12
[4] Dybdahl, Stenström, (2006), Enhancing Last-Level Cache Performance by Block Bypassing and Early Miss Determination, Asia-Pacific Computer Systems Architecture Conference (ACSAC), LNCS 4186, p. 52-66
[5] Core Systems, Proceedings of the 42nd International Symposium on Micro-architecture (MICRO), p. 327-336