
Low Power Computing

Low-Power Memory Optimization


Outline

Low power caches and other optimizations
Memory customization

Low Power Caches and Optimizations
Memory Optimization

Most memory-oriented performance optimizations reduce the NUMBER of memory accesses
  this also indirectly reduces power
However, other optimizations can apply
  particularly when the memory architecture is flexible
Cache Architecture

[Figure: READ operation. The address splits into Tag, Index, and Offset fields; the indexed tag entry is compared (Valid, Match) to generate the Hit signal while the data array supplies the Data.]
Augmenting Cache Architecture

Basic idea: add a small buffer/cache
  minimize accesses to the larger memory core
  stores recently accessed data
  fetch directly from this buffer, inhibiting the L1 cache access
  if the hit rate at the buffer is high, power is reduced

Techniques of cache augmentation:
  Filter cache
  Block buffers
  Scratch pad memory
Filter Cache

Insert another small cache (the filter cache) before L1:
  CPU -> Filter cache -> L1 -> L2
  a high hit ratio in the tiny filter cache means most accesses never reach L1
  reduces power, at some cost in hit latency
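The power argument above can be sketched as a first-order energy model: every access probes the filter cache, and only misses pay for an L1 probe as well. The per-access energy numbers in the usage note are illustrative assumptions, not measurements.

```c
#include <assert.h>

/* First-order energy model for a filter cache (illustrative sketch):
 * every access probes the small filter cache; on a miss, L1 is probed too. */
double filter_cache_energy(double hit_rate, double e_filter, double e_l1)
{
    /* expected energy per access = filter probe + (miss fraction) * L1 probe */
    return e_filter + (1.0 - hit_rate) * e_l1;
}
```

With an assumed filter-cache probe cost of 1 unit and an L1 probe cost of 10, even a 50% filter hit rate gives 6 units per access versus 10 without the filter cache; below some break-even hit rate the extra probe stops paying off.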
Block Buffer

Aim: reduce accesses to the memory core
Block buffering:
  save the last accessed cache line's address (and contents) in a buffer
  if the next access is to the same line, read directly from the buffer
  saves an access to the memory core when spatial locality exists
  can be extended to a small fully associative buffer
Mapping techniques:
  direct mapping
  fully associative mapping
  set associative mapping
[Figure: address split into Tag, Index, Offset; small tag and data buffers are checked (Valid, Match -> Hit) before the full arrays.]
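A one-entry block buffer can be sketched as below: the structure, the 32-byte line size, and the hit/miss counters are assumptions for illustration, not a particular hardware design.

```c
#include <stddef.h>

/* Sketch of a one-entry block buffer in front of a cache's data array. */
#define LINE_BYTES 32

typedef struct {
    unsigned long last_line;   /* line address of the most recent access */
    int valid;
    unsigned long core_reads;  /* accesses that reached the memory core */
    unsigned long buf_hits;    /* accesses served by the buffer */
} block_buffer;

void bb_access(block_buffer *bb, unsigned long addr)
{
    unsigned long line = addr / LINE_BYTES;
    if (bb->valid && line == bb->last_line) {
        bb->buf_hits++;        /* same line: read the buffer, core stays idle */
    } else {
        bb->core_reads++;      /* new line: read the core, latch it in the buffer */
        bb->last_line = line;
        bb->valid = 1;
    }
}
```

For a sequential sweep over 64 bytes, only 2 of the 64 accesses reach the memory core; the other 62 are served by the buffer, which is exactly the spatial-locality win described above.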
Scratch Pad Memory

Compiler-managed memory:
  part of the memory address space, directly addressed
  can be on-chip SRAM
  fast, predictable, low power vs. a cache
  used in embedded processors, e.g. the IBM Cell
What data/code should reside in the scratch pad?
  a compiler decision
[Figure: the CPU address space maps partly to on-chip scratch pad memory (1 cycle) and partly, via an on-chip cache (1 cycle), to off-chip memory (10-20 cycles).]
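The compiler decision can be illustrated by a typical transformation: copy a hot row into the scratch pad once, then serve all inner-loop reads from it. The plain array standing in for on-chip SRAM and the function shape are assumptions for illustration.

```c
#include <string.h>

/* Sketch of compiler-managed scratch pad use: one row of `a` is copied
 * once into a small buffer (modelled here as a plain array standing in
 * for on-chip SRAM), so the inner loop reads the cheap scratch pad
 * instead of going back to main memory for a[row][k]. */
#define N 4

void mm_row(const int a[N][N], const int b[N][N], int c_row[N], int row)
{
    int spm[N];                        /* stands in for on-chip SRAM */
    memcpy(spm, a[row], sizeof spm);   /* one bulk copy into the scratch pad */
    for (int j = 0; j < N; j++) {
        int sum = 0;
        for (int k = 0; k < N; k++)
            sum += spm[k] * b[k][j];   /* these reads hit the scratch pad, not a */
        c_row[j] = sum;
    }
}
```

Unlike a cache, nothing here is speculative: the compiler knows exactly which accesses hit the 1-cycle memory, which is why scratch pads are both predictable and cheap.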
Tag Comparison

Basic idea: reduce tag comparisons to save power
Techniques:
  Conventional tag and data access
    all tag and data arrays accessed simultaneously
    more power consumption, but performance is high
  Sequential tag and data access
    power saving, performance penalty
  Way prediction (based on hit/miss history etc.)
    hit on the predicted way: result produced in a single cycle
    miss on the predicted way: result takes more than one cycle
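The way-prediction timing above can be sketched for a 2-way set with an MRU (most-recently-used) predictor; the structure layout and retraining policy are assumptions for illustration, not any specific processor's design.

```c
/* Sketch of MRU way prediction for a 2-way set: probe the predicted way
 * first (1 cycle, only one tag/data way powered); only on a wrong
 * prediction probe the remaining way (a second cycle). */
#define WAYS 2

typedef struct {
    unsigned tags[WAYS];
    int mru;        /* predicted way: the most recently used one */
} cache_set;

/* Returns cycles spent on the lookup (assumes the line is resident). */
int probe(cache_set *s, unsigned tag)
{
    if (s->tags[s->mru] == tag)
        return 1;                     /* hit on predicted way: single cycle */
    for (int w = 0; w < WAYS; w++) {
        if (w != s->mru && s->tags[w] == tag) {
            s->mru = w;               /* retrain the predictor */
            return 2;                 /* extra cycle for the second probe */
        }
    }
    return 2;                         /* genuine miss (not modelled further) */
}
```

The power win comes from the 1-cycle case: only one way's tag and data arrays are activated, instead of all ways in parallel as in the conventional scheme.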
Transformations and Disks

Basic idea:
  data transformation techniques change how data is represented and accessed
  assume the data is large, so disk access is needed
Techniques:
  Loop fusion
  Loop fission
Loop fusion vs. loop fission: which one is preferable?
Transformations and Disks

Assume arrays a and b are located on different disks
  loop fusion may be bad: both disks are continuously accessed
  loop fission (the reverse) could be preferable: the idle disk can sit in a low power mode

Loop fission (one disk idle at a time):
    for (i=0;i<N;i++) {
        a[i][j] = 0;          /* Disk 2 idle */
    }
    for (i=0;i<N;i++) {
        b[i][j] = 0;          /* Disk 1 idle */
    }

Loop fusion (both disks busy):
    for (i=0;i<N;i++) {
        a[i][j] = 0;
        b[i][j] = 0;
    }
DRAM

Two ways to manage dynamic RAM power:
  Power management by the memory controller
  Compiler / OS
    insert code to control the DRAM power state
    bring successive accesses together to create longer idle periods
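Why bringing accesses together helps can be sketched with a toy bank model: the controller drops a bank to a low-power state after a fixed idle period, and each wakeup is expensive. The timeout policy, state names, and access patterns are assumptions for illustration.

```c
/* Toy DRAM bank power model: the controller powers a bank down once it
 * has been idle for `timeout` time steps; touching a powered-down bank
 * costs an expensive wakeup. Clustered accesses leave longer idle runs,
 * so the bank sleeps more and wakes less. */
enum pstate { ACTIVE, POWER_DOWN };

typedef struct {
    enum pstate state;
    int idle;           /* steps since this bank was last touched */
    int wakeups;        /* expensive transitions out of POWER_DOWN */
} bank;

void tick(bank *b, int accessed, int timeout)
{
    if (accessed) {
        if (b->state == POWER_DOWN)
            b->wakeups++;             /* costly exit from the low-power state */
        b->state = ACTIVE;
        b->idle = 0;
    } else if (++b->idle >= timeout) {
        b->state = POWER_DOWN;        /* controller powers the bank down */
    }
}
```

Over the same seven steps with three accesses, a scattered pattern (access, gap, gap, repeated) forces two wakeups, while the clustered pattern (three accesses, then one long idle run) forces none: the compiler/OS goal above is exactly to produce the second shape.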
Encoding

Techniques for encoding words:
  reduce the Hamming distance between successive words in transmission
  add extra bits
  adjust data layout, etc.
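One classic instance of "add extra bits to reduce Hamming distance" is bus-invert coding: if more than half of the bus lines would toggle, transmit the complement and assert one extra invert line. The 8-bit bus width is an assumption for illustration.

```c
/* Bus-invert coding sketch for an 8-bit bus: if more than half of the
 * lines would toggle relative to the previous bus value, drive the
 * complement instead and raise the extra "invert" line, capping the
 * switching activity (and hence dynamic power) per transfer. */
int popcount8(unsigned v) { int c = 0; for (; v; v >>= 1) c += v & 1; return c; }

/* Encode `word` against the previous bus value *prev; returns the invert
 * bit. *prev is updated to the value actually driven on the bus. */
int bus_invert(unsigned *prev, unsigned word)
{
    int dist = popcount8((*prev ^ word) & 0xFF);  /* Hamming distance */
    if (dist > 4) {                               /* more than half of 8 lines */
        *prev = ~word & 0xFF;                     /* drive the complement */
        return 1;                                 /* invert line asserted */
    }
    *prev = word & 0xFF;
    return 0;
}
```

At the cost of one extra wire, no transfer ever toggles more than half the data lines, which bounds worst-case transmission energy.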
Compression

Opportunity: if we can compress instructions or data, we can use smaller memories
Issues?
Compression possibilities: mostly instructions
Instruction compression:
  a small subset of instructions is typically used
  the de-compressor may lie
    between cache and memory: activated only on a miss
    between CPU and cache: better compression benefit, since the cache itself holds compressed code
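The "small subset of instructions" observation can be exploited with a dictionary scheme: frequent instruction words are stored once and replaced by short indices. The dictionary size, 1-byte index, and escape-byte framing are assumptions for illustration, not a specific ISA's encoding.

```c
#include <stddef.h>

/* Dictionary-based instruction compression sketch: the most frequent
 * 32-bit instruction words are stored once in a small dictionary; each
 * occurrence in the program becomes a 1-byte index, while instructions
 * outside the dictionary are kept raw behind an escape byte. */
#define DICT_SIZE 4

/* Returns the compressed size in bytes of `n` 32-bit instructions. */
size_t compressed_bytes(const unsigned *code, size_t n,
                        const unsigned dict[DICT_SIZE])
{
    size_t bytes = 0;
    for (size_t i = 0; i < n; i++) {
        int hit = 0;
        for (int d = 0; d < DICT_SIZE; d++)
            if (code[i] == dict[d]) { hit = 1; break; }
        bytes += hit ? 1 : 1 + 4;   /* 1-byte index, or escape byte + raw word */
    }
    return bytes;
}
```

When most of the instruction stream falls in the dictionary, the program shrinks towards a quarter of its raw size, which is what lets the instruction memory (or cache) be smaller and cheaper to access.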


Memory Customization

Memory Architecture Customization
Tailoring memory architecture parameters to the application:
  cache size, line size
  banking structure
  scratch pad memory
Customization

Memory optimization in terms of memory access rate:
  design the algorithm (program logic) so that the memory access rate is as low as possible
  consider power consumption while designing the logic and minimize repeated accesses to the same data

Example: Matrix multiplication
  How many memory accesses (in terms of data/opcode)?
  What optimization is possible?

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        c[i][j] = 0;                        <==== N*N memory accesses for c
        for (k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];   <==== N*N*N accesses each for a, b, and c
    }
}

Memory accesses: for c, N*N + N*N*N
                 for a and b, N*N*N each
Optimization: can we reduce the accesses to these variables? If so, less power is consumed.
Customization

Optimization for matrix multiplication:
  try to use a smaller memory
  add a local variable whose stack slot can live in a register or the scratch pad memory

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        result = 0;
        for (k = 0; k < N; k++)
            result += a[i][k] * b[k][j];    <==== still N*N*N accesses each for a and b
        c[i][j] = result;                   <==== only N*N accesses for c
    }
}

Optimized number of memory accesses: for c, only N*N
                                     for a and b, N*N*N each
So the power consumed accessing c becomes much lower.
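The access counts above can be verified by instrumenting both versions with a counter; each update of c is counted as one access, matching the slide's tally, and the counter itself is only bookkeeping for illustration.

```c
/* Count accesses to c[][] in both matrix-multiply versions (N kept small).
 * Each touch of c (the initialization, the in-place update, or the final
 * store) bumps c_accesses once, following the slide's counting. */
#define N 4
static long c_accesses;

void mm_naive(int c[N][N], int a[N][N], int b[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            c[i][j] = 0; c_accesses++;                 /* N*N initial writes */
            for (int k = 0; k < N; k++) {
                c[i][j] += a[i][k] * b[k][j];
                c_accesses++;                          /* N*N*N updates of c */
            }
        }
}

void mm_scratch(int c[N][N], int a[N][N], int b[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int result = 0;                            /* lives in a register */
            for (int k = 0; k < N; k++)
                result += a[i][k] * b[k][j];           /* no c access here */
            c[i][j] = result; c_accesses++;            /* only N*N writes */
        }
}
```

For N = 4, the naive version touches c 80 times (N*N + N*N*N) while the scratch-variable version touches it only 16 times (N*N), which is exactly the reduction claimed above.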
Memory Banking

A memory bank is a part of (cache) memory:
  banks are addressed consecutively within the total set of memory banks,
  i.e., when data item a(n) is stored in bank b, data item a(n+1) is stored in bank b+1
  (cache) memory is divided into banks to evade the effects of the bank cycle time
  when data is stored or retrieved consecutively, each bank has enough time to recover before the next request for that bank arrives

Memory banking is important in power optimization:
  again, the memory access rate matters
  basic idea: idle banks can be turned off
  arrange data and data accesses into banks such that this opportunity is created
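The interleaving rule above (item n in bank b, item n+1 in bank b+1) amounts to bank = address mod NBANKS, so which banks a loop touches, and which can be turned off, follows directly from its access stride. The 4-bank configuration is an assumption for illustration.

```c
/* Bank interleaving sketch: consecutive words go to consecutive banks
 * (bank = word address mod NBANKS). A strided loop touches only some
 * banks, so the untouched banks can be turned off for the whole loop. */
#define NBANKS 4

int bank_of(unsigned word_addr) { return word_addr % NBANKS; }

/* Bitmask of banks touched by n accesses starting at `start` with `stride`. */
unsigned banks_touched(unsigned start, unsigned n, unsigned stride)
{
    unsigned mask = 0;
    for (unsigned i = 0; i < n; i++)
        mask |= 1u << bank_of(start + i * stride);
    return mask;
}
```

A unit-stride sweep activates all four banks, but a stride-4 sweep lands entirely in bank 0, so three of the four banks could be powered down: this is the data-layout opportunity the slide describes.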
Reconfigurable Caches

Configure cache blocks dynamically. How?
Adjusting cache parameters:
  dynamically adjust
    size
    associativity
    line size
How does this help?

Reconfigurable Caches

Reconsider parameter adjustability: what cache reconfiguration is possible?
  loop nesting behavior
  conflict prediction

How to adjust for the loop nesting requirement?
  Should the cache be flushed?
    if data reuse is expected, don't flush
  Should loops be reconsidered to decide which variables/data will be accessed?
    use a smaller or medium cache/memory based on the loop's nature
Reconfigurable Caches

How to adjust for conflict prediction?
  cache size: more associativity gives more space to utilize
  associativity: when few conflicts are predicted, associativity can be reduced

Example: matrix computations

Fig1:
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            c[i][j] = 0;
            for (k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
    }

Fig2:
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            x = a[i][j] + b[i][j];
            for (k = 0; k < N; k++)
                c[i][k] = c[i][k] + x;
        }
    }

Fig1: a similar amount of data is accessed, so the same cache size is required
Fig2: fewer conflicts in the second loop, so a lower associativity can be used
Commercial Implementations

Software caches on the CELL processor:
  cache emulated in software
References

P. R. Panda, F. Catthoor, N. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandecappelle, P. G. Kjeldsberg, "Data and Memory Optimization Techniques for Embedded Systems," ACM Transactions on Design Automation of Electronic Systems, 6(2), Apr 2001.

W. Wolf and M. Kandemir, "Memory System Optimization of Embedded Software," Proceedings of the IEEE, 91(1), Jan 2003.