IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 1, JANUARY 2011

Program Phase-Aware Dynamic Voltage Scaling Under Variable Computational Workload and Memory Stall Environment

Jungsoo Kim, Student Member, IEEE, Sungjoo Yoo, Member, IEEE, and Chong-Min Kyung, Fellow, IEEE

Abstract: Most complex software programs are characterized by program phase behavior and runtime distribution. The dynamism of these two characteristics often makes design-time workload prediction difficult and inefficient. In particular, memory stall time, whose variation is significant in memory-bound applications, has been mostly neglected or handled in an overly simplistic manner in previous works. In this paper, we present a novel online
in previous works. In this paper, we present a novel online
dynamic voltage and frequency scaling (DVFS) method which
takes into account both program phase behavior and runtime
distribution of memory stall time, as well as computational
workload. The online DVFS problem is addressed in two ways:
intraphase workload prediction and program phase detection.
The intraphase workload prediction is to predict the workload
based on the runtime distribution of computational workload
and memory stall time in the current program phase. The
program phase detection is to identify to which program phase
the current instant belongs and then to obtain the predicted
workload corresponding to the detected program phase, which
is used to set voltage and frequency during the program phase.
The proposed method considers leakage power consumption as
well as dynamic power consumption by a temperature-aware
combined Vdd/Vbb scaling. Experimental results show that, compared to a conventional method, the proposed method provides up to 34.6% and 17.3% energy reduction for two multimedia applications, the MPEG4 and H.264 decoders, respectively.
Index Terms: Dynamic voltage and frequency scaling (DVFS), energy optimization, memory stall, phase, runtime distribution.

I. Introduction

DYNAMIC voltage and frequency scaling (DVFS) is one of the most effective methods for lowering energy consumption. DVFS is also used to suppress leakage energy by dynamically controlling the supply voltage (Vdd) and the body bias voltage (Vbb). Accurate prediction of the remaining workload (hereafter, workload prediction) plays a central role in DVFS, where the frequency level of the processor is set as the ratio of the remaining workload to the time-to-deadline.

Manuscript received March 15, 2010; accepted July 27, 2010. Date of current version December 17, 2010. This work was supported in part by the National Research Foundation of Korea Grant funded by the Korean Government, under Grant 2010-0000823, and by the Brain Korea 21 Project, the School of Information Technology, Korea Advanced Institute of Science and Technology, in 2010. This paper was recommended by Associate Editor H.-H. S. Lee.
J. Kim and C.-M. Kyung are with the Korea Advanced Institute of Science and Technology, Daejeon 305-701, South Korea (e-mail: jungsoo.kim83@gmail.com; kyung@ee.kaist.ac.kr).
S. Yoo is with the Pohang University of Science and Technology, Pohang 790-784, South Korea (e-mail: sungjoo.yoo@postech.ac.kr).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCAD.2010.2068630
The workload of a software program varies due to data dependency (e.g., loop counts), control dependency (e.g., if/else and switch/case statements), and architectural dependency [e.g., cache hits/misses, translation lookaside buffer (TLB) hits/misses, and so on]. To tackle the workload variation, extensive works have been proposed [9]-[13], [19] assuming that the workload (i.e., the number of elapsed clock cycles seen by the processor) is invariant to processor frequency scaling. However, this assumption is not appropriate for applications with significant memory accesses. Fig. 1(a) shows the distribution of the per-frame workload of an MPEG4 decoder at two different frequency levels, i.e., 1 and 2 GHz. It was obtained by decoding 3000 frames of a 1920x800 movie clip (an excerpt from Dark Knight) on an LG XNOTE LW25 laptop.1 As shown in Fig. 1(a), the workload increases as the processor frequency increases. This is due to the processor stall cycles spent waiting for data from external memory (e.g., SDRAM, SSD, and so on). For example, when the memory access time is 100 ns, each off-chip memory access takes 100 and 200 processor clock cycles at 1 GHz and 2 GHz, respectively. Since the memory access time, called memory stall time, is invariant to the processor clock frequency, the number of processor clock cycles spent for memory access grows as the clock frequency increases.
To consider memory stall time in clock frequency scaling, [4]-[6] present DVFS methods which set the clock frequency of the processor based on the decomposition of the whole workload into two clock frequency-invariant components: computational workload and memory stall workload. Computational workload is the number of clock cycles spent for instruction execution, and memory stall workload corresponds to memory stall time. Based on the decomposed workloads, previous methods set the clock frequency, f, as f = w^comp/(t_d^R − t^stall), where w^comp and t^stall represent the average (or worst-case) computational workload and memory stall time, respectively. t_d^R is the time-to-deadline.
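The frequency-setting rule above, f = w^comp/(t_d^R − t^stall), can be made concrete with a small sketch. The function name and all numbers below are illustrative assumptions, not measurements from the paper:

```python
# Sketch of memory-stall-aware frequency selection, f = w_comp / (t_deadline - t_stall).
# All values are illustrative assumptions.

def pick_frequency(w_comp_cycles, t_deadline_s, t_stall_s):
    """Return the lowest frequency (Hz) that finishes w_comp_cycles of
    computation in the time left after subtracting the memory stall time."""
    compute_time = t_deadline_s - t_stall_s
    if compute_time <= 0:
        raise ValueError("deadline cannot absorb the memory stall time")
    return w_comp_cycles / compute_time

# A frame with 10e6 compute cycles, a 33 ms deadline, and 13 ms of memory
# stall needs 10e6 / 0.020 s = 500 MHz. A stall-oblivious policy that treats
# the whole budget as compute time would pick only ~303 MHz and miss the
# deadline once the stall cycles are added back.
f_aware = pick_frequency(10e6, 33e-3, 13e-3)
f_naive = 10e6 / 33e-3
```

The point of the sketch is only that ignoring t^stall biases the chosen frequency downward, which turns into a deadline miss once stall cycles reappear at runtime.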
Generally, computational workload and memory stall time have distributions as shown in Fig. 1(b) and (c). Fig. 1(b) shows the distribution of computational workload caused by data, control, and architectural dependency. The distribution of memory stall workload shown in Fig. 1(c) results mostly from L2 cache hits/misses, page hits/misses, and interference (e.g., memory access scheduling [1]) in accessing DRAM. As the distribution of memory stall workload becomes significant, previous DVFS methods based on the average (or worst-case) memory stall workload become inappropriate for reducing energy consumption.

Long-running software programs are mostly characterized by nonstationary phase behavior [14], [15]. For example, multimedia programs (e.g., MPEG4 and H.264 CODEC) have distinct time durations whose workload characteristics (e.g., mean, standard deviation, and maximum runtime) are clearly different from those of other time durations. We call each such distinct time duration a "program phase" [14], [15]. A formal definition of program phase is given in Section VIII. Fig. 1(d) exemplifies the program phase behavior of the MPEG4 decoder when decoding the first 1000 frames of the movie clip excerpted from Dark Knight. The x-axis and the left-hand y-axis represent the frame index and the per-frame decoding cycles, respectively. The right-hand y-axis represents the program phase index. Note that the program phase index does not correspond to the required performance level of the corresponding program phase in this example. As shown in Fig. 1(d), the entire time for decoding 1000 frames is classified into nine program phases, and, within a program phase, the per-frame decoding cycle has a runtime distribution. Fig. 1(e) shows the runtime distributions of three representative program phases out of the nine to illustrate that there can be a wide runtime distribution within each program phase.

Fig. 1. Per-frame profile results of the MPEG4 decoder when decoding the Dark Knight movie clip. (a) Total workload at 1 and 2 GHz. (b) Computational workload. (c) Memory stall time. (d) Phase behavior in the per-frame workload. (e) Runtime distributions of three representative phases.

1 The LG XNOTE LW25 laptop consists of a 2 GHz Intel Core2Duo T7200 processor with 128 KB L1 instruction and data caches, a 4 MB shared L2 cache, and 667 MHz 2 GB DDR2 SDRAM.

0278-0070/$26.00 © 2010 IEEE

A. Our Approach
Our observation of the runtime characteristics of software programs, as shown in Fig. 1, suggests that program workload has two characteristics: nonstationary program phase behavior and a runtime distribution (even within a program phase) of computational workload and memory stall time. Based on
the observations above, this paper presents an online DVFS
method that tackles the characteristics of program workload in
order to minimize the average energy consumption of software
program. We address the online DVFS problem in two ways:
intraphase workload prediction and program phase detection.
The intraphase workload prediction predicts workloads based
on the runtime distribution of computational workload and
memory stall time in the current program phase. The program
phase detection identifies to which program phase the current
instant belongs and then obtains the intraphase workload
prediction of the corresponding program phase, which is used
to set voltage and frequency during the program phase.
Leakage power consumption often dominates total power consumption, especially at high temperature. Our method tackles leakage power consumption with temperature-aware combined Vdd/Vbb scaling. During runtime, based on temperature readings as well as the runtime distribution, the online method selects the appropriate pair of Vdd and Vbb corresponding to the frequency level from a solution table prepared at design time.
This paper is organized as follows. Section II reviews related work. Section III gives preliminaries on our energy model and profiling method. Section IV presents the problem definition and solution overview, followed by the analytical formulation of our problem in Section V. Sections VI and VII explain the proposed runtime distribution-aware DVFS. Section VIII presents the program phase detection method. Section IX reports experimental results, followed by the conclusion in Section X.
II. Related Work
There are a number of workload prediction methods for online DVFS based on the (weighted) average, maximum, or most frequent workload, or on finding a repeated pattern among the N most recent workloads [2]. Recently, a control theory-based workload prediction method was proposed to accurately capture the transient behavior of workload [3]. To exploit memory stall time, [4] and [5] present memory stall time-aware methods for soft real-time intertask DVFS which lower the clock frequency by an amount proportional to the average ratio of external memory accesses per instruction to clock cycles per instruction. However, these memory stall time-aware DVFS methods are based on the average memory stall time and do not exploit the workload distribution or the nonstationary program phase behavior.
Runtime distribution in computational workload (in most cases, assuming a constant memory stall time) has been studied mostly in intratask DVFS methods, where the performance level is set dynamically during the execution of a task. There are several intratask DVFS methods where the workload is predicted based on program execution paths, e.g., the worst-case execution path [7], the average-case execution path [8], and a virtual execution path based on the probability of branch invocation [9]. [10] presents an analytic workload prediction method which minimizes the statistical average dynamic energy consumption. [11] presents a numerical solution for combined Vdd/Vbb scaling to tackle leakage energy. [12] and [13] present a DVFS method, called accelerating frequency schedules, which considers the per-task runtime distribution for a set of independent tasks. All the works mentioned above assume a constant memory stall time and a single program phase.
The program phase concept has been an active research topic because it opens new opportunities for performance optimization, e.g., program phase-aware dynamic adaptation of the cache architecture [14], [15]. Various methods have been proposed to characterize program phase behavior. Among them, a vector of the average execution cycles of basic blocks, called the basic block vector (BBV), is most widely used. By characterizing a program phase with a BBV, one can apply the program phase concept to DVFS as in [16] and [17]. A new program phase is detected when two BBVs are significantly different, e.g., when the Hamming distance between two BBVs is larger than a pre-defined threshold. Because there are a large number of basic blocks in typical software applications, program phase detection utilizing the full BBV is usually impractical. Thus, the key issue is to reduce the dimensionality of the BBV by identifying a subset of basic blocks that represents the program phase behavior. A random linear projection method is described in [14] and [15] to reduce the effort of exploring all combinations of basic blocks to identify the subset. In this paper, we present a program phase detection scheme suitable for DVFS, based on vectors of predicted workloads for coarse-grained code sections (instead of the BBV), as explained in Section VIII. In addition, unlike existing phase-based DVFS methods, our method exploits the runtime distribution within each program phase to better predict the remaining workload.
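Threshold-based phase detection of the kind described above can be sketched as follows. The Manhattan distance, the threshold value, and the feature vectors are assumptions made here for illustration (the cited works use BBVs with, e.g., a Hamming-distance test):

```python
# Illustrative sketch of threshold-based phase detection on feature vectors
# (BBV-style, or per-region workload vectors as in this paper). Distance
# metric, threshold, and vectors are assumed values, not from the paper.

def manhattan(u, v):
    """Sum of absolute per-component differences between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def detect_phase(vec, phase_table, threshold):
    """Return the index of the first stored phase whose representative
    vector is within `threshold` of `vec`; otherwise register `vec`
    as a new phase and return its index."""
    for idx, rep in enumerate(phase_table):
        if manhattan(vec, rep) <= threshold:
            return idx
    phase_table.append(list(vec))
    return len(phase_table) - 1

phases = []
a = detect_phase([100, 40, 60], phases, threshold=30)  # first vector: new phase
b = detect_phase([110, 38, 55], phases, threshold=30)  # distance 17: same phase
c = detect_phase([300, 90, 10], phases, threshold=30)  # distance 300: new phase
```

Per-phase statistics (mean, deviation, predicted workload) would then be stored per table entry, which is the role the detected phase index plays in the proposed method.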
Several online DVFS methods have been presented to utilize dynamic program behavior for further energy savings. [18] presents a workload prediction method utilizing the Kalman filter, which captures time-varying workload characteristics by adaptively reducing the prediction error via feedback. We presented an online workload prediction method which minimizes both dynamic and leakage energy consumption by exploiting the program phase behavior and the runtime distribution of computational cycles within each program phase [19]. Based on the assumption that memory stall time does not vary much during runtime, the distribution of memory stall time is not considered there; the memory stall time is simply accounted for as an integral (nonseparable) part of the total runtime of the software program. In memory-bound applications, however, where memory stall time becomes a significant portion of the total program runtime, the distribution of memory stall time needs to be exploited to achieve further energy reduction.
Compared to the method which sets voltage and frequency based on the average computational workload and memory stall time during program runs [4], our method has three distinctive features. First, our approach exploits the runtime distribution of both the computational cycles and the memory stall time, while only the average values are assumed in [4]. Second, we exploit program phase detection to achieve maximal reduction of energy consumption, while [4] utilizes the average workload of the whole program without the notion of program phase. Third, in our method, workload prediction is done in a temperature-adaptive manner to tackle the dependency of leakage energy on temperature, while the temperature dependence is ignored in [4].
III. Preliminaries
A. Processor Energy Model
Energy consumption per cycle (e) consists of switching (es) and leakage (el) components. Additionally, in the deep submicron regime, el is further divided into subthreshold (el^sub), gate (el^gate), and junction (el^junc) leakage energy. Putting them all together, we can express the total energy consumption per cycle as follows [20], [21]:

e = Ceff·Vdd^2 + Ng·f^(-1)·[Vdd·K1·exp(K2·Vdd)·exp(K3·Vbb) + Vdd·K4·exp(K5·Vdd) + |Vbb|·Ij]   (1)

where Ceff and Ng are the effective capacitance and the effective number of gates of the target processor, respectively. {K1, K2, K3}, {K4, K5}, and Ij are process-dependent curve-fitting parameter sets for el^sub, el^gate, and el^junc, respectively. In particular, the values of {K1, K2, K3} are functions of the operating temperature (T), since el^sub increases exponentially as the operating temperature increases. According to the BSIM4 model and [21], the temperature dependence of the parameters (K1, K2, and K3) is modeled as follows:

K1(T) = (T/Tref)^2 · exp(K6·(1 − Tref/T)) · K1(Tref)   (2)
K2(T) = (Tref/T) · K2(Tref)   (3)
K3(T) = (Tref/T) · K3(Tref)   (4)

where Tref is the reference temperature and K6 is a curve-fitting parameter. Thus, {K1, K2, K3} at temperature T can be obtained from their values at Tref using the relationships in (2)-(4).
Since the temperature-aware energy model in (1)-(4) is too complicated to be used in our optimization, we adopted a simplified energy model of combined Vdd/Vbb scaling to approximate the energy consumption per cycle at each temperature T as follows:

e(f, T) ≈ as(T)·f^bs(T) + al(T)·f^bl(T) + c(T)   (5)

TABLE I
Energy Fitting Parameters for Approximating the Processor Energy Consumption to the Accurate Estimation Obtained from PTscalar with BPTM High-k/Metal Gate 32 nm HP Model for Different Temperatures, Along with the Corresponding (Maximal and Average) Errors

Temperature (°C)   as         bs    al         bl     c      Max (Avg) Error (%)
25                 1.2x10^-1  1.3   4.6x10^-9  20.5   0.11   2.8 (0.9)
50                 1.2x10^-1  1.3   2.0x10^-7  16.6   0.12   1.4 (0.5)
75                 1.2x10^-1  1.3   2.2x10^-6  14.2   0.14   1.4 (0.4)
100                1.2x10^-1  1.3   1.4x10^-5  12.4   0.15   1.7 (0.7)

Fig. 2. Memory stall time vs. number of L2 cache misses, as approximated by a straight line.

where as (T ), bs (T ) and al (T ), bl (T ) are sets of curvefitting parameters which model frequency-dependent portion in
es (T ) and el (T ), respectively. c(T ) is a curve-fitting parameter
corresponding to the amount of frequency-independent energy
portion in e(f, T ). Table I shows examples of fitting parameters which approximate the processor energy consumption
obtained from PTscalar [21] and Cacti5.3 [22] with the energy
model, i.e., (1)(4), for Berkeley predictive technology model
(BPTM) high-k/metal gate 32 nm HP model [23] at 25 C,
50 C, 75 C, and 100 C. In the modeling, we configured a
target processor in PTscalar as the best-effort estimate of Core
2-class microarchitecture using the parameters presented in
[24]. As Table I shows, the simplified energy model tracks
the original energy model within 2.8% of maximum error for
all the operating temperatures. Note that fitting parameters
for modeling switching energy consumption, i.e., as and bs ,
are unchanged as temperature varies because switching energy
consumption is temperature invariant.
Processor energy consumption depends on the type of instructions executed in the pipeline [25]. To simply account for this dependence, we classify processor operation into two states: the computational state, for executing instructions, and the memory stall state, mostly spent waiting for data from memory. When the processor is in the memory stall state, switching energy consumption can be suppressed by clock gating, while leakage energy consumption is almost the same as in the computational state. The reduction ratio of switching energy, called the clock gating fraction and denoting the fraction of the circuit that is clock-gated, is modeled as a factor η (0.1 in our experiments). Thus, the energy consumption per clock cycle in each processor state is calculated as follows:

e^comp = as·f^bs + al·f^bl + c   (6)
e^stall = η·as·f^bs + al·f^bl + c   (7)

where e^comp and e^stall represent the energy consumption per cycle in the computational and memory stall states, respectively. Given a desired frequency level (f), one can always find a pair of Vdd and Vbb that gives the minimum energy consumption per cycle using combined Vdd/Vbb scaling [11].
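The per-state model of (5)-(7) can be sketched numerically. The parameter values below follow the 25 °C row of Table I as reconstructed here (with frequency normalized to the maximum), so treat them as illustrative rather than authoritative:

```python
# Sketch of the simplified per-cycle energy model (5)-(7):
#   e(f, T) = as(T)*f**bs(T) + al(T)*f**bl(T) + c(T)
# with the stall state scaling the switching term by the clock gating
# fraction eta. Parameter values are illustrative (25 C row of Table I
# as reconstructed here); frequency f is normalized to the maximum.

AS, BS = 1.2e-1, 1.3    # switching-energy fit
AL, BL = 4.6e-9, 20.5   # leakage-energy fit
C = 0.11                # frequency-independent portion
ETA = 0.1               # clock gating fraction in the stall state

def e_comp(f):
    """Energy per cycle in the computational state, per (6)."""
    return AS * f**BS + AL * f**BL + C

def e_stall(f):
    """Energy per cycle in the memory stall state, per (7):
    switching is suppressed to a fraction ETA by clock gating."""
    return ETA * AS * f**BS + AL * f**BL + C

# At any frequency the stall state is cheaper: clock gating removes 90% of
# the switching component while the leakage and base terms are unchanged.
```

At f = 1 the gap e^comp − e^stall equals (1 − η)·as, which is exactly the suppressed switching share; this is why clock-gated stall cycles matter less to switching energy but still pay full leakage.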
B. Runtime Workload Profiling
The total number of processor execution cycles, x, can be expressed as the sum of the number of clock cycles for executing instructions in the processor, x^comp, and the number of stall cycles for accessing external memory, x^stall, where the latter is expressed as a function of the memory stall time, t^stall, and the frequency, f, as follows:

x = x^comp + x^stall = x^comp + f·t^stall.   (8)

To decompose the processor cycles into the two clock frequency-invariant components, i.e., x^comp and t^stall, during program runs, we adopt an online profiling method which uses the performance counters of the processor, as presented in [4]. We model t^stall using only the number of last-level cache misses (N^L2miss; in our experiments, L2 is the last-level cache). The rationale for modeling t^stall only with N^L2miss is twofold. First, the effect of last-level cache misses dominates the others (TLB misses, interrupts, and so on) according to our experiments. Second, the number of events that can be simultaneously monitored in a processor is usually limited (two events in our experimental platform). In our model, t^stall is expressed as follows:

t^stall = ap·N^L2miss + bp   (9)

where ap and bp are fitting parameters. Fig. 2 illustrates that (9) (solid line) tracks the measured memory stall time (dots) quite well when running the H.264 decoder program of FFMPEG [29].
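Calibrating the fitting parameters ap and bp of (9) amounts to a linear fit of measured stall time against the L2-miss counter. A minimal least-squares sketch on synthetic samples (all numbers are assumptions for illustration):

```python
# Sketch of calibrating the linear stall-time model of (9),
#   t_stall = ap * n_l2_miss + bp,
# via ordinary least squares on (counter, measured-stall) samples.
# The sample data below are synthetic.

def fit_stall_model(misses, stalls):
    """Return (ap, bp) minimizing the squared error of (9) over the samples."""
    n = len(misses)
    mx = sum(misses) / n
    my = sum(stalls) / n
    sxx = sum((x - mx) ** 2 for x in misses)
    sxy = sum((x - mx) * (y - my) for x, y in zip(misses, stalls))
    ap = sxy / sxx          # slope: seconds of stall per L2 miss
    bp = my - ap * mx       # intercept: miss-independent stall time
    return ap, bp

# Synthetic samples generated from t_stall = 100 ns * misses + 2 us:
misses = [1000, 2000, 3000, 4000]
stalls = [100e-9 * m + 2e-6 for m in misses]
ap, bp = fit_stall_model(misses, stalls)
```

On real hardware the samples would come from reading the cycle and L2-miss counters around each code section, as the online profiling method of [4] does.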
In a typical software program, the x^comp and t^stall obtained from running a code section are correlated with each other. This is because t^stall of a code section is proportional to the number of external memory references, which is highly correlated with the number of memory instructions (e.g., loads and stores) executed in the code section, while x^comp of a code section depends on the type and number of executed instructions, including memory instructions. To consider the correlation between
computational cycle (xcomp ) and memory stall time (t stall ), we
model the distribution of xcomp and t stall of a code section using
a joint probability density function (PDF) as shown in Fig. 3.
During runtime, the joint PDF is obtained as follows. After the
execution of a code section, t stall is obtained from (9). Then,
from (8), xcomp is calculated with x and t stall . The probability
of occurrence of a pair of xcomp and t stall is defined as the ratio
of the number of occurrences of the pair to the total number
of executions of the code section.
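The profiling loop described above can be sketched as follows. The bin widths, the (9)-model parameters, and the sample counter readings are assumptions for illustration:

```python
# Sketch of the online joint-PDF bookkeeping: after a code section finishes,
# derive t_stall from the L2-miss counter via (9), recover x_comp from the
# total cycle count via (8), and update a joint histogram. All parameter
# values are illustrative assumptions.

from collections import Counter

AP, BP = 100e-9, 2e-6                    # assumed fit of (9)
COMP_BIN, STALL_BIN = 1_000_000, 1e-3    # histogram bin widths

hist = Counter()
runs = 0

def record(total_cycles, n_l2_miss, f_hz):
    """Profile one execution of a code section into the joint histogram."""
    global runs
    t_stall = AP * n_l2_miss + BP             # (9)
    x_comp = total_cycles - f_hz * t_stall    # (8) rearranged
    key = (int(x_comp // COMP_BIN), int(t_stall // STALL_BIN))
    hist[key] += 1
    runs += 1

def joint_pdf(key):
    """Probability of an (x_comp, t_stall) bin: occurrences / total runs."""
    return hist[key] / runs

record(12_000_000, 20_000, 1e9)   # 20k misses at 1 GHz
record(12_500_000, 20_500, 1e9)
```

Each histogram cell is exactly the "ratio of the number of occurrences of the pair to the total number of executions of the code section" described above, discretized into bins so the table stays small enough for online use.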

Fig. 3. Joint PDF with respect to computational workload (x^comp) and memory stall time (t^stall).

Fig. 4. Solution inputs. (a) Software program (or source code) partitioned into program regions. (b) Energy model (in terms of energy-per-cycle) as a function of frequency. (c) f-v table storing the energy-optimal pairs (Vdd, Vbb) for N frequency levels.

IV. Problem Definition and Solution Overview


Fig. 4 illustrates the three types of input required by the proposed procedure. Fig. 4(a) shows a software program partitioned into program regions, each shown as a box. A program region is defined as a code section with an associated voltage/frequency setting. The partitioning can be performed manually by a designer or via an automatic tool [26] based on the execution cycles of code sections obtained by a priori simulation of the software program. The ith program region is denoted as ni, while the first and last program regions are called the root (nroot) and leaf (nleaf) program regions, respectively. In this paper, we focus on a software program which periodically runs from nroot to nleaf at every time interval. At the start of a program region, the voltage/frequency is set and maintained until the end of the program region. At the end of a program region, the computational cycles and memory stall time are profiled. Then, as explained in Section III-B, the joint PDF of computational cycles and memory stall time is updated as shown in Fig. 3. Fig. 4(b) shows an energy model (more specifically, energy-per-cycle vs. frequency). Fig. 4(c) shows a pre-characterized table, called the f-v table, in which the energy-optimal pair (Vdd, Vbb) is stored for each frequency level (f). When the frequency is scaled, Vdd and Vbb are adjusted to the corresponding levels stored in the table. Note that, due to the dependency of leakage energy on temperature, the energy-optimal values of (Vdd, Vbb) corresponding to f vary depending on the operating temperature. Therefore, we prepare an f-v table for each of a set of quantized temperature levels.
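The f-v table lookup of Fig. 4(c) might be sketched as below. Every (Vdd, Vbb) entry and the temperature quantization are invented placeholders, not characterized values:

```python
# Sketch of the pre-characterized f-v table of Fig. 4(c): for each quantized
# temperature level and frequency, store the energy-optimal (Vdd, Vbb) pair
# and look it up when scaling. All table entries are invented placeholders.

F_V_TABLE = {
    # (temp_level_C, freq_GHz): (Vdd_V, Vbb_V)
    (25, 1.0): (0.85, -0.40),
    (25, 2.0): (1.10, -0.20),
    (75, 1.0): (0.85, -0.55),
    (75, 2.0): (1.10, -0.35),
}
TEMP_LEVELS = (25, 75)

def scale(freq_ghz, temp_c):
    """Pick the nearest quantized temperature level, then return the
    stored energy-optimal (Vdd, Vbb) for the requested frequency."""
    level = min(TEMP_LEVELS, key=lambda t: abs(t - temp_c))
    return F_V_TABLE[(level, freq_ghz)]
```

Keeping one table per quantized temperature level turns the temperature-dependent (Vdd, Vbb) optimization into a constant-time runtime lookup, which is the reason the table is prepared at design time.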

Algorithm 1: Overall flow
1:  if (end of ni) then
2:      Online profiling and calculation of statistics (Section III-B)
3:      if (ni == nleaf) then
4:          iter++
5:          if ((iter % PHASE_UNIT) == 0) then
6:              for ni from nleaf to nroot do
7:                  Workload prediction for each energy component (Section VI)
8:              end for
9:              Program phase detection (Section VIII)
10:         end if
11:     end if
12: else if (start of ni) then
13:     Finding the workload of ni based on coordination (Section VII)
14:     Voltage/frequency scaling with feasibility check
15: end if

Given the three inputs in Fig. 4, we find the energy-optimal workload prediction, wi^opt, of each program region during program execution. Algorithm 1 shows the overall flow of the proposed method. The proposed method is largely divided into a workload prediction step (lines 1-11) and a voltage/frequency (v/f) setting step (lines 12-15), which are invoked at the end and at the start of every program region, respectively.
In the workload prediction step, we profile the runtime information, i.e., xi and ti^stall, and update the statistical parameters of the runtime distributions, e.g., the mean, standard deviation, and skewness of xi^comp and ti^stall (lines 1-2). After the completion of the leaf program region, the number of program runs, iter, is incremented (line 4). At every PHASE_UNIT program runs (line 5), where PHASE_UNIT is a predefined number of program runs (e.g., 20 frame decodings in MPEG4), we perform the workload prediction and program phase detection by utilizing the profiled runtime information and its statistical parameters (lines 5-10). The periodic workload prediction is performed in the reverse order of the program flow, i.e., from the end (nleaf) to the beginning (nroot) of the program, as presented in [10], [11], and [19] (lines 6-8). As will be explained in Sections V and VI, in this step we find local-optimal workload predictions for ni, each of which minimizes a single energy component instead of the total energy.2 Utilizing the local-optimal workload predictions, program phase detection is performed to identify which program phase the current instant belongs to (line 9).
In the v/f setting step (lines 12-15), which is performed at the start of each program region, a process called coordination determines the energy-optimal global workload prediction, wi^opt, from the combination of the local-optimal workload predictions of the detected program phase (line 13). Based on wi^opt, we set the voltage/frequency while satisfying the hard real-time constraint (line 14).
V. Analytical Formulation of Memory Stall Time-Aware DVFS
Assume that a program is partitioned into two program regions, ni and ni+1, and that each program region has a distinct computational cycle count and memory stall time. The energy model presented in Section III-A is used. The total energy consumption for running the two program regions, Ei, is calculated as follows:

Ei = Ei^comp + Ei^stall   (10)

where Ei^comp and Ei^stall represent the energy consumption for running the computational workload and the memory stall workload, respectively.
Ei^comp and Ei^stall each consist of three independent energy components: frequency-dependent switching energy (Esi^comp and Esi^stall), frequency-dependent leakage energy (Eli^comp and Eli^stall), and frequency-independent energy, called base energy (Ebi^comp and Ebi^stall, where Ebi = Ebi^comp + Ebi^stall). Thus, Ei is expressed as follows:

Ei = (Esi^comp + Eli^comp) + (Esi^stall + Eli^stall) + Ebi.   (11)

2 In this paper, total energy consumption is calculated as the sum of the five independent energy components as shown in (11).

Using (6)-(8), the five energy components in (11) are expressed as follows:

Esi^comp = as·fi^bs·xi^comp + as·fi+1^bs·xi+1^comp   (12)
Eli^comp = al·fi^bl·xi^comp + al·fi+1^bl·xi+1^comp   (13)
Esi^stall = η·(as·fi^bs·fi·ti^stall + as·fi+1^bs·fi+1·ti+1^stall)   (14)
Eli^stall = al·fi^bl·fi·ti^stall + al·fi+1^bl·fi+1·ti+1^stall   (15)
Ebi = c·(xi^comp + xi+1^comp + fi·ti^stall + fi+1·ti+1^stall).   (16)

The frequency of each program region, fi and fi+1, can be expressed as the ratio of the remaining computational workload prediction (wi and wi+1) to the predicted remaining time for running the computational workload, i.e., the total remaining time-to-deadline (ti^R and ti+1^R) minus the remaining memory stall time prediction (si and si+1), as shown in

fi = wi / (ti^R − si)   (17)
fi+1 = wi+1 / (ti+1^R − si+1).   (18)

ti+1^R in (18) is expressed as follows:

ti+1^R = ti^R − xi^comp/fi − ti^stall.   (19)

By replacing fi and ti+1^R with (17) and (19), fi+1 in (18) is rearranged as follows:

fi+1 = wi+1 / ((ti^R − si)·γi)   (20)

where

γi = 1 − xi^comp/wi − Δti^stall/(ti^R − si)   (21)
Δti^stall = (ti^stall + si+1) − si.   (22)
When the memory stall times of ni and ni+1, i.e., ti^stall and ti+1^stall, are unit functions (i.e., deterministic), the remaining memory stall time prediction is set to the sum of the memory stall times of the remaining program regions, i.e., si = ti^stall + ti+1^stall. In the same manner, si+1 is set to ti+1^stall because ni+1 is the leaf, i.e., last, program region in this case. Therefore, Δti^stall in (22) becomes zero, and thereby γi is independent of ti^R. Since we perform workload prediction from the leaf to the root program region as presented in [10], wi+1 is already known as wi+1^opt when calculating wi. With (17)-(20), (12)-(16) can be expressed as functions of wi and ti^R.
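The relations (17)-(22) can be checked numerically. The sketch below evaluates fi+1 both directly via (18)-(19) and via (20)-(21) in the unit-function case (Δti^stall = 0), where the two must coincide; all operating values are illustrative assumptions:

```python
# Numeric sketch of the frequency relations (17)-(22) for two regions.
# In the unit-function case s_i = t_stall_i + s_i1, so delta_t_stall = 0 in
# (22) and the direct (18)-(19) route must equal the (20)-(21) route.
# All values below are illustrative.

def region_frequencies(w_i, w_i1, t_r, s_i, s_i1, x_comp_i, t_stall_i):
    f_i = w_i / (t_r - s_i)                               # (17)
    t_r_next = t_r - x_comp_i / f_i - t_stall_i           # (19)
    f_i1 = w_i1 / (t_r_next - s_i1)                       # (18)
    gamma = (1 - x_comp_i / w_i
             - ((t_stall_i + s_i1) - s_i) / (t_r - s_i))  # (21) with (22)
    f_i1_via_gamma = w_i1 / ((t_r - s_i) * gamma)         # (20)
    return f_i, f_i1, f_i1_via_gamma

# Unit-function case: s_i = t_stall_i + s_i1 (3 ms = 2 ms + 1 ms).
f_i, f_i1, f_i1_g = region_frequencies(
    w_i=5e6, w_i1=3e6, t_r=10e-3,
    s_i=3e-3, s_i1=1e-3, x_comp_i=2e6, t_stall_i=2e-3)
```

Since γi loses its ti^R-dependent term when Δti^stall = 0, the same check is what allows the local-optimal predictions of Section VI to be computed independently of ti^R.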
Since Ei is continuous and convex with respect to wi, the energy-optimal workload prediction of the computational workload, wi^opt, can be obtained by finding the point which satisfies the following relation:

∂Ei/∂wi = ∂Esi^comp/∂wi + ∂Eli^comp/∂wi + ∂Esi^stall/∂wi + ∂Eli^stall/∂wi + ∂Ebi/∂wi = 0.   (23)

Since the total energy consumption, Ei, is a function of ti^R as well as wi, the wi^opt satisfying (23) varies with respect to ti^R. In other words, wi^opt has to be found for every ti^R. Because ti^R takes a wide range of values, performing a workload prediction for every value of ti^R is unrealistic. Therefore, we previously proposed a solution which performs a workload prediction for a set of quantized levels of ti^R [28]. However, it still requires many workload predictions, since more energy savings are obtained as ti^R is quantized into a larger number of levels. Thus, the method causes a large runtime overhead if it is applied as an online solution while maintaining its effectiveness (according to our experiments, the runtime overhead is 3.4 times the pure runtime of the H.264 decoder when ti^R is quantized into 30 levels).
To reduce the runtime overhead of finding an energy-optimal workload prediction, we propose a workload prediction method which finds wi^opt in two steps: 1) workload prediction which minimizes each energy component individually, called local-optimal workload prediction (Section VI), and 2) coordination of the local-optimal workload predictions to obtain the global workload prediction wi^opt (Section VII).
A local-optimal workload prediction finds the workload prediction which minimizes one of the five energy components in (11) by adjusting the voltage/frequency based on that prediction. For example, voltage/frequency scaling based on the local-optimal workload prediction of Esi^comp minimizes only the Esi^comp energy consumption. It can be obtained by finding the point which equates the corresponding single derivative in (23) to zero, i.e., ∂Esi^comp/∂wi = 0. Note that a local-optimal workload prediction can be calculated independently of ti^R, because γi in (21) is independent of ti^R (Δti^stall = 0).
A coordination of the local-optimal workload predictions finds the workload prediction which minimizes $\overline{E}_i$ by utilizing the five local-optimal workload predictions. When the derivative of one energy component with respect to $w_i$ dominates the others in (23), the workload prediction which satisfies (23) can be obtained by finding the point where the derivative of the dominant energy component becomes zero. For instance, when $\partial \overline{Es}_i^{comp}/\partial w_i$ dominates the others, $w_i^{opt}$ is simply set to $ws_i^{comp}$. When there are multiple dominant energy components, we need to coordinate them so as to find the workload prediction with lower total energy consumption. Finding the workload prediction [satisfying (23)] requires a numerical solution whose complexity is too high to be applied during runtime, as presented in [28]. In this paper, we present an efficient approach to coordinate local-optimal workload

predictions in a runtime-adaptive manner. We find $w_i^{opt}$ at the start of $n_i$ through the coordination of the local-optimal workload predictions.

Fig. 5. Three cases. (a) Case 1: unit functions for both $x_i^{comp}$ and $t_i^{stall}$. (b) Case 2: runtime distribution for $x_i^{comp}$ and unit function of $t_i^{stall}$. (c) Case 3: runtime distributions for both $x_i^{comp}$ and $t_i^{stall}$.

VI. Workload Prediction for Minimizing the Energy Consumption of a Single Energy Component

In this section, we assume that a program is partitioned into two consecutive program regions, $n_i$ and $n_{i+1}$, and present a method which finds a local-optimal workload prediction while exploiting the runtime distributions of both computational workload and memory stall time. As Fig. 5 shows, we explain the local-optimal workload prediction method for three different cases of $J_i$, the joint PDF of $x_i^{comp}$ and $t_i^{stall}$. Case 1: $J_i$ is given as a unit function while $J_{i+1}$ is a general function, as shown in Fig. 5(a). Case 2: $x_i^{comp}$ alone has a runtime distribution while $t_i^{stall}$ is a unit function, as shown in Fig. 5(b). Case 3: both $x_i^{comp}$ and $t_i^{stall}$ have runtime distributions, as shown in Fig. 5(c).

A. Case 1: Both $x_i^{comp}$ and $t_i^{stall}$ Have Unit Functions

In this subsection, we explain the case where the joint PDF of $n_i$ is given as a unit function, as shown in Fig. 5(a). We define $ws_i^{comp}$, $wl_i^{comp}$, $ws_i^{stall}$, $wl_i^{stall}$, and $wb_i$ as the local-optimal workload predictions for minimizing $Es_i^{comp}$, $El_i^{comp}$, $Es_i^{stall}$, $El_i^{stall}$, and $Eb_i$, respectively. Given the joint PDFs $J_i$ and $J_{i+1}$, the average switching energy consumption for running the computational workload, $\overline{Es}_i^{comp}$, is calculated as the sum of $Es_i^{comp}$ with respect to $J_i$ and $J_{i+1}$ as follows:

$$\overline{Es}_i^{comp} = \sum\sum Es_i^{comp}\,J_i\,J_{i+1} = \frac{a_s}{(t_i^R - s_i)^{b_s}}\left[\,w_i^{b_s}\,\bar{x}_i^{comp} + \frac{(ws_{i+1}^{comp})^{b_s}\,\bar{x}_{i+1}^{comp}}{(1 - \bar{x}_i^{comp}/w_i)^{b_s}}\,\right] \qquad (24)$$

where $\bar{x}_i^{comp}$ and $\bar{x}_{i+1}^{comp}$ represent the averages of $x_i^{comp}$ and $x_{i+1}^{comp}$, respectively. Note that $x_i^{comp}$ is fixed as $\bar{x}_i^{comp}$ since $J_i$ is a unit function in this case. $w_{i+1}^{opt}$ is replaced by $ws_{i+1}^{comp}$ since we perform the local-optimal workload prediction of $Es_i^{comp}$. Since $\overline{Es}_i^{comp}$ is continuous and convex in $w_i$, $ws_i^{comp}$ can be obtained by finding the point which satisfies

$$\frac{\partial \overline{Es}_i^{comp}}{\partial w_i} = \frac{a_s b_s w_i^{b_s-1}}{(t_i^R - s_i)^{b_s}}\left[\,\bar{x}_i^{comp} - \frac{(ws_{i+1}^{comp})^{b_s}\,\bar{x}_{i+1}^{comp}\,\bar{x}_i^{comp}}{(w_i - \bar{x}_i^{comp})^{b_s+1}}\,\right] = 0. \qquad (25)$$

By rearranging (25) with respect to $w_i$, we can express $ws_i^{comp}$ in a closed-form expression as follows:

$$ws_i^{comp} = \bar{x}_i^{comp} + \left[(ws_{i+1}^{comp})^{b_s}\,\bar{x}_{i+1}^{comp}\right]^{\frac{1}{b_s+1}} = \bar{x}_i^{comp} + \widetilde{ws}_{i+1}^{comp}. \qquad (26)$$

Equation (26) shows that $ws_i^{comp}$ consists of two components: 1) the workload of the $i$th program region, $\bar{x}_i^{comp}$, and 2) $\widetilde{ws}_{i+1}^{comp} = [(ws_{i+1}^{comp})^{b_s}\,\bar{x}_{i+1}^{comp}]^{1/(b_s+1)}$, called the effective remaining workload of $n_{i+1}$ with respect to $Es_i^{comp}$, corresponding to the portion of the remaining workload after program region $n_i$.
Fig. 5(a) illustrates the calculation of $ws_i^{comp}$ presented in (26), where $J_i$ and $J_{i+1}$ are replaced by their representative workloads, $\bar{x}_i^{comp}$ and $\widetilde{ws}_{i+1}^{comp}$, respectively. In the same way, $wl_i^{comp}$, $ws_i^{stall}$, $wl_i^{stall}$, and $wb_i$ can be expressed as follows:

$$wl_i^{comp} = \bar{x}_i^{comp} + \left[(wl_{i+1}^{comp})^{b_l}\,\bar{x}_{i+1}^{comp}\right]^{\frac{1}{b_l+1}} = \bar{x}_i^{comp} + \widetilde{wl}_{i+1}^{comp} \qquad (27)$$

$$ws_i^{stall} = \bar{x}_i^{comp} + \left[\frac{(ws_{i+1}^{stall})^{b_s+1}\,\bar{t}_{i+1}^{stall}\,\bar{x}_i^{comp}}{\bar{t}_i^{stall}}\right]^{\frac{1}{b_s+2}} = \bar{x}_i^{comp} + \widetilde{ws}_{i+1}^{stall} \qquad (28)$$

$$wl_i^{stall} = \bar{x}_i^{comp} + \left[\frac{(wl_{i+1}^{stall})^{b_l+1}\,\bar{t}_{i+1}^{stall}\,\bar{x}_i^{comp}}{\bar{t}_i^{stall}}\right]^{\frac{1}{b_l+2}} = \bar{x}_i^{comp} + \widetilde{wl}_{i+1}^{stall} \qquad (29)$$

$$wb_i = \bar{x}_i^{comp} + \left[\frac{wb_{i+1}\,\bar{t}_{i+1}^{stall}\,\bar{x}_i^{comp}}{\bar{t}_i^{stall}}\right]^{\frac{1}{2}} = \bar{x}_i^{comp} + \widetilde{wb}_{i+1} \qquad (30)$$

where $\widetilde{wl}_{i+1}^{comp}$, $\widetilde{ws}_{i+1}^{stall}$, $\widetilde{wl}_{i+1}^{stall}$, and $\widetilde{wb}_{i+1}$ are the effective remaining workloads of $n_{i+1}$ with respect to $El_i^{comp}$, $Es_i^{stall}$, $El_i^{stall}$, and $Eb_i$, respectively. Since a local-optimal workload can be calculated by simply summing the effective remaining workloads of the program regions, as shown in (26)–(30), it can be obtained during program runs with negligible runtime overhead.³

If the software program consists of a cascade of program regions with conditional branches, we can still calculate the effective remaining workload of a program region in a manner similar to [10].

³The runtime overhead of the local-optimal workload prediction is presented in Table VI.
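To make the recursion concrete, the closed forms (26)–(30) can be evaluated backward from the leaf region. The following sketch is restricted to the switching component (26); the region workloads and the exponent $b_s$ are made-up illustrative values, not taken from the paper:

```python
def local_optimal_switching_workloads(x_mean, b_s):
    """Backward recursion of (26):
    ws_i = x_i + ((ws_{i+1})**b_s * x_{i+1})**(1/(b_s+1)).

    x_mean[i] is the average computational workload of program region n_i.
    The leaf region has no remaining workload, so ws_leaf = x_leaf.
    """
    n = len(x_mean)
    ws = [0.0] * n
    ws[n - 1] = x_mean[n - 1]  # leaf: effective remaining workload is zero
    for i in range(n - 2, -1, -1):
        # effective remaining workload of n_{i+1} w.r.t. the switching energy
        eff_remaining = (ws[i + 1] ** b_s * x_mean[i + 1]) ** (1.0 / (b_s + 1))
        ws[i] = x_mean[i] + eff_remaining
    return ws

# Example: three regions with average workloads in Mcycles, b_s = 2 (assumed).
ws = local_optimal_switching_workloads([10.0, 8.0, 6.0], b_s=2)
```

Each region's prediction is its own average workload plus the effective remaining workload of its successor, so the whole chain is obtained in a single backward pass, which is why the runtime overhead stays negligible.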

B. Case 2: $x_i^{comp}$ Has a Runtime Distribution and $t_i^{stall}$ Has a Unit Function

In this subsection, we explain the case where $x_i^{comp}$ has a runtime distribution, with $t_i^{stall}$ still assumed to be a unit function, as shown in Fig. 5(b). In this case, the average energy consumption of the computational workload is expressed as follows:

$$\overline{Es}_i^{comp} = \sum\sum Es_i^{comp}\,J_i\,J_{i+1} = \frac{a_s}{(t_i^R - s_i)^{b_s}}\left[\,w_i^{b_s}\,\bar{x}_i^{comp} + (ws_{i+1}^{comp})^{b_s}\,\bar{x}_{i+1}^{comp}\sum_{j=1}^{N_c}\frac{p_i^{comp}(j)}{(1 - x_i^{comp}(j)/w_i)^{b_s}}\,\right] \qquad (31)$$

where $N_c$ is the number of quantized levels of $x_i^{comp}$ in its PDF, and $p_i^{comp}(j)$ represents the probability of $x_i^{comp}$ falling into the $j$th quantized level. Note that, in this case where $t_i^{stall}$ is given as a unit function, the joint PDF ($J_i$) of $x_i^{comp}$ and $t_i^{stall}$ is the same as the PDF of $x_i^{comp}$ at the given $t_i^{stall}$, i.e., $p_i^{comp}$. $ws_i^{comp}$ can be obtained by finding the $w_i$ which satisfies the following relation:

$$\frac{\partial \overline{Es}_i^{comp}}{\partial w_i} = \frac{a_s b_s w_i^{b_s-1}}{(t_i^R - s_i)^{b_s}}\left[\,\bar{x}_i^{comp} - (\widetilde{ws}_{i+1}^{comp})^{b_s+1}\sum_{j=1}^{N_c}\frac{x_i^{comp}(j)\,p_i^{comp}(j)}{(w_i - x_i^{comp}(j))^{b_s+1}}\,\right] = 0. \qquad (32)$$

Note that, in this case, $ws_i^{comp}$ can still be obtained independently of $t_i^R$, without loss of quality. However, contrary to (26), no explicit closed form exists for $ws_i^{comp}$. Thus, $ws_i^{comp}$ can be obtained only through a numerical solution approach as presented in [11], which is too time-consuming for runtime application. Instead, inspired by (26), we model the solution $ws_i^{comp}$ as follows:

$$ws_i^{comp} = \widehat{xs}_i^{comp} + \widetilde{ws}_{i+1}^{comp} \qquad (33)$$

where $\widehat{xs}_i^{comp}$ is the effective workload of program region $n_i$ for $Es_i^{comp}$, and $\widetilde{ws}_{i+1}^{comp}$ is obtained in the same way as presented in (26). From our observation that the energy-optimal workload prediction tends to have a value near the average and depends on the runtime distribution, we model $\widehat{xs}_i^{comp}$ as follows:

$$\widehat{xs}_i^{comp} = (1 + \delta s_i^{comp})\,\bar{x}_i^{comp} \qquad (34)$$

where $\delta s_i^{comp}$ is a parameter which represents the ratio of the distance between $\widehat{xs}_i^{comp}$ and $\bar{x}_i^{comp}$ to $\bar{x}_i^{comp}$. We calculate $\widehat{xs}_i^{comp}$ by exploiting a pre-characterization of the solutions. First, we prepare a lookup table LUT$_s^{comp}$ for $\delta s_i^{comp}$ during design time, and perform a table lookup to obtain $\delta s_i^{comp}$ during runtime. $\delta s_i^{comp}$ depends on the shape of the runtime distribution. Thus, we derived the indexes of LUT$_s^{comp}$ as follows:

1) Index 1: $\sigma_i^{comp}/\bar{x}_i^{comp}$, the standard deviation ($\sigma_i^{comp}$) normalized with respect to the mean of $n_i$ ($\bar{x}_i^{comp}$);
2) Index 2: $g_i^{comp}$, the skewness of $x_i^{comp}$;
3) Index 3: $\bar{x}_i^{comp}/\widetilde{ws}_{i+1}^{comp}$, the ratio of the mean of $n_i$ ($\bar{x}_i^{comp}$) to the effective remaining workload of $n_{i+1}$ ($\widetilde{ws}_{i+1}^{comp}$).

The rationale for choosing the three indexes is as follows. By substituting $ws_i^{comp}$ with (33) and (34), (32) is rearranged as follows:

$$\bar{x}_i^{comp} - (\widetilde{ws}_{i+1}^{comp})^{b_s+1}\sum_{j=1}^{N_c}\frac{x_i^{comp}(j)\,p_i^{comp}(j)}{\left((1 + \delta s_i^{comp})\,\bar{x}_i^{comp} + \widetilde{ws}_{i+1}^{comp} - x_i^{comp}(j)\right)^{b_s+1}} = 0. \qquad (35)$$

Note that the optimal $\delta s_i^{comp}$ can be obtained by finding the point which satisfies (35). As shown in (35), $\delta s_i^{comp}$ depends on $\bar{x}_i^{comp}$, $\widetilde{ws}_{i+1}^{comp}$ (Index 3), and the PDF of $x_i^{comp}$, i.e., $x_i^{comp}(j)$ and $p_i^{comp}(j)$, which is modeled as a skewed normal distribution in this paper, since the PDF usually does not follow a clean normal distribution.⁴ The skewed normal distribution is characterized by three parameters: $\bar{x}_i^{comp}$, $\sigma_i^{comp}$, and $g_i^{comp}$ (Index 1 and Index 2).

Fig. 6. $\delta s^{comp}$ as a function of (a) Index 1: $\sigma_i^{comp}/\bar{x}_i^{comp}$ and Index 3: $\bar{x}_i^{comp}/\widetilde{ws}_{i+1}^{comp}$, and (b) Index 2: skewness ($g_i^{comp}$), at 75 °C.

Fig. 6 shows $\delta s_i^{comp}$ as a function of the three indexes above. As shown in Fig. 6(a), $\delta s_i^{comp}$ increases with a wider distribution of $x_i^{comp}$, i.e., as $\sigma_i^{comp}/\bar{x}_i^{comp}$ increases, and also increases as the workload of $n_i$ relative to the effective remaining workload of $n_{i+1}$, i.e., $\bar{x}_i^{comp}/\widetilde{ws}_{i+1}^{comp}$, increases. It further increases as $g_i^{comp}$, the skewness of the PDF, moves to the right ($g_i^{comp} > 0$), as Fig. 6(b) shows.

In the same way, $wl_i^{comp}$, $ws_i^{stall}$, $wl_i^{stall}$, and $wb_i$ can also be calculated by finding $\delta l_i^{comp}$, $\delta s_i^{stall}$, $\delta l_i^{stall}$, and $\delta b_i$ from LUT$_l^{comp}$, LUT$_s^{stall}$, LUT$_l^{stall}$, and LUT$_b$, respectively. Note that $\delta s_i^{comp}$–$\delta b_i$ can be obtained by performing a table lookup with the statistical parameters (e.g., mean, standard deviation, and skewness) and the effective workload of $n_{i+1}$. Thus, a local-optimal workload prediction that exploits the runtime distribution of the computational workload can be found with negligible runtime overhead.
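The design-time characterization behind LUT$_s^{comp}$ can be mimicked by a direct numerical search: minimize the energy shape of (31) over $w$, then express the minimizer as the offset $\delta$ of (33)–(34). The following sketch does exactly that with a brute-force grid; the distribution samples, $b_s$, and the effective remaining workload are illustrative values, not the paper's:

```python
def delta_s(samples, w_rem, b_s=2, grid=4000):
    """Design-time sketch of one LUT_s^comp entry: numerically minimize the
    Case-2 energy shape of (31) over w, then return the relative offset
    delta of (33)-(34). samples = draws of x_i^comp, w_rem = effective
    remaining workload of n_{i+1} (all illustrative values)."""
    n = len(samples)
    x_bar = sum(samples) / n
    lo = max(samples) * 1.001            # w must exceed every sample
    hi = (x_bar + w_rem) * 3.0 + lo
    best_w, best_e = lo, float("inf")
    for k in range(grid + 1):
        w = lo + (hi - lo) * k / grid
        # energy shape of (31), dropping the constant a_s/(t_R - s_i)**b_s
        e = w ** b_s * x_bar + w_rem ** (b_s + 1) * sum(
            1.0 / (1.0 - x / w) ** b_s for x in samples) / n
        if e < best_e:
            best_w, best_e = w, e
    # (33)-(34): ws = (1 + delta) * x_bar + w_rem
    return (best_w - w_rem) / x_bar - 1.0

# A near-point-mass distribution vs. a wide one around the same mean.
d_narrow = delta_s([9.9, 10.0, 10.1], w_rem=20.0)
d_wide = delta_s([6.0, 10.0, 14.0], w_rem=20.0)
```

Consistent with the closed form (26), a point-mass distribution yields $\delta \approx 0$, and, consistent with Fig. 6(a), a wider distribution yields a larger $\delta$. At runtime only the stored $\delta$ values would be looked up; the search above runs offline.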
C. Case 3: Both $x_i^{comp}$ and $t_i^{stall}$ Have Runtime Distributions

When both $x_i^{comp}$ and $t_i^{stall}$ have runtime distributions, as shown in Fig. 5(c), the average switching energy consumption for running the computational workload, $\overline{Es}_i^{comp}$, can be calculated as the sum of $Es_i^{comp}$ with respect to the joint PDFs

⁴Note that a more accurate workload prediction can be performed with additional effort, as presented in [19], where the PDF of $x_i^{comp}$ is modeled as a multimodal distribution, with each mode given as a skewed normal distribution. Although more energy savings can be obtained from the multimodal modeling, in this paper we simply approximate the PDF as a single-modal skewed normal distribution in order to reduce the runtime overhead. However, the method can easily be extended to the multimodal case [19].

$J_i$ and $J_{i+1}$ as follows:

$$\overline{Es}_i^{comp} = \sum\sum Es_i^{comp}\,J_i\,J_{i+1} = \frac{a_s}{(t_i^R - s_i)^{b_s}}\left[\,w_i^{b_s}\,\bar{x}_i^{comp} + Zs_i^{comp}\,\right] \qquad (36)$$

where

$$Zs_i^{comp} = (ws_{i+1}^{comp})^{b_s}\,\bar{x}_{i+1}^{comp}\sum_{j=1}^{N_c}\sum_{k=1}^{N_s}\frac{J_i(j,k)}{(\eta_i(j,k))^{b_s}} \qquad (37)$$

where $\eta_i(j,k)$ and $J_i(j,k)$ denote $\eta_i$ in (21) and the probability that $(x_i^{comp}, t_i^{stall})$ falls into the $(j,k)$th quantized level, respectively. Since we set the predicted remaining memory stall time ($s_i$) to the sum of the averages of $t_i^{stall}$ and $t_{i+1}^{stall}$, $\Delta t_i^{stall}$ [defined in (22)] in $\eta_i$ is no longer zero. Due to the nonzero $\Delta t_i^{stall}$, the local-optimal workload prediction becomes a function of $t_i^R$. To reduce the solution complexity, we approximate the calculation of $Zs_i^{comp}$ in (37) as follows:

$$Zs_i^{comp} \approx \hat{s}_i^{comp}\,(ws_{i+1}^{comp})^{b_s}\,\bar{x}_{i+1}^{comp}\sum_{j=1}^{N_c}\frac{p_i^{comp}(j)}{(1 - x_i^{comp}(j)/w_i)^{b_s}} \qquad (38)$$

where

$$\hat{s}_i^{comp} = \sum_{j=1}^{N_c}\sum_{k=1}^{N_s}\left(\frac{1 - x_i^{comp}/w_i}{\eta_i(j,k)}\right)^{b_s} J_i(j,k). \qquad (39)$$

As shown in (39), $\hat{s}_i^{comp}$ depends on $w_i$ and $s_i$ because $\eta_i$ in (21) is a function of $w_i$ and $s_i$. Note that $(w_i, s_i)$ will be calculated at the end of the current program phase using the joint PDFs ($J_i$ and $J_{i+1}$) profiled during the time period of the current program phase. To simplify the interdependence between $\hat{s}_i^{comp}$ and $(w_i, s_i)$, we approximate the calculation of $\hat{s}_i^{comp}$ by replacing $(w_i, s_i)$ with $(ws_i^{comp}, s_i)$ of the current program phase. By substituting (38) with the approximated $\hat{s}_i^{comp}$, we can rearrange (36) as follows:

$$\overline{Es}_i^{comp} \approx \frac{a_s}{(t_i^R - s_i)^{b_s}}\left[\,w_i^{b_s}\,\bar{x}_i^{comp} + \hat{s}_i^{comp}\,(ws_{i+1}^{comp})^{b_s}\,\bar{x}_{i+1}^{comp}\sum_{j=1}^{N_c}\frac{p_i^{comp}(j)}{(1 - x_i^{comp}(j)/w_i)^{b_s}}\,\right]. \qquad (40)$$

Note that (40) is the same as (31), except for $\hat{s}_i^{comp}$. Therefore, in a similar way to (32) and (33), we can express $ws_i^{comp}$, which minimizes $\overline{Es}_i^{comp}$, as follows:

$$ws_i^{comp} = \widehat{xs}_i^{comp} + \widetilde{ws}_{i+1}^{comp} \qquad (41)$$

where

$$\widetilde{ws}_{i+1}^{comp} = \left[\hat{s}_i^{comp}\,(ws_{i+1}^{comp})^{b_s}\,\bar{x}_{i+1}^{comp}\right]^{\frac{1}{b_s+1}}. \qquad (42)$$

Compared to the calculation of $ws_i^{comp}$ in Case 1 and Case 2 in Fig. 5(a) and (b), the only difference is that $(\hat{s}_i^{comp})^{1/(b_s+1)}$ is multiplied in the calculation of the effective remaining workload of $n_{i+1}$, i.e., $\widetilde{ws}_{i+1}^{comp}$.⁵ $wl_i^{comp}$, $ws_i^{stall}$, $wl_i^{stall}$, and $wb_i$ can also be calculated in the same way.

⁵Note that when the memory stall has no distribution, i.e., $\Delta t_i^{stall} = 0$, $\hat{s}_i^{comp}$ becomes 1, and thereby $\widetilde{ws}_{i+1}^{comp}$ becomes the same as in Case 1 and Case 2.

Note that we perform the most time-consuming work of the workload prediction, i.e., finding $\delta s_i^{comp}$–$\delta b_i$ with respect to the runtime distribution, in a design-time step, and then store the parameters into LUTs. Thus, we can drastically reduce the runtime overhead of finding a workload prediction while accurately considering the influence of the runtime distribution, because we only access the LUTs to find the workload prediction during runtime. However, this requires additional memory space to store the pre-characterized data. The runtime and area overheads are presented in Section IX-C.

TABLE II
Threshold Parameters Used in Coordination

Coordination Step | Threshold Parameter | Condition
C1 | $fs^{comp}$ | $\partial(a_s f^{b_s})/\partial f = \epsilon_c \cdot \partial(a_l f^{b_l})/\partial f$
C1 | $fl^{comp}$ | $\partial(a_l f^{b_l})/\partial f = \epsilon_c \cdot \partial(a_s f^{b_s})/\partial f$
C2 | $fb^{stall}$ | $\partial(a_l f^{b_l+1})/\partial f = \epsilon_c \cdot \partial(cf)/\partial f$
C2 | $fl^{stall}$ | $\partial(cf)/\partial f = \epsilon_c \cdot \partial(a_l f^{b_l+1})/\partial f$
C3 | $fs^{stall}$ | $\partial(a_s f^{b_s+1})/\partial f = \epsilon_c \cdot \partial(a_l f^{b_l+1} + cf)/\partial f$
C3 | $fL^{stall}$ | $\partial(a_l f^{b_l+1} + cf)/\partial f = \epsilon_c \cdot \partial(a_s f^{b_s+1})/\partial f$

$\epsilon_c$: user-defined threshold value.

Fig. 7. Hierarchical coordination to obtain the global workload prediction $w_i^{opt}$, where C1–C4 represent coordination steps.
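As a concrete reading of (38) and (39), the correction factor $\hat{s}_i^{comp}$ is simply an expectation over the profiled joint histogram. Since the definition of $\eta_i$ in (21) lies outside this excerpt, the sketch below takes it as a caller-supplied function; all numbers are illustrative:

```python
def s_hat(joint_pdf, eta, x_bar, w, b_s=2):
    """Correction factor of (39):
    s_hat = sum over bins (j, k) of ((1 - x_bar/w) / eta(j, k))**b_s * J(j, k).
    joint_pdf maps a bin (j, k) to its probability J(j, k); eta stands in for
    the quantity defined in (21), which is not part of this excerpt."""
    base = 1.0 - x_bar / w
    return sum((base / eta(j, k)) ** b_s * p
               for (j, k), p in joint_pdf.items())

# Sanity check mirroring footnote 5: with no memory-stall variation, eta
# reduces to (1 - x_bar/w) itself, so the factor evaluates to 1.
J = {(0, 0): 0.5, (1, 0): 0.5}
s = s_hat(J, eta=lambda j, k: 1.0 - 10.0 / 30.0, x_bar=10.0, w=30.0)
```

When the stall time does vary, $\eta_i(j,k)$ differs across bins and $\hat{s}_i^{comp}$ deviates from 1, scaling the effective remaining workload in (42) accordingly.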

VII. Frequency Selection Based on Coordination

In this section, we present a method called coordination, which finds the global workload prediction of $n_i$ ($w_i^{opt}$) based on the local-optimal workload predictions, i.e., $ws_i^{comp}$, $wl_i^{comp}$, $ws_i^{stall}$, $wl_i^{stall}$, and $wb_i$. As (23) shows, the workload prediction which minimizes the average total energy consumption at a given $t_i^R$ varies according to the sensitivity of each energy component with respect to $w_i$, i.e., $\partial \overline{Es}_i^{comp}/\partial w_i$, $\partial \overline{El}_i^{comp}/\partial w_i$, $\partial \overline{Es}_i^{stall}/\partial w_i$, $\partial \overline{El}_i^{stall}/\partial w_i$, and $\partial \overline{Eb}_i/\partial w_i$ in (23).

Since the coordination of workload predictions is performed online, it needs to be done with low overhead. To achieve this goal, we present a simple hierarchical method which finds $w_i^{opt}$ from the local-optimal workload predictions (independent of $t_i^R$), as shown in Fig. 7. First, we obtain the workload prediction for each workload type, i.e., the computational workload ($w_i^{comp}$) through a coordination step called C1, and the memory stall workload ($w_i^{stall}$) through coordination steps called C2 and C3. Then, we find $w_i^{opt}$ from $w_i^{comp}$ and $w_i^{stall}$ through a coordination step called C4.

Fig. 8. Linear coordination of (a) C1: $ws_i^{comp}$ and $wl_i^{comp}$ to find $w_i^{comp}$. (b) C4: $w_i^{comp}$ and $w_i^{stall}$ to find $w_i^{opt}$.

1) Coordination for $w_i^{comp}$ (C1): The workload prediction for the computational workload, $w_i^{comp}$, represents the prediction which minimizes $\overline{E}_i^{comp}$, i.e., the sum of $\overline{Es}_i^{comp}$ and $\overline{El}_i^{comp}$. Therefore, $w_i^{comp}$ depends on $ws_i^{comp}$ and $wl_i^{comp}$. In this coordination, we utilize the fact that $\overline{El}_i^{comp}$ has an exponential dependency on frequency under combined $V_{dd}/V_{bb}$ scaling. The rationale is as follows. In the low-frequency region, a high reverse body bias voltage can be applied, suppressing leakage energy consumption due to the high $V_{th}$. As frequency increases, $|V_{bb}|$ is decreased to enable higher clock frequency operation by reducing $V_{th}$, which drastically increases leakage energy consumption.

In combined $V_{dd}/V_{bb}$ scaling, the increase of switching energy consumption with respect to frequency, i.e., $\partial e_s/\partial f$, dominates in the lower frequency region, while the increase of leakage energy consumption, i.e., $\partial e_l/\partial f$, dominates in the relatively high frequency region [27]. Therefore, when most operating frequencies fall into the frequency range where the sensitivity of switching energy consumption is much larger than that of leakage energy consumption, i.e., $\partial e_s/\partial f \gg \partial e_l/\partial f$, $w_i^{comp}$ approaches $ws_i^{comp}$, because switching energy consumption is the major contributor in this frequency region. On the other hand, when the operating frequency is within the frequency region where $\partial e_s/\partial f \ll \partial e_l/\partial f$, $w_i^{comp}$ approaches $wl_i^{comp}$.

We partition the frequency range into three regions: the switching energy-dominant, leakage energy-dominant, and intermediate regions. The partition is done with two threshold frequencies, $fs^{comp}$ and $fl^{comp}$. The frequency range below $fs^{comp}$ (above $fl^{comp}$) is called the switching (leakage) energy-dominant region, while the frequency range between the two threshold frequencies is called the intermediate region. Each energy component has two threshold frequencies, as shown in Table II.

In order to identify which frequency partition the current program region belongs to, we introduce a simple evaluation metric, $f_i^{eval}$, as the upper bound of the operating frequency in the remaining program regions from $n_i$ to $n_{leaf}$

$$f_i^{eval} = \frac{WCEC_i^{comp(k)}}{t_i^R - WCET_i^{stall(k)}}. \qquad (43)$$

In (43), $WCEC_i^{comp(k)}$ and $WCET_i^{stall(k)}$ represent the remaining worst-case execution cycles of the computational workload and the remaining worst-case memory stall time from $n_i$ to $n_{leaf}$ when the current program phase is the $k$th program phase, respectively. The solid line in Fig. 8(a) illustrates the linear coordination method to find $w_i^{comp}$ by utilizing $f_i^{eval}$. When $f_i^{eval}$ is lower than the threshold value $fs^{comp}$ (in the second row of Table II, where $\epsilon_c$ is set to 5.0 in our experiment), we set $w_i^{comp}$ to $ws_i^{comp}$, because the remaining program regions will operate within the switching energy-dominant frequency region. When $f_i^{eval}$ is higher than the threshold value $fl^{comp}$ (in the third row of Table II), we set $w_i^{comp}$ to $wl_i^{comp}$. In the last case, i.e., $fs^{comp} < f_i^{eval} < fl^{comp}$, we set $w_i^{comp}$ in proportion to the ratio of $(f_i^{eval} - fs^{comp})$ to $(fl^{comp} - fs^{comp})$ using a linear interpolation function $L(\cdot)$ defined as follows:

$$L(X_{lower}, X_{upper}, Y_{lower}, Y_{upper}, X_{eval}) = \frac{X_{eval} - X_{lower}}{X_{upper} - X_{lower}}\,(Y_{upper} - Y_{lower}) + Y_{lower}. \qquad (44)$$

By applying $X_{lower} = fs^{comp}$, $X_{upper} = fl^{comp}$, $Y_{lower} = ws_i^{comp}$, $Y_{upper} = wl_i^{comp}$, and $X_{eval} = f_i^{eval}$, we can obtain $w_i^{comp}$ as the output of the function $L(\cdot)$.

2) Coordination for $w_i^{stall}$ (C2 and C3): The workload prediction for memory stall, $w_i^{stall}$, represents the prediction which minimizes $\overline{E}_i^{stall}$. Since $\overline{E}_i^{stall}$ depends on $\overline{Eb}_i$ as well as $\overline{Es}_i^{stall}$ and $\overline{El}_i^{stall}$, $w_i^{stall}$ is derived from $wb_i$ as well as $ws_i^{stall}$ and $wl_i^{stall}$. To obtain $w_i^{stall}$ by coordinating the three local-optimal workload predictions, we perform the coordination in two steps, as shown in Fig. 7. First, we find $wL_i^{stall}$ by coordinating $wl_i^{stall}$ and $wb_i$ (C2), both of which are related to leakage energy consumption. Then, we find $w_i^{stall}$ by coordinating $ws_i^{stall}$ and $wL_i^{stall}$ (C3). Note that the coordination for $wL_i^{stall}$ can be done in the same way as for $w_i^{comp}$, shown in Fig. 8(a), by simply substituting $(ws_i^{comp}, wl_i^{comp})$ with $(wl_i^{stall}, wb_i)$ and $(fs^{comp}, fl^{comp})$ with $(fl^{stall}, fb^{stall})$, where $fl^{stall}$ and $fb^{stall}$ are the threshold values defined in the fourth and fifth rows of Table II, respectively. In the same way, the coordination for $w_i^{stall}$ can be done by substituting the corresponding workload predictions and threshold values, i.e., $fs^{stall}$ and $fL^{stall}$ in Table II.

3) Coordination for $w_i^{opt}$ (C4): The last step of the coordination is to obtain $w_i^{opt}$ from $w_i^{comp}$ and $w_i^{stall}$. In CPU-bound applications, $w_i^{opt}$ approaches $w_i^{comp}$, since $\overline{E}_i^{comp}$ dominates $\overline{E}_i^{stall}$. On the contrary, in memory-bound applications, $w_i^{stall}$ contributes more to $w_i^{opt}$. We calculate the maximum memory-boundedness of the remaining program regions from $n_i$ to $n_{leaf}$, denoted by $\beta_i$, as the ratio of the worst-case remaining memory stall cycles from $n_i$ at $f_i^{eval}$ (43) to that of the computational cycles, i.e., $\beta_i = f_i^{eval} \cdot WCET_i^{stall(k)} / WCEC_i^{comp(k)}$. Fig. 8(b) illustrates the linear coordination method to find $w_i^{opt}$ by utilizing $\beta_i$. As $\beta_i$ becomes larger (smaller), the remaining work is characterized as more memory-bound (CPU-bound). When $\beta_i$ is smaller than a threshold value, called $\beta^{comp}$ (0.5 in our experiment), we regard the remaining workload as CPU-bound, and thereby set $w_i^{opt}$ to $w_i^{comp}$. On the other hand, if $\beta_i$ is larger than a threshold value, called $\beta^{stall}$ ($= 1/\beta^{comp}$ in our experiment), we set $w_i^{opt}$ to $w_i^{stall}$, since the remaining work is memory-bound. In the intermediate case, i.e., $\beta^{comp} < \beta_i < \beta^{stall}$, we set $w_i^{opt}$ in proportion to the ratio of $(\beta_i - \beta^{comp})$ to $(\beta^{stall} - \beta^{comp})$ using (44).
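Each coordination step above is the same clamped interpolation built on $L(\cdot)$ of (44). The following sketch shows it for C4; the thresholds (0.5 and 2.0) and the two local predictions (100 and 140) are made-up illustrative values:

```python
def lerp(x_lo, x_hi, y_lo, y_hi, x_eval):
    """Linear interpolation function L() of (44)."""
    return (x_eval - x_lo) / (x_hi - x_lo) * (y_hi - y_lo) + y_lo

def coordinate(y_lo, y_hi, x_eval, x_lo, x_hi):
    """One coordination step (C1-C4): clamp to a local-optimal prediction
    outside the two thresholds, interpolate linearly in between."""
    if x_eval <= x_lo:
        return y_lo
    if x_eval >= x_hi:
        return y_hi
    return lerp(x_lo, x_hi, y_lo, y_hi, x_eval)

# C4 example: beta thresholds 0.5 and 2.0 (= 1/0.5), w_comp = 100,
# w_stall = 140, current memory-boundedness beta = 1.25 (assumed values).
w_opt = coordinate(100.0, 140.0, 1.25, 0.5, 2.0)
```

Swapping in the threshold frequencies of Table II and the corresponding local-optimal predictions reproduces steps C1–C3 with the same helper, which is what keeps the online coordination cheap.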
After $w_i^{opt}$ is obtained, the voltage/frequency is set to $f_i = w_i^{opt}/(t_i^R - s_i)$, where $t_i^R$ is measured at the start of each program


region. When setting the voltage/frequency, we check whether the performance level satisfies the given deadline constraint even if the worst-case execution time occurs after the frequency is set, which is called the feasibility check. More details are explained in [10] and [27].
VIII. Program Phase Detection
A program phase, especially in terms of computational cycles and memory stall time during a PHASE UNIT (as defined in Algorithm 1), is characterized by a salient difference in computational cycles and memory stall time. Conventionally, a program phase is characterized by utilizing only the average execution cycles of basic blocks, without exploiting the runtime distributions of computational cycles and memory stall time [14], [15]. To exploit the runtime distributions in characterizing a program phase, we define a new program phase vector consisting of the five local-optimal workload predictions of each program region. Note that the local-optimal workload predictions reflect the correlation as well as the runtime distributions of both computational cycles and memory stall time. Thus, the set of local-optimal workload predictions is a good indicator of the joint PDF of each program region. The program phase vector of the $k$th program phase is defined as follows:

$$W^{(k)} = \left[Ws^{comp(k)}, Wl^{comp(k)}, Ws^{stall(k)}, Wl^{stall(k)}, Wb^{(k)}\right]^T \qquad (45)$$

where

$$Ws^{comp(k)} = \left[ws_{root}^{comp(k)}, \ldots, ws_i^{comp(k)}, \ldots, ws_{leaf}^{comp(k)}\right] \qquad (46)$$

$$Wl^{comp(k)} = \left[wl_{root}^{comp(k)}, \ldots, wl_i^{comp(k)}, \ldots, wl_{leaf}^{comp(k)}\right] \qquad (47)$$

$$Ws^{stall(k)} = \left[ws_{root}^{stall(k)}, \ldots, ws_i^{stall(k)}, \ldots, ws_{leaf}^{stall(k)}\right] \qquad (48)$$

$$Wl^{stall(k)} = \left[wl_{root}^{stall(k)}, \ldots, wl_i^{stall(k)}, \ldots, wl_{leaf}^{stall(k)}\right] \qquad (49)$$

$$Wb^{(k)} = \left[wb_{root}^{(k)}, \ldots, wb_i^{(k)}, \ldots, wb_{leaf}^{(k)}\right]. \qquad (50)$$

Periodically, i.e., every PHASE UNIT (set to the time for decoding 20 frames in our experiments), we check whether the program phase has changed. This is evaluated by calculating the Hamming distance between the program phase vector of the current period and that of the current program phase. When the Hamming distance is greater than a threshold (set to 10% of the magnitude of the current program phase vector in our experiments), we determine that the program phase has changed, and then check whether there is any previous program phase whose Hamming distance to the program phase vector of the current period is within the same threshold. If so, we reuse the local-optimal workload predictions of the matched previous phase as those of the new phase to set the voltage/frequency. If there is no previous phase satisfying the condition, we store the newly detected program phase and use its local-optimal workload predictions to set the voltage/frequency until the next program phase detection.
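The periodic detection test can be sketched as follows. The paper computes a Hamming distance between phase vectors; a plain Euclidean distance is used here as a stand-in, with the 10%-of-magnitude threshold. All vectors are illustrative:

```python
import math

def detect_phase(current_vec, known_phases, threshold_ratio=0.10):
    """Sketch of the test run every PHASE UNIT: reuse a stored phase whose
    vector lies within threshold_ratio * |current_vec| of the current
    period's phase vector, otherwise register the current vector as a new
    phase. Returns the index of the (matched or new) phase."""
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    mag = math.sqrt(sum(v * v for v in current_vec))
    for idx, phase_vec in enumerate(known_phases):
        if dist(current_vec, phase_vec) <= threshold_ratio * mag:
            return idx                      # reuse matched phase's predictions
    known_phases.append(list(current_vec))  # register newly detected phase
    return len(known_phases) - 1

# Two stored phases; a near-repeat matches, a distant vector opens a new phase.
phase_table = [[10.0, 10.0], [20.0, 20.0]]
matched = detect_phase([10.1, 10.0], phase_table)
```

On a match, the voltage/frequency settings of the stored phase are reused directly, so no re-prediction is needed until the distribution visibly shifts again.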
IX. Experimental Results
A. Setup
In our experiments, we used two real-life multimedia programs, the MPEG4 and H.264 decoders in FFMPEG [29]. We applied two picture sets for the decoding. First, we used, in total, 4200 frames of 1920×1080 video clips consisting of eight test pictures: Rush Hour (500 frames), Station2 (300 frames), Sunflower (500 frames), Tractor (690 frames), SnowMnt (570 frames), InToTree (500 frames), ControlledBurn (570 frames), and TouchdownPass (500 frames) in [30]. Second, we used 3000 frames of a 1920×800 movie clip (excerpted from Dark Knight). We inserted nine voltage/frequency setting points in each program: seven for macroblock decoding and two for the file write operation of the decoded image. We performed profiling with PAPI [31] running on an LG XNOTE with Linux 2.6.3.

We performed experiments at 25 °C, 50 °C, 75 °C, and 100 °C. We calculated the energy consumption using the processor energy model with combined $V_{dd}/V_{bb}$ scaling shown in Section III-A. The parameters in (1)–(4) of the processor energy model were obtained from PTscalar [21] and Cacti 5.3 with the BPTM high-k/metal gate 32 nm HP model. We used seven discrete frequency levels from 333 MHz to 2.333 GHz with a 333 MHz step size. We set 20 μs as the time overhead for switching voltage/frequency levels and calculated the energy overhead using the model presented in [7].

We compared the following four methods.
1) RT-CM-AVG [4]: runtime DVFS method based on the average ratio of memory stall time to computational cycles (baseline).
2) RT-C-DIST [19]: runtime DVFS method which exploits only the PDF of computational cycles.
3) DT-CM-DIST [28]: design-time DVFS method which exploits the joint PDF of computational cycles and memory stall time.
4) RT-CM-DIST: runtime version of DT-CM-DIST (proposed).

We modified the original RT-CM-AVG [4], which runs intertask DVFS without a real-time constraint, so that it supports intratask DVFS with a real-time constraint. In running DT-CM-DIST [28], we performed a workload prediction with respect to 20 quantized levels of the remaining time, i.e., bins, using the joint PDF of the first 100 frames at design time.
B. Energy Savings
Table III(a) and (b) shows the comparison of energy consumption for the MPEG4 and H.264 decoders, respectively, at 75 °C. The first column shows the names of the test pictures. Columns 2, 3, and 4 represent the energy consumption of each DVFS method normalized with respect to that of RT-CM-AVG. Compared with RT-CM-AVG [4], our method, RT-CM-DIST, offers 5.1–34.6% and 4.5–17.3% energy savings for the MPEG4 and H.264 decoders, respectively. Fig. 9 shows the statistics of the frequency levels used when running SnowMnt in the MPEG4 decoder. As Fig. 9 shows, RT-CM-AVG uses the lowest frequency level, i.e., 333 MHz, more frequently than the other two methods. This also leads to the frequent use of high frequency levels, i.e., levels above 2.00 GHz, where energy consumption drastically increases as frequency rises, in order to meet the real-time constraint. However, by considering the runtime distribution, RT-CM-DIST uses the high frequency levels incurring high energy overhead less frequently, because the distribution-aware workload prediction is more conservative than the average-based method.

TABLE III
Comparison of Energy Consumption for Test Pictures at 75 °C: (a) MPEG4 (20 Frames/s) and (b) H.264 Decoder (12 Frames/s)

(a)
Image | RT-C-DIST [19] | DT-CM-DIST [28] | RT-CM-DIST (Proposed)
Rush Hour | 1.08 | 0.83 | 0.79
Station2 | 1.34 | 0.97 | 0.95
Sunflower | 0.99 | 0.76 | 0.74
Tractor | 1.01 | 0.78 | 0.75
SnowMnt | 1.02 | 0.90 | 0.81
InToTree | 0.97 | 0.79 | 0.71
ControlledBurn | 0.88 | 0.67 | 0.65
TouchdownPass | 1.15 | 0.91 | 0.86
Average | 1.05 | 0.83 | 0.78

(b)
Image | RT-C-DIST [19] | DT-CM-DIST [28] | RT-CM-DIST (Proposed)
Rush Hour | 1.11 | 0.94 | 0.93
Station2 | 1.05 | 0.90 | 0.83
Sunflower | 1.09 | 0.97 | 0.88
Tractor | 1.18 | 1.00 | 0.96
SnowMnt | 1.03 | 1.03 | 0.84
InToTree | 1.14 | 1.00 | 0.93
ControlledBurn | 1.07 | 0.94 | 0.88
TouchdownPass | 1.10 | 0.99 | 0.94
Average | 1.10 | 0.97 | 0.90

Fig. 9. Statistics of used frequency levels in MPEG4 for decoding SnowMnt.
Table IV shows the energy savings for one of the test pictures, SnowMnt, at four temperatures: 25 °C, 50 °C, 75 °C, and 100 °C. As the table shows, more energy savings are achieved as temperature increases. This is because the energy penalty caused by the frequent use of high frequency levels becomes more pronounced as temperature increases, since leakage energy consumption grows exponentially with temperature. By considering the temperature dependency of leakage energy consumption, RT-CM-DIST sets the voltage/frequency so as to use high frequency levels less frequently as temperature increases, while RT-CM-AVG does not consider temperature at all. Note that, in most cases, the MPEG4 decoder gives more energy savings than the H.264 case. This is because, as Fig. 10 shows, the memory boundedness (defined as the ratio of memory stall time to computational cycles) of MPEG4 has a wider distribution than that of H.264 in terms of the Max/Avg and Max/Min ratios.

Compared with RT-C-DIST [19], which exploits only the distribution of computational cycles at runtime, RT-CM-DIST provides 20.8–28.9% and 15.1–21.0% further energy savings for the MPEG4 and H.264 decoders, respectively. The amount of further energy savings represents the effectiveness of considering the distribution of memory stall time as well as the correlation between computational cycles and memory stall time, i.e., their joint PDF. RT-C-DIST regards the whole number of clock cycles, which is profiled at the end of every program region, as the computational cycle count. Thus, RT-C-DIST cannot consider the joint PDF of computational cycles and memory stall time. As a consequence, it sets frequency levels higher than the required levels, as shown in Fig. 9.

Fig. 10. Distribution of memory boundedness in (a) MPEG4 and (b) H.264 decoder.

TABLE IV
Comparison of Energy Consumption for SnowMnt at Four Temperature Levels

| Temp (°C) | RT-C-DIST [19] | DT-CM-DIST [28] | RT-CM-DIST (Proposed)
MPEG4 dec. | 25 | 1.15 | 0.94 | 0.86
| 50 | 1.10 | 0.92 | 0.84
| 75 | 1.02 | 0.90 | 0.81
| 100 | 0.96 | 0.89 | 0.79
H.264 dec. | 25 | 1.06 | 1.02 | 0.91
| 50 | 1.05 | 1.02 | 0.88
| 75 | 1.03 | 1.03 | 0.84
| 100 | 1.01 | 1.03 | 0.80

In Table III, compared with DT-CM-DIST, which exploits the runtime distributions of both the computational and memory stall workload at design time, RT-CM-DIST provides 2.1–10.2% and 1.2–18.1% further energy savings for the MPEG4 and H.264


TABLE V
Comparison of Energy Savings for DarkKnight at 75 °C

| RT-C-DIST [19] | DT-CM-DIST [28] | RT-CM-DIST (Proposed)
MPEG4 dec. | 1.26 | 1.20 | 0.89
H.264 dec. | 1.16 | 1.16 | 0.89

TABLE VI
Summary of Runtime Overhead

Source of Runtime Overhead | Amount
Local-optimal workload prediction | 40 400–52 400 cycles
Coordination | 2720–4780 cycles
Feasibility check | 407–3560 cycles

decoders, respectively. The largest energy savings are obtained on SnowMnt for both the MPEG4 and H.264 decoders, which has distinctive program phase behavior. Since DT-CM-DIST finds the optimal workload using the first 100 frames (a design-time fixed training input), which is totally different from the remaining frames (runtime-varying input), it cannot provide proper voltage and frequency settings.

To further investigate the effectiveness of considering complex program phase behavior, we performed another experiment using the 3000 frames of the movie clip. Program phase behavior is observed more obviously in the movie clip, whose scenes move fast. Table V shows the normalized energy consumption at 75 °C when decoding the movie clip from Dark Knight with the MPEG4 and H.264 decoders, respectively. RT-CM-DIST outperforms DT-CM-DIST by up to 26.3% and 23.3% for the MPEG4 and H.264 decoders, respectively. This is because the movie clip exhibits complex program phase behavior due to frequent scene changes, as Fig. 1(d) shows.
C. Overhead
1) Runtime Overhead: We measured the runtime overhead of the proposed online method, i.e., RT-CM-DIST, using PAPI [31]. The proposed method consists of three parts: local-optimal workload prediction, coordination, and feasibility check. Table VI shows the runtime overhead of each part. The local-optimal workload prediction of a program region consumes 40,400–52,400 clock cycles when PHASE UNIT is set to 20 frames. Note that the local-optimal workload prediction is performed once every PHASE UNIT. The coordination and feasibility check, which are performed at every start of a program region, take 2720–4780 and 407–3560 clock cycles, respectively. The total runtime overhead in Table VI amounts to 0.38% and 0.25% of the average execution cycles for the MPEG4 and H.264 decoders, respectively.
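The amortized per-frame cost implied by Table VI can be sketched in a few lines. The worst-case cycle counts come from the table; the assumption that coordination and feasibility check run once per frame is ours, for illustration only.

```python
# Sketch of the amortized per-frame runtime overhead implied by Table VI.
# Worst-case cycle counts are taken from the table; the number of program
# regions executed per frame is a hypothetical assumption, not from the paper.
PHASE_UNIT_FRAMES = 20      # local-optimal prediction runs once per PHASE UNIT
REGIONS_PER_FRAME = 1       # assumed region count per frame (illustrative)

prediction_cycles   = 52_400  # worst case, once per PHASE UNIT (20 frames)
coordination_cycles = 4_780   # worst case, at every program-region start
feasibility_cycles  = 3_560   # worst case, at every program-region start

# Prediction cost is amortized over PHASE UNIT; the per-region costs recur.
per_frame_overhead = (prediction_cycles / PHASE_UNIT_FRAMES
                      + REGIONS_PER_FRAME
                      * (coordination_cycles + feasibility_cycles))

# Under these assumptions, the reported 0.38% total overhead would imply
# an average frame budget of roughly per_frame_overhead / 0.0038 cycles.
implied_frame_cycles = per_frame_overhead / 0.0038
```

This is only a consistency sketch; the paper's 0.38% and 0.25% figures were measured directly with PAPI, not derived this way.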
2) Memory Overhead of LUTs: As explained in Section VI-B, the presented method requires three temperature-independent LUTs, i.e., LUTscomp, LUTsstall, and LUTb, and two temperature-dependent LUTs, i.e., LUTlcomp and LUTlstall. The LUTs incur a memory overhead that largely depends on the number of steps (scales) in the indexes of the LUTs: the more steps are used, the more accurate the workload prediction becomes, at the cost of a higher memory area overhead. In our implementation, we built each LUT with the ratio of standard deviation to mean (Index 1) ranging over 0.05–0.30 with a 0.05 step size, the skewness (Index 2) ranging over −1.00–1.00 with a 0.10 step size, and the ratio of mean to the effective remaining workload of the remaining program regions (Index 3) ranging over 0.10–1.00 with a 0.10 step size. Therefore, 1140 (= 19 × 6 × 10) entries are required for each LUT, where 8 bits are assigned to each entry. Thus, about 1 kB of memory space is required per LUT. The area overhead can be further reduced by trimming and compressing entries. The temperature-dependent LUTs are built for the four temperatures used in our experiment, i.e., 25, 50, 75, and 100 °C. The total area overhead amounts to 11 kB [= (3 × 1 kB) + 4 × (2 × 1 kB)].
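The LUT sizing arithmetic above can be reproduced directly; the step counts mirror the figures quoted in the text (the paper reports 19 skewness steps), and the 8-bit entry width and table counts are also taken from the text.

```python
# Reproduces the LUT memory-overhead arithmetic from the text.
INDEX1_STEPS = 6     # std-dev/mean ratio: 0.05-0.30 in 0.05 steps
INDEX2_STEPS = 19    # skewness step count as reported in the paper
INDEX3_STEPS = 10    # mean / effective remaining workload: 0.10-1.00 in 0.10 steps

entries_per_lut = INDEX2_STEPS * INDEX1_STEPS * INDEX3_STEPS  # 1140 entries
bytes_per_lut = entries_per_lut * 1                           # 8 bits per entry, ~1 kB

TEMP_INDEPENDENT_LUTS = 3   # LUTscomp, LUTsstall, LUTb (shared across temperatures)
TEMP_DEPENDENT_LUTS = 2     # LUTlcomp, LUTlstall (replicated per temperature)
TEMPERATURES = 4            # 25, 50, 75, 100 degrees C

# At ~1 kB per LUT: 3 shared tables plus 2 tables for each of 4 temperatures.
total_kb = TEMP_INDEPENDENT_LUTS + TEMPERATURES * TEMP_DEPENDENT_LUTS  # 11 kB
```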

X. Conclusion
In this paper, we presented a novel online DVFS method that exploits the distributions of both computational workload and memory stall workload during program runs under combined Vdd/Vbb scaling. To reduce the complexity of our previous design-time solution [28], we presented a DVFS method consisting of two steps: local-optimal workload prediction and coordination. In the local-optimal workload prediction step, we periodically calculated five local-optimal workload predictions, each of which minimized a single energy component under the joint PDF of computational cycle and memory stall time profiled at runtime. To further reduce the runtime overhead, we prepared tables pre-characterized at design time based on the analytical formulation and used them at runtime to find the local-optimal workloads. In the coordination step, we found the global workload prediction by coordinating the five local-optimal workload predictions. Experimental results show that the proposed method offers up to 34.6% and 17.3% energy savings for the MPEG4 and H.264 decoders, respectively, compared with the existing method [4].

References
[1] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, "Memory access scheduling," in Proc. ISCA, 2000, pp. 128–138.
[2] K. Govil, E. Chan, and H. Wasserman, "Comparing algorithms for dynamic speed-setting of a low-power CPU," in Proc. MOBICOM, 1995, pp. 13–25.
[3] Y. Gu and S. Chakraborty, "Control theory-based DVS for interactive 3-D games," in Proc. DAC, 2008, pp. 740–745.
[4] K. Choi, R. Soma, and M. Pedram, "Fine-grained dynamic voltage and frequency scaling for precise energy and performance tradeoff based on the ratio of off-chip access to on-chip computation times," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 1, pp. 18–28, Jan. 2005.
[5] W.-Y. Liang, S.-C. Chen, Y.-L. Chang, and J.-P. Fang, "Memory-aware dynamic voltage and frequency prediction for portable devices," in Proc. RTCSA, 2008, pp. 229–236.
[6] G. Dhiman and T. S. Rosing, "System-level power management using online learning," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no. 5, pp. 676–689, May 2009.
[7] A. Azevedo, I. Issenin, R. Cornea, R. Gupta, N. Dutt, A. Veidenbaum, and A. Nicolau, "Profile-based dynamic voltage scheduling using program checkpoints," in Proc. DATE, 2002, pp. 168–175.
[8] D. Shin and J. Kim, "Optimizing intra-task voltage scheduling using data flow analysis," in Proc. ASPDAC, 2005, pp. 703–708.
[9] J. Seo, T. Kim, and J. Lee, "Optimal intratask dynamic voltage-scaling technique and its practical extensions," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 25, no. 1, pp. 47–57, Jan. 2006.
[10] S. Hong, S. Yoo, H. Jin, K.-M. Choi, J.-T. Kong, and S.-K. Eo, "Runtime distribution-aware dynamic voltage scaling," in Proc. ICCAD, 2006, pp. 587–594.
[11] S. Hong, S. Yoo, B. Bin, K.-M. Choi, S.-K. Eo, and T. Kim, "Dynamic voltage scaling of supply and body bias exploiting software runtime distribution," in Proc. DATE, 2008, pp. 242–247.
[12] J. R. Lorch and A. J. Smith, "Improving dynamic voltage scaling algorithms with PACE," ACM SIGMETRICS Perform. Eval. Rev., vol. 29, no. 1, pp. 50–61, Jun. 2001.
[13] C. Xian and Y.-H. Lu, "Dynamic voltage scaling for multitasking real-time systems with uncertain execution time," in Proc. GLSVLSI, 2006, pp. 392–397.
[14] T. Sherwood, E. Perelman, G. Hamerly, S. Sair, and B. Calder, "Discovering and exploiting program phases," IEEE Micro, vol. 23, no. 6, pp. 84–93, Nov. 2003.
[15] T. Sherwood, S. Sair, and B. Calder, "Phase tracking and prediction," in Proc. ISCA, 2003, pp. 336–347.
[16] Q. Wu, M. Martonosi, D. W. Clark, V. J. Reddi, D. Connors, Y. Wu, J. Lee, and D. Brooks, "A dynamic compilation framework for controlling microprocessor energy and performance," in Proc. MICRO, 2005, pp. 271–282.
[17] C. Isci, G. Contreras, and M. Martonosi, "Live, runtime phase monitoring and prediction on real systems with application to dynamic power management," in Proc. MICRO, 2006, pp. 359–370.
[18] S.-Y. Bang, K. Bang, S. Yoon, and E.-Y. Chung, "Run-time adaptive workload estimation for dynamic voltage scaling," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no. 9, pp. 1334–1347, Sep. 2009.
[19] J. Kim, S. Yoo, and C.-M. Kyung, "Program phase and runtime distribution-aware online DVFS for combined Vdd/Vbb scaling," in Proc. DATE, 2009, pp. 417–422.
[20] T. Mudge, K. Flautner, D. Blaauw, and S. M. Martin, "Combined dynamic voltage scaling and adaptive body biasing for lower power microprocessors under dynamic workloads," in Proc. ICCAD, 2002, pp. 721–725.
[21] W. Liao, L. He, and K. M. Lepak, "Temperature and supply voltage aware performance and power modeling at microarchitecture level," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 7, pp. 1042–1053, Jul. 2005.
[22] Cacti 5.3 [Online]. Available: http://www.hpl.hp.com/research/cacti
[23] BPTM High-k/Metal Gate 32 nm High Performance Model [Online]. Available: http://www.eas.asu.edu/ptm
[24] K. Puttaswamy and G. H. Loh, "Thermal herding: Microarchitecture techniques for controlling hotspots in high-performance 3-D-integrated processors," in Proc. HPCA, 2007, pp. 193–204.
[25] N. Kavvadias, P. Neofotistos, S. Nikolaidis, C. A. Kosmatopoulos, and T. Laopoulos, "Measurement analysis of the software-related power consumption in microprocessors," IEEE Trans. Instrum. Meas., vol. 53, no. 4, pp. 1106–1112, Aug. 2004.
[26] S. Oh, J. Kim, S. Kim, and C.-M. Kyung, "Task partitioning algorithm for intra-task dynamic voltage scaling," in Proc. ISCAS, 2008, pp. 1228–1231.
[27] J. Kim, S. Oh, S. Yoo, and C.-M. Kyung, "An analytical dynamic scaling of supply voltage and body bias based on parallelism-aware workload and runtime distribution," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no. 4, pp. 568–581, Apr. 2009.
[28] J. Kim, Y. Lee, S. Yoo, and C.-M. Kyung, "An analytical dynamic scaling of supply voltage and body bias exploiting memory stall time variation," in Proc. ASPDAC, 2010, pp. 575–580.
[29] FFMPEG [Online]. Available: http://www.ffmpeg.org
[30] VQEG [Online]. Available: ftp://vqeg.its.bldrdoc.gov
[31] PAPI [Online]. Available: http://icl.cs.utk.edu/papi

Jungsoo Kim (S'06) received the B.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2005, and the M.S. and Ph.D. degrees through the unified course of the Department of Electrical Engineering and Computer Science, KAIST, in 2010.
Since 2010, he has been in a post-doctoral position with KAIST. His current research interests include dynamic power and thermal management, multiprocessor system-on-a-chip design, and low-power wireless surveillance system design.


Sungjoo Yoo (M'00) received the B.S., M.S., and Ph.D. degrees in electronics engineering from Seoul National University, Seoul, South Korea, in 1992, 1995, and 2000, respectively.
He was a Researcher with the TIMA Laboratory, Grenoble, France, from 2000 to 2004, and a Senior and Principal Engineer with Samsung Electronics, Seoul, from 2004 to 2008. Since 2008, he has been with the Pohang University of Science and Technology, Pohang, South Korea. His current research interests include dynamic power and thermal management, on-chip networks, multithreaded software and architecture, and fault tolerance of solid-state disks.

Chong-Min Kyung (S'06–M'81–SM'99–F'08) received the B.S. degree in electronics engineering from Seoul National University, Seoul, South Korea, in 1975, and the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 1977 and 1981, respectively.
From April 1981 to January 1983, he was with Bell Telephone Laboratories, Murray Hill, NJ, in a post-doctoral position. Since joining KAIST in 1983, he has been working on system-on-a-chip design and verification methodology, and on processor and graphics architectures for high-speed and/or low-power applications, including mobile video codecs. He was a Visiting Professor with the University of Karlsruhe, Karlsruhe, Germany, in 1989, as an Alexander von Humboldt Fellow; a Visiting Professor with the University of Tokyo, Tokyo, Japan, from January 1985 to February 1985; with the Technical University of Munich, Munich, Germany, from July 1994 to August 1994; with Waseda University, Tokyo, from 2002 to 2005; with the University of Auckland, Auckland, New Zealand, from February 2004 to February 2005; and with Chuo University, Tokyo, from July 2005 to August 2005.
Dr. Kyung is the Director of the Integrated Circuit Design Education Center, Daejeon, established in 1995 to promote integrated circuit (IC) design education in Korean universities through computer-aided design environment setup and chip fabrication services. He is also the Director of the SoC Initiative for Ubiquity and Mobility Research Center, established to promote academia/industry collaboration in SoC design-related areas. From 1993 to 1994, he served as an Asian Representative on the International Conference on Computer-Aided Design Executive Committee. He received the Most Excellent Design Award and the Special Feature Award from the University Design Contest at ASP-DAC 1997 and 1998, respectively. He received Best Paper Awards at the 36th DAC, New Orleans, LA, the 10th International Conference on Signal Processing Application and Technology, Orlando, FL, in September 1999, and the 1999 International Conference on Computer Design, Austin, TX. He was the General Chair of the Asian Solid-State Circuits Conference 2007 and of ASP-DAC 2008. In 2000, he received the National Medal from the Korean Government for his contributions to research and education in IC design. He is a member of the National Academy of Engineering Korea and the Korean Academy of Science and Technology, and a Hynix Chair Professor with KAIST.