IEEE TRANSACTION ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 1, JANUARY 2011
I. Introduction
DYNAMIC voltage and frequency scaling (DVFS) is one of the most effective methods for lowering energy consumption. DVFS is used to suppress the leakage energy by a dynamic control of supply voltage (Vdd) and body bias voltage (Vbb). Accurate prediction of remaining workload (hereafter, workload prediction) plays a central role in DVFS where the
Manuscript received March 15, 2010; accepted July 27, 2010. Date of
current version December 17, 2010. This work was supported in part by
the National Research Foundation of Korea Grant funded by the Korean
Government, under Grant 2010-0000823, and the Brain Korea 21 Project,
the School of Information Technology, Korea Advanced Institute of Science
and Technology in 2010. This paper was recommended by Associate Editor
H.-H. S. Lee.
J. Kim and C.-M. Kyung are with the Korea Advanced Institute of
Science and Technology, Daejeon 305-701, South Korea (e-mail: jungsoo.kim83@gmail.com; kyung@ee.kaist.ac.kr).
S. Yoo is with the Pohang University of Science and Technology, Pohang
790-784, South Korea (e-mail: sungjoo.yoo@postech.ac.kr).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCAD.2010.2068630
0278-0070/$26.00 © 2010 IEEE
KIM et al.: PROGRAM PHASE-AWARE DYNAMIC VOLTAGE SCALING UNDER VARIABLE COMPUTATIONAL WORKLOAD
excerpted from Dark Knight. The x-axis and the left-hand side
y-axis represent frame index and per-frame decoding cycles,
respectively. The right-hand side y-axis represents program
phase index. Note that the program phase index does not correspond to the required performance level of the corresponding
program phase in this example. As shown in Fig. 1(d), the
entire time for decoding 1000 frames is classified into nine
program phases, and, within a program phase, the per-frame decoding cycle count follows a runtime distribution. Fig. 1(e) shows the runtime distributions of three representative program phases out of the nine, illustrating that the runtime distribution can be wide even within a single program phase.
A. Our Approach
Our observation of the runtime characteristics of software programs suggests that, as shown in Fig. 1, program workload has two characteristics: nonstationary program phase behavior and a runtime distribution (even within a program phase) of computational workload and memory stall time. Based on these observations, this paper presents an online DVFS method that addresses both characteristics in order to minimize the average energy consumption of a software program. We address the online DVFS problem in two ways: intraphase workload prediction and program phase detection.
The intraphase workload prediction predicts workloads based
on the runtime distribution of computational workload and
memory stall time in the current program phase. The program
phase detection identifies to which program phase the current
instant belongs and then obtains the intraphase workload
prediction of the corresponding program phase, which is used
to set voltage and frequency during the program phase.
Leakage power consumption often dominates total power consumption, especially at high temperature. Our method tackles leakage power consumption with a temperature-aware combined Vdd/Vbb scaling. During runtime, based on temperature readings as well as the runtime distribution, the online method selects the appropriate pair of Vdd and Vbb for the chosen frequency level from the solution table (which is prepared during design time).
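The runtime selection step can be pictured as a two-key table lookup. The sketch below is only an illustration under stated assumptions: the table contents, the units, the names `SOLUTION_TABLE` and `select_setting`, and the rule of rounding temperature and frequency up to the next tabulated entry are all hypothetical and are not taken from the paper.

```python
from bisect import bisect_left

# Hypothetical design-time solution table (placeholder numbers, not from the paper):
# temperature in deg C -> rows of (frequency_GHz, Vdd_V, Vbb_V), sorted by frequency.
SOLUTION_TABLE = {
    25: [(1.0, 0.70, -0.40), (2.0, 0.90, -0.20), (3.0, 1.10, 0.00)],
    75: [(1.0, 0.72, -0.50), (2.0, 0.92, -0.30), (3.0, 1.12, -0.05)],
}


def select_setting(temp_c, freq_ghz, table=SOLUTION_TABLE):
    """Pick the (f, Vdd, Vbb) row for a temperature reading and a target frequency.

    Assumed policy: use the smallest tabulated temperature >= the reading and the
    smallest tabulated frequency >= the target, clamping at the table edges.
    """
    temps = sorted(table)
    t_idx = min(bisect_left(temps, temp_c), len(temps) - 1)
    rows = table[temps[t_idx]]
    freqs = [row[0] for row in rows]
    f_idx = min(bisect_left(freqs, freq_ghz), len(rows) - 1)
    return rows[f_idx]


if __name__ == "__main__":
    print(select_setting(60, 1.5))
```

Rounding up rather than to the nearest entry is a design choice made here so that the selected frequency never falls below the requested one; the paper's actual policy may differ.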
This paper is organized as follows. Section II reviews related
works. Section III gives preliminaries on our energy model and
profiling method. Section IV presents the problem definition
and solution overview, followed by analytical formulation of
our problem in Section V. Sections VI and VII explain the proposed runtime distribution-aware DVFS. Section VIII presents
the program phase detection method. Section IX reports experimental results followed by the conclusion in Section X.
II. Related Works
There are a number of methods for workload prediction in online DVFS, based on the (weighted) average, maximum, or most frequent workload, or on finding a repeated pattern among the N most recent workloads [2]. Recently, a control theory-based workload prediction method was proposed to accurately capture the transient behavior of workload [3]. To exploit memory stall time, [4] and [5] present memory stall time-aware DVFS for soft real-time intertask DVFS which lowers
$$C_{eff} V_{dd}^{2} + N_g f^{-1} \left( V_{dd} K_1 \exp(K_2 V_{dd}) \exp(K_3 V_{bb}) + V_{dd} K_4 \exp(K_5 V_{dd}) + |V_{bb}| I_j \right) \tag{1}$$
where $C_{eff}$ and $N_g$ are the effective capacitance and effective number of gates of the target processor, respectively. $K_1$, $K_2$, $K_3$, $K_4$, $K_5$, and $I_j$ are process-dependent curve-fitting parameter sets for $e_l^{sub}$, $e_l^{gate}$, and $e_l^{junc}$, respectively. In particular, the values of $K_1$, $K_2$, and $K_3$ are functions of the operating temperature ($T$), since $e_l^{sub}$ increases exponentially as the operating temperature increases. According to the BSIM4 model and [21], the temperature dependence of the parameters ($K_1$, $K_2$, and $K_3$) is modeled as follows:
$$K_1(T) \approx \left(\frac{T}{T_{ref}}\right)^{2} \exp\left(\frac{K_6}{T_{ref}}\left(1 - \frac{T_{ref}}{T}\right)\right) K_1(T_{ref}) \tag{2}$$
$$K_2(T) \approx \frac{T_{ref}}{T} K_2(T_{ref}) \tag{3}$$
$$K_3(T) \approx \frac{T_{ref}}{T} K_3(T_{ref}) \tag{4}$$
where $T_{ref}$ is the reference temperature and $K_6$ is a curve-fitting parameter. Thus, $K_1$, $K_2$, and $K_3$ at temperature $T$ can be obtained from the values at $T_{ref}$ using the relationship in (2)–(4).
Since the temperature-aware energy model shown in (1)–(4) is too complicated to be used in our optimization, we adopted a simplified energy model of combined Vdd/Vbb scaling to approximate the energy consumption per cycle at each temperature $T$ as follows:
$$e(f, T) \approx a_s(T) f^{b_s(T)} + a_l(T) f^{b_l(T)} + c(T) \tag{5}$$
TABLE I
Energy Fitting Parameters for Approximating the Processor Energy Consumption to the Accurate Estimation Obtained from PTscalar with BPTM High-k/Metal Gate 32 nm HP Model for Different Temperatures, Along with the Corresponding (Maximal and Average) Errors

Temperature |                Fitting Parameters                 | Maximum [Avg]
(°C)        | a_s        | b_s | a_l        | b_l  | c    | Error (%)
25          | 1.2×10^-1  | 1.3 | 4.6×10^-9  | 20.5 | 0.11 | 2.8 [0.9]
50          | 1.2×10^-1  | 1.3 | 2.0×10^-7  | 16.6 | 0.12 | 1.4 [0.5]
75          | 1.2×10^-1  | 1.3 | 2.2×10^-6  | 14.2 | 0.14 | 1.4 [0.4]
100         | 1.2×10^-1  | 1.3 | 1.4×10^-5  | 12.4 | 0.15 | 1.7 [0.7]
where $a_s(T)$, $b_s(T)$ and $a_l(T)$, $b_l(T)$ are sets of curve-fitting parameters which model the frequency-dependent portion in $e_s(T)$ and $e_l(T)$, respectively. $c(T)$ is a curve-fitting parameter corresponding to the amount of frequency-independent energy portion in $e(f, T)$. Table I shows examples of fitting parameters which approximate the processor energy consumption obtained from PTscalar [21] and Cacti5.3 [22] with the energy model, i.e., (1)–(4), for the Berkeley predictive technology model (BPTM) high-k/metal gate 32 nm HP model [23] at 25 °C, 50 °C, 75 °C, and 100 °C. In the modeling, we configured a target processor in PTscalar as the best-effort estimate of a Core 2-class microarchitecture using the parameters presented in [24]. As Table I shows, the simplified energy model tracks the original energy model within 2.8% of maximum error for all the operating temperatures. Note that the fitting parameters for modeling switching energy consumption, i.e., $a_s$ and $b_s$, are unchanged as temperature varies because switching energy consumption is temperature invariant.
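As an illustration of how the simplified model in (5) can be evaluated, the following sketch encodes the Table I parameters and returns the three terms separately. It rests on assumptions that should be checked against the original table: the $a_s$ and $a_l$ entries are read with negative powers of ten, the frequency unit is whatever unit the fit used (not stated in this excerpt), and the names `TABLE_I` and `energy_per_cycle` are hypothetical.

```python
# Table I parameters keyed by temperature in deg C: (a_s, b_s, a_l, b_l, c).
# Assumption: a_s and a_l carry negative exponents (e.g. 1.2e-1, 4.6e-9).
TABLE_I = {
    25: (1.2e-1, 1.3, 4.6e-9, 20.5, 0.11),
    50: (1.2e-1, 1.3, 2.0e-7, 16.6, 0.12),
    75: (1.2e-1, 1.3, 2.2e-6, 14.2, 0.14),
    100: (1.2e-1, 1.3, 1.4e-5, 12.4, 0.15),
}


def energy_per_cycle(f, temp_c):
    """Return (switching, leakage, constant) terms of e(f, T) as in (5)."""
    a_s, b_s, a_l, b_l, c = TABLE_I[temp_c]
    switching = a_s * f ** b_s   # temperature-invariant term
    leakage = a_l * f ** b_l     # frequency-dependent leakage term
    return switching, leakage, c


if __name__ == "__main__":
    for temp in sorted(TABLE_I):
        print(temp, sum(energy_per_cycle(1.0, temp)))
```

Returning the terms separately makes it easy to see that only the leakage and constant terms change with temperature.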
Processor energy consumption depends on the type of
instructions executed in the pipeline path [25]. To simply
consider the energy dependence on instructions, we classify
processor operation into two states: computational state for
executing instructions and memory stall state mostly spent for
waiting for data from memory. When a processor is in the
memory stall state, switching energy consumption can be suppressed using clock gating while leakage energy consumption
is almost the same as the computational state. The reduction
ratio of switching energy, called the clock gating fraction and denoting the fraction of the clock-gated circuit, is modeled as $\beta$ (0.1 in our experiments). Thus, energy consumption per clock cycle in each processor state can be calculated as follows:
$$e^{comp} = a_s f^{b_s} + a_l f^{b_l} + c \tag{6}$$
$$e^{stall} = \beta a_s f^{b_s} + a_l f^{b_l} + c \tag{7}$$
where $e^{comp}$ and $e^{stall}$ represent energy consumption per cycle in the computational and memory stall state, respectively. Given a desired frequency level ($f$), one can always find a pair of Vdd and Vbb that gives minimum energy consumption per cycle using the combined Vdd/Vbb scaling [11].
B. Runtime Workload Profiling
The total number of processor execution cycles, x, can be
expressed as a sum of the number of clock cycles for executing
instructions in a processor, xcomp , and that of stall cycles for
(8)
miss
+ bp
(9)
Fig. 4. Solution inputs. (a) Software program (or source code) partitioned into program regions. (b) Energy model (in terms of energy-per-cycle) as a function of frequency. (c) f-v table storing the energy-optimal pairs, (Vdd, Vbb), for N frequency levels.
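$$E_i = E_i^{comp} + E_i^{stall} \tag{10}$$
$$E_i = (E_{s_i}^{comp} + E_{l_i}^{comp}) + (E_{s_i}^{stall} + E_{l_i}^{stall}) + E_{b_i} \tag{11}$$
Using (6)–(8), the five energy components in (11) are expressed as follows:
$$E_{s_i}^{comp} = a_s f_i^{b_s} x_i^{comp} + a_s f_{i+1}^{b_s} x_{i+1}^{comp} \tag{12}$$
$$E_{l_i}^{comp} = a_l f_i^{b_l} x_i^{comp} + a_l f_{i+1}^{b_l} x_{i+1}^{comp} \tag{13}$$
$$E_{s_i}^{stall} = \beta \left( a_s f_i^{b_s} f_i t_i^{stall} + a_s f_{i+1}^{b_s} f_{i+1} t_{i+1}^{stall} \right) \tag{14}$$
$$E_{l_i}^{stall} = a_l f_i^{b_l} f_i t_i^{stall} + a_l f_{i+1}^{b_l} f_{i+1} t_{i+1}^{stall} \tag{15}$$
$$E_{b_i} = c \left( x_i^{comp} + x_{i+1}^{comp} + f_i t_i^{stall} + f_{i+1} t_{i+1}^{stall} \right). \tag{16}$$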
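The structure of (12)–(16) is per-cycle energy multiplied by a cycle count, where the stall cycle count of a region is its frequency times its stall time. A minimal sketch for two consecutive regions follows; the function name, argument names, and the symbol `beta` for the clock gating fraction are assumptions made for illustration, and the numbers in the usage line are arbitrary.

```python
def region_pair_energy(f_i, f_next, x_i, x_next, t_stall_i, t_stall_next,
                       a_s, b_s, a_l, b_l, c, beta):
    """Five energy components for regions n_i and n_{i+1}, following (12)-(16)."""
    # Stall cycles = frequency * stall time.
    stall_cyc_i = f_i * t_stall_i
    stall_cyc_next = f_next * t_stall_next
    return {
        # Switching and leakage energy while executing instructions.
        "Es_comp": a_s * f_i ** b_s * x_i + a_s * f_next ** b_s * x_next,
        "El_comp": a_l * f_i ** b_l * x_i + a_l * f_next ** b_l * x_next,
        # Switching energy during stalls is scaled by the clock gating fraction.
        "Es_stall": beta * (a_s * f_i ** b_s * stall_cyc_i
                            + a_s * f_next ** b_s * stall_cyc_next),
        "El_stall": a_l * f_i ** b_l * stall_cyc_i + a_l * f_next ** b_l * stall_cyc_next,
        # Frequency-independent energy over all cycles.
        "Eb": c * (x_i + x_next + stall_cyc_i + stall_cyc_next),
    }


if __name__ == "__main__":
    parts = region_pair_energy(1.0, 1.0, 100, 200, 10, 20,
                               a_s=0.5, b_s=2.0, a_l=0.25, b_l=3.0, c=0.1, beta=0.1)
    print(parts, sum(parts.values()))
```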
Frequency of each program region, $f_i$ and $f_{i+1}$, can be expressed as the ratio of the remaining computational workload prediction ($w_i$ and $w_{i+1}$) to the remaining time-to-deadline prediction for running the computational workload, i.e., total remaining time-to-deadline ($t_i^R$ and $t_{i+1}^R$) minus remaining memory stall time prediction ($s_i$ and $s_{i+1}$), as shown in
$$f_i = \frac{w_i}{t_i^R - s_i} \tag{17}$$
$$f_{i+1} = \frac{w_{i+1}}{t_{i+1}^R - s_{i+1}}. \tag{18}$$
$t_{i+1}^R$ in (18) is expressed as follows:
$$t_{i+1}^R = t_i^R - \frac{x_i^{comp}}{f_i} - t_i^{stall}. \tag{19}$$
By replacing $f_i$ and $t_{i+1}^R$ with (17) and (19), $f_{i+1}$ in (18) is rearranged as follows:
$$f_{i+1} = \frac{w_{i+1}}{(t_i^R - s_i)\,\delta_i} \tag{20}$$
where
$$\delta_i = 1 - \frac{x_i^{comp}}{w_i} - \frac{\Delta t_i^{stall}}{t_i^R - s_i} \tag{21}$$
$$\Delta t_i^{stall} = (t_i^{stall} + s_{i+1}) - s_i. \tag{22}$$
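The rearrangement from (17)–(19) to (20) can be checked numerically. The sketch below computes $f_{i+1}$ both directly, via (18) and (19), and through the closed form (20) with the factor from (21) and (22); the function and variable names (including `delta` and `d_stall` for the two auxiliary quantities) are assumptions made for illustration, and the inputs are arbitrary.

```python
def next_frequency(w_i, w_next, t_rem, s_i, s_next, x_i, t_stall_i):
    """Return (f_i, f_{i+1} via (18)-(19), f_{i+1} via (20)-(22))."""
    f_i = w_i / (t_rem - s_i)                          # (17)
    t_rem_next = t_rem - x_i / f_i - t_stall_i         # (19)
    f_direct = w_next / (t_rem_next - s_next)          # (18)
    d_stall = (t_stall_i + s_next) - s_i               # (22)
    delta = 1.0 - x_i / w_i - d_stall / (t_rem - s_i)  # (21)
    f_closed = w_next / ((t_rem - s_i) * delta)        # (20)
    return f_i, f_direct, f_closed


if __name__ == "__main__":
    # s_i = t_stall_i + s_next here, so the stall correction in (22) is zero.
    print(next_frequency(100.0, 60.0, 12.0, 2.0, 1.0, 40.0, 1.0))
    # A stall prediction that differs from the sum makes the correction nonzero.
    print(next_frequency(100.0, 60.0, 12.0, 2.5, 1.0, 40.0, 1.0))
```

Both routes give the same $f_{i+1}$, which is what the algebraic substitution requires.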
When the memory stall times of $n_i$ and $n_{i+1}$, i.e., $t_i^{stall}$ and $t_{i+1}^{stall}$, are unit functions, the remaining memory stall time prediction is set to the sum of the memory stall times of the remaining program regions, i.e., $s_i = t_i^{stall} + t_{i+1}^{stall}$. In the same manner, $s_{i+1}$ is set to $t_{i+1}^{stall}$ because $n_{i+1}$ is the leaf, i.e., last, program region in this case. Therefore, $\Delta t_i^{stall}$ in (22) becomes zero, thereby, $\delta_i$
Ei Esi
=
wi
wi
comp
Eli
wi
comp
comp
comp
comp
(tiR si )bs
comp
Esi
wi
comp
xi
comp
comp
comp
xi
(25)
= 0.
comp
(wi xi )bs +1
comp
= xi
comp
si+1 .
+w
(26)
comp
wsi
comp
A. Case 1: Both xi
Esi Ji Ji+1
(24)
comp
bs
as
wsi+1
comp
comp
wbi s xi
= R
+
xi+1
comp
b
s
(ti si )
1 xi /wi
wsistall
comp
xi
xi
comp
comp
xi
xi
comp
tistall
comp ,
wl
i+1
stall bl +1 stall
(wli+1
) ti+1
stall
+ wl
i+1
comp
xi
xi
comp
(28)
b 1+2
comp
xi
wbi
stall bs +1 stall
(wsi+1
) ti+1
tistall
sstall
w
i+1
wlistall
b 1+2
comp
xi
(29)
21
comp
xi
tistall
i+1
+ wb
stall
wbi+1 ti+1
(30)
stall
sstall
w
i+1 , wli+1 ,
comp
B. Case 2: xi
Unit Function
$$E_{s_i}^{comp}\big|_{J_i J_{i+1}} = \frac{a_s}{(t_i^R - s_i)^{b_s}} \left( w_i^{b_s}\, \bar{x}_i^{comp} + (w_{s_{i+1}}^{comp})^{b_s}\, x_{i+1}^{comp} \sum_{j=1}^{N_c} \frac{p_i^{comp}(j)}{(1 - x_i^{comp}(j)/w_i)^{b_s}} \right) \tag{31}$$
where $N_c$ is the number of quantized levels of $x_i^{comp}$ in its PDF. $p_i^{comp}(j)$ represents the probability of $x_i^{comp}$ falling into the $j$th quantized level. Note that, in this case where $t_i^{stall}$ is given as a unit function, the joint PDF ($J_i$) of $x_i^{comp}$ and $t_i^{stall}$ is the same as the PDF of $x_i^{comp}$ at the given $t_i^{stall}$, i.e., $p_i^{comp}$. $w_{s_i}^{comp}$ can be obtained by finding $w_i$ which satisfies the following relation:
$$\frac{\partial E_{s_i}^{comp}}{\partial w_i} = \frac{a_s b_s w_i^{b_s - 1}}{(t_i^R - s_i)^{b_s}} \left( \bar{x}_i^{comp} - (w_{s_{i+1}}^{comp})^{b_s}\, x_{i+1}^{comp} \sum_{j=1}^{N_c} \frac{x_i^{comp}(j)\, p_i^{comp}(j)}{(w_i - x_i^{comp}(j))^{b_s + 1}} \right) = 0. \tag{32}$$
$$w_{s_i}^{comp} = \tilde{x}_{s_i}^{comp} + \hat{w}_{s_{i+1}}^{comp} \tag{33}$$
where $\tilde{x}_{s_i}^{comp}$ is the effective workload of program region $n_i$ for $E_{s_i}^{comp}$. $\hat{w}_{s_{i+1}}^{comp}$ is obtained in the same way as presented in (26). From our observation that the energy-optimal workload prediction tends to have a value near the average and depends on the runtime distribution, we model $\tilde{x}_{s_i}^{comp}$ as follows:
$$\tilde{x}_{s_i}^{comp} = (1 + \gamma_{s_i}^{comp})\, \bar{x}_i^{comp} \tag{34}$$
where $\gamma_{s_i}^{comp}$ is a parameter which represents the ratio of the distance between $\tilde{x}_{s_i}^{comp}$ and $\bar{x}_i^{comp}$ to $\bar{x}_i^{comp}$. We calculate $\tilde{x}_{s_i}^{comp}$ by exploiting the pre-characterization of solutions. First, we prepare a lookup table $LUT_s^{comp}$ for $\gamma_{s_i}^{comp}$ during design time and perform table lookup to obtain $\gamma_{s_i}^{comp}$ during runtime. $\gamma_{s_i}^{comp}$ depends on the shape of the runtime distribution. Thus, we derived the indexes of $LUT_s^{comp}$ as follows:
1) Index 1: $\sigma_i^{comp}/\bar{x}_i^{comp}$, normalized standard deviation ($\sigma_i^{comp}$) with respect to the mean of $n_i$ ($\bar{x}_i^{comp}$);
2) Index 2: $g_i^{comp}$, skewness of $x_i^{comp}$;
3) Index 3: $\bar{x}_i^{comp}/\hat{w}_{s_{i+1}}^{comp}$, ratio of the mean of $n_i$ ($\bar{x}_i^{comp}$) to the effective remaining workload of $n_{i+1}$ ($\hat{w}_{s_{i+1}}^{comp}$).
The rationale of choosing the three indexes is as follows. By substituting $w_{s_i}^{comp}$ with (33) and (34), (32) is rearranged as
$$\bar{x}_i^{comp} - (\hat{w}_{s_{i+1}}^{comp})^{b_s + 1} \sum_{j=1}^{N_c} \frac{x_i^{comp}(j)\, p_i^{comp}(j)}{\left( (1 + \gamma_{s_i}^{comp})\, \bar{x}_i^{comp} + \hat{w}_{s_{i+1}}^{comp} - x_i^{comp}(j) \right)^{b_s + 1}} = 0. \tag{35}$$
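To make the role of relation (32) concrete, the sketch below solves it numerically for a quantized workload distribution. It assumes the stationarity condition takes the form "mean workload minus (effective remaining workload)^(b_s+1) times the weighted sum over quantized levels equals zero", with the effective remaining workload of the next region supplied directly as `w_hat_next`. The bisection solver and all names are illustrative assumptions; the paper itself uses pre-characterized lookup tables rather than a runtime solver.

```python
def solve_w_s(levels, probs, w_hat_next, b_s, iters=200):
    """Bisection for w_i satisfying the assumed form of (32).

    levels/probs: quantized workload levels x_i(j) and probabilities p_i(j).
    w_hat_next:   effective remaining workload of the next region (assumed given).
    """
    mean = sum(x * p for x, p in zip(levels, probs))
    c = w_hat_next ** (b_s + 1)

    def g(w):
        return mean - c * sum(x * p / (w - x) ** (b_s + 1)
                              for x, p in zip(levels, probs))

    x_max = max(levels)
    lo = x_max + 1e-9 * (1.0 + x_max)  # just above the largest level, where g < 0
    hi = x_max + w_hat_next            # here every term is at most x*p, so g >= 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    # Unit-function workload: the solution collapses to x + w_hat_next.
    print(solve_w_s([100.0], [1.0], 50.0, 1.3))
    # Two equally likely levels with mean 100.
    w = solve_w_s([80.0, 120.0], [0.5, 0.5], 50.0, 1.3)
    print(w, (w - 50.0) / 100.0 - 1.0)  # solution and the implied ratio parameter
```

The second printed value is the ratio parameter that a lookup table indexed by normalized standard deviation, skewness, and mean-to-remaining-workload ratio would store at design time.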
comp
C. Case 3: Both xi
comp
xi
Esi Ji Ji+1
bs comp
as
comp
=
wi xi
+ Zsi
R
b
s
(ti si )
(36)
where
comp
Zsi
comp
comp
Ns
Nc
j=1 k=1
J(j, k)
(i (j, k))bs
(37)
(1
comp
pi (j)
comp
xi (j)/wi )bs
TABLE II
Threshold Parameters Used in Coordination
Coordination
Step
C1
Threshold
Parameter
Condition
fscomp
b
(as f bs )
lf l )
c (af
f
b
b
s
(al f l )
c (asff )
f
b
+1
l
(al f
)
(cf )
f c
f
(al f bl +1 )
(cf )
c f
f
bl +1
(as f bs +1 )
c (al f f +cf )
f
b
+1
bs +1
l
(al f
+cf )
c (asff )
f
flcomp
C2
fbstall
flstall
(38)
C3
fsstall
fLstall
where
comp
si
Ns
Nc
j=1 k=1
comp
1 xi /wi
i (j, k)
bs
:
c
Ji (j, k).
wbi s xi
+ si
(40)
(tiR si )bs
Nc
comp
comp b comp
p
(j)
i
.
(wsi+1 ) s xi+1
comp
(1 xi /wi )bs
j=1
comp
wsi
comp
i
= xs
comp
si+1
+w
1/(bs +1)
comp
comp bs comp
scomp
w
=
s
(ws
)
x
.
i
i+1
i+1
i+1
(39)
comp
si
where
opt
(41)
(42)
comp
comp
comp
and wli
comp
to find wi
. (b)
1) Coordination for $w_i^{comp}$ (C1): A workload prediction for computational workload, $w_i^{comp}$, represents the prediction which minimizes $E_i^{comp}$, i.e., the sum of $E_{s_i}^{comp}$ and $E_{l_i}^{comp}$. Therefore, $w_i^{comp}$ depends on $w_{s_i}^{comp}$ and $w_{l_i}^{comp}$. In this coordination, we utilize the fact that $E_{l_i}^{comp}$ has exponential dependency on frequency in combined Vdd/Vbb scaling. The rationale is explained as follows. In the low frequency region, a high reverse body bias voltage can be applied, suppressing the leakage energy consumption due to high Vth. As frequency increases, |Vbb| is decreased to enable higher clock frequency operation by reducing Vth, which drastically increases leakage energy consumption.
In combined Vdd/Vbb scaling, the increase of switching energy consumption (with respect to frequency increase), i.e., $\partial e_s/\partial f$, dominates leakage energy consumption in the lower frequency region while the increase of leakage energy consumption, i.e., $\partial e_l/\partial f$, dominates others in the relatively high frequency region [27]. Therefore, when most operating frequencies fall into the frequency range where the sensitivity of switching energy consumption is much larger than that of leakage energy consumption, i.e., $\partial e_s/\partial f \gg \partial e_l/\partial f$, $w_i^{comp}$ approaches $w_{s_i}^{comp}$ because switching energy consumption is the major contributor in this frequency region. On the other hand, when the operating frequency is within the frequency region where $\partial e_s/\partial f \ll \partial e_l/\partial f$, $w_i^{comp}$ approaches $w_{l_i}^{comp}$.
We partition the frequency range into three regions: switching energy-dominant, leakage energy-dominant, and intermediate regions. The partition is done with two threshold frequencies, $f_s^{comp}$ and $f_l^{comp}$. The frequency range below $f_s^{comp}$ (above $f_l^{comp}$) is called the switching (leakage) energy-dominant region while the frequency range between the two threshold frequencies is called the intermediate region. Each energy component has two threshold frequencies as shown in Table II.
In order to identify which frequency partition the current program region belongs to, we introduce a simple evaluation metric, $f_i^{eval}$, as the upper bound of the operating frequency in the remaining program regions from $n_i$ to $n_{leaf}$
$$f_i^{eval} = \frac{WCEC_i^{comp(k)}}{t_i^R - WCET_i^{stall(k)}}. \tag{43}$$
In (43), $WCEC_i^{comp(k)}$ and $WCET_i^{stall(k)}$ represent the remaining worst-case execution cycle of computational workload and the remaining worst-case memory stall time from $n_i$ to $n_{leaf}$ when the current program phase is the $k$th program phase, respectively. The solid line in Fig. 8(a) illustrates a linear coordination method to find $w_i^{comp}$ by utilizing $f_i^{eval}$. When
$f_i^{eval}$ is lower than the threshold value, $f_s^{comp}$ (in the second row in Table II where c is set to 5.0 in our experiment), we set $w_i^{comp}$ to $w_{s_i}^{comp}$ because the remaining program regions will be operated within the switching energy-dominant frequency region. When $f_i^{eval}$ is higher than the threshold value, $f_l^{comp}$ (in the third row in Table II), we set $w_i^{comp}$ to $w_{l_i}^{comp}$. In the last case, i.e., $f_s^{comp} < f_i^{eval} < f_l^{comp}$, we set $w_i^{comp}$ in proportion to the ratio of $(f_i^{eval} - f_s^{comp})$ to $(f_l^{comp} - f_s^{comp})$ using a linear interpolation function $L()$ defined as follows:
$$L(X_{lower}, X_{upper}, Y_{lower}, Y_{upper}, X_{eval}) = \frac{X_{eval} - X_{lower}}{X_{upper} - X_{lower}} (Y_{upper} - Y_{lower}) + Y_{lower}. \tag{44}$$
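The three-way rule for the C1 coordination step can be sketched as a clamped linear interpolation. The code below assumes that the interpolation in (44) is applied with the two threshold frequencies as the X bounds and the two local-optimal predictions as the Y bounds; the function names and the numeric inputs are illustrative only.

```python
def f_eval(wcec_comp, t_rem, wcet_stall):
    """Upper bound of the operating frequency in the remaining regions, as in (43)."""
    return wcec_comp / (t_rem - wcet_stall)


def lerp(x_lower, x_upper, y_lower, y_upper, x_eval):
    """Linear interpolation function L() of (44)."""
    return (x_eval - x_lower) / (x_upper - x_lower) * (y_upper - y_lower) + y_lower


def coordinate_comp(f_ev, f_s, f_l, w_s, w_l):
    """Coordinate the switching- and leakage-optimal predictions into one value."""
    if f_ev <= f_s:   # switching energy-dominant region
        return w_s
    if f_ev >= f_l:   # leakage energy-dominant region
        return w_l
    return lerp(f_s, f_l, w_s, w_l, f_ev)  # intermediate region


if __name__ == "__main__":
    for f in (1.0, 2.0, 3.0):
        print(f, coordinate_comp(f, f_s=1.5, f_l=2.5, w_s=100.0, w_l=140.0))
```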
comp
(45)
where
$$W_s^{comp(k)} = \{ w_{s_{root}}^{comp(k)}, \ldots, w_{s_i}^{comp(k)}, \ldots, w_{s_{leaf}}^{comp(k)} \} \tag{46}$$
$$W_l^{comp(k)} = \{ w_{l_{root}}^{comp(k)}, \ldots, w_{l_i}^{comp(k)}, \ldots, w_{l_{leaf}}^{comp(k)} \} \tag{47}$$
$$W_s^{stall(k)} = \{ w_{s_{root}}^{stall(k)}, \ldots, w_{s_i}^{stall(k)}, \ldots, w_{s_{leaf}}^{stall(k)} \} \tag{48}$$
$$W_l^{stall(k)} = \{ w_{l_{root}}^{stall(k)}, \ldots, w_{l_i}^{stall(k)}, \ldots, w_{l_{leaf}}^{stall(k)} \} \tag{49}$$
$$W_b^{(k)} = \{ w_{b_{root}}^{(k)}, \ldots, w_{b_i}^{(k)}, \ldots, w_{b_{leaf}}^{(k)} \}. \tag{50}$$
TABLE III
Comparison of Energy Consumption for Test Pictures at 75 °C: (a) MPEG4 (20 Frames/s) and (b) H.264 Decoder (12 Frames/s)

(a)
Image          | RT-C-DIST [19] | DT-CM-DIST [28] | RT-CM-DIST (Proposed)
Rush Hour      | 1.08           | 0.83            | 0.79
Station2       | 1.34           | 0.97            | 0.95
Sunflower      | 0.99           | 0.76            | 0.74
Tractor        | 1.01           | 0.78            | 0.75
SnowMnt        | 1.02           | 0.90            | 0.81
InToTree       | 0.97           | 0.79            | 0.71
ControlledBurn | 0.88           | 0.67            | 0.65
TouchdownPass  | 1.15           | 0.91            | 0.86
Average        | 1.05           | 0.83            | 0.78

(b)
Image          | RT-C-DIST [19] | DT-CM-DIST [28] | RT-CM-DIST (Proposed)
Rush Hour      | 1.11           | 0.94            | 0.93
Station2       | 1.05           | 0.90            | 0.83
Sunflower      | 1.09           | 0.97            | 0.88
Tractor        | 1.18           | 1.00            | 0.96
SnowMnt        | 1.03           | 1.03            | 0.84
InToTree       | 1.14           | 1.00            | 0.93
ControlledBurn | 1.07           | 0.94            | 0.88
TouchdownPass  | 1.10           | 0.99            | 0.94
Average        | 1.10           | 0.97            | 0.90

Fig. 9.
Fig. 10. Distribution of memory boundedness in (a) MPEG4 and (b) H.264
decoder.
TABLE IV
Comparison of Energy Consumption for SnowMnt at Four Temperature Levels

Decoder    | Temp (°C) | RT-C-DIST [19] | DT-CM-DIST [28] | RT-CM-DIST (Proposed)
MPEG4 dec. | 25        | 1.15           | 0.94            | 0.86
MPEG4 dec. | 50        | 1.10           | 0.92            | 0.84
MPEG4 dec. | 75        | 1.02           | 0.90            | 0.81
MPEG4 dec. | 100       | 0.96           | 0.89            | 0.79
H.264 dec. | 25        | 1.06           | 1.02            | 0.91
H.264 dec. | 50        | 1.05           | 1.02            | 0.88
H.264 dec. | 75        | 1.03           | 1.03            | 0.84
H.264 dec. | 100       | 1.01           | 1.03            | 0.80
TABLE V
Comparison of Energy Savings for DarkKnight at 75 °C

Decoder    | RT-C-DIST [19] | DT-CM-DIST [28] | RT-CM-DIST (Proposed)
MPEG4 dec. | 1.26           | 1.20            | 0.89
H.264 dec. | 1.16           | 1.16            | 0.89
TABLE VI
Summary of Runtime Overhead

Source of Runtime Overhead        | Amount
Local-optimal workload prediction | 40 400–52 400 cycles
Coordination                      | 2720–4780 cycles
Feasibility check                 | 407–3560 cycles
decoder, respectively. The largest energy savings are obtained for SnowMnt with both the MPEG4 and H.264 decoders, because this clip has distinctive program phase behavior. Since DT-CM-DIST finds the optimal workload using the first 100 frames (design-time fixed training input), whose behavior is totally different from that of the remaining frames (runtime-varying input), it cannot provide a proper voltage and frequency setting.
To further investigate the effectiveness of considering complex program phase behavior, we performed another experiment using 3000 frames of a movie clip. Program phase behavior is more clearly observed in a movie clip whose scenes are fast moving. Table V shows normalized energy consumption at 75 °C when decoding the movie clip from Dark Knight in the MPEG4 and H.264 decoders, respectively. RT-CM-DIST outperforms DT-CM-DIST by up to 26.3% and 23.3% for the MPEG4 and H.264 decoders, respectively. This is because, in the movie clip, complex program phase behavior exists due to frequent scene changes, as Fig. 1(d) shows.
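For readers who want to relate the normalized energies in Table V to percentage savings, the arithmetic can be sketched as below. The helper name `relative_saving` and the dictionary layout are assumptions made for illustration; the inputs are the two-decimal table entries, so the results can differ slightly from percentages computed from unrounded measurements.

```python
def relative_saving(e_proposed, e_baseline):
    """Fractional energy saving of the proposed method relative to a baseline."""
    return 1.0 - e_proposed / e_baseline


# Normalized energy values copied from Table V (rounded to two decimals).
TABLE_V = {
    "MPEG4": {"DT-CM-DIST": 1.20, "RT-CM-DIST": 0.89},
    "H.264": {"DT-CM-DIST": 1.16, "RT-CM-DIST": 0.89},
}

savings_pct = {
    decoder: round(100.0 * relative_saving(row["RT-CM-DIST"], row["DT-CM-DIST"]), 1)
    for decoder, row in TABLE_V.items()
}

if __name__ == "__main__":
    print(savings_pct)
```

With the rounded table entries this gives roughly 25.8% for MPEG4 and 23.3% for H.264.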
C. Overhead
1) Runtime Overhead: We measured the runtime overhead of the proposed online method, i.e., RT-CM-DIST, using PAPI [31]. The proposed method consists of three parts: local-optimal workload prediction, coordination, and feasibility check. Table VI shows the runtime overhead of the proposed method. The local-optimal workload prediction of a program region consumes 40 400–52 400 clock cycles when PHASE UNIT is set to 20 frames. Note that the local-optimal workload prediction is performed at every PHASE UNIT. The coordination and the feasibility check, which are performed at the start of every program region, take 2720–4780 and 407–3560 clock cycles, respectively. The total runtime overhead in Table VI amounts to 0.38% and 0.25% of the average execution cycles in the case of the MPEG4 and H.264 decoders, respectively.
2) Memory Overhead of LUTs: As explained in Section VI-B, the presented method requires three temperature-independent LUTs, i.e., $LUT_s^{comp}$, $LUT_s^{stall}$, and $LUT_b$, and two temperature-dependent LUTs, i.e., $LUT_l^{comp}$ and $LUT_l^{stall}$. The LUTs incur memory overhead. The memory overhead largely depends on the number of steps (scales) in the indexes of the LUTs. The more steps are used, the more accurate workload prediction will be achieved with a higher memory
X. Conclusion
In this paper, we presented a novel online DVFS method
which exploits the distribution of both computational workload
and memory stall workload during program runs in combined
Vdd /Vbb scaling. To reduce the complexity of our previous
design-time solution [28], we presented a DVFS method
consisting of two steps: local-optimal workload prediction and
coordination. In the local-optimal workload prediction step, we
periodically calculated five local-optimal workload predictions
each of which minimized single energy component under the
joint PDF of computational cycle and memory stall time,
which is profiled during runtime. To further reduce the runtime
overhead, we prepared tables which are pre-characterized
in design time based on the analytical formulation. During
runtime, we utilized them to find local-optimal workloads. In
the coordination step, we found the global workload prediction
by coordinating the five local-optimal workload predictions.
Experimental results show that the proposed method offers up
to 34.6% and 17.3% energy savings for MPEG4 and H.264
decoder, respectively, compared with the existing method [4].
References
[1] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, "Memory access scheduling," in Proc. ISCA, 2000, pp. 128–138.
[2] K. Govil, E. Chan, and H. Wasserman, "Comparing algorithms for dynamic speed-setting of a low-power CPU," in Proc. MOBICOM, 1995, pp. 13–25.
[3] Y. Gu and S. Chakraborty, "Control theory-based DVS for interactive 3-D games," in Proc. DAC, 2008, pp. 740–745.
[4] K. Choi, R. Soma, and M. Pedram, "Fine-grained dynamic voltage and frequency scaling for precise energy and performance tradeoff based on the ratio of off-chip access to on-chip computation times," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 1, pp. 18–28, Jan. 2005.
[5] W.-Y. Liang, S.-C. Chen, Y.-L. Chang, and J.-P. Fang, "Memory-aware dynamic voltage and frequency prediction for portable devices," in Proc. RTCSA, 2008, pp. 229–236.
[6] G. Dhiman and T. S. Rosing, "System-level power management using online learning," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no. 5, pp. 676–689, May 2009.
[7] A. Azevedo, I. Issenin, R. Cornea, R. Gupta, N. Dutt, A. Veidenbaum, and A. Nicolau, "Profile-based dynamic voltage scheduling using program checkpoints," in Proc. DATE, 2002, pp. 168–175.
[8] D. Shin and J. Kim, "Optimizing intra-task voltage scheduling using data flow analysis," in Proc. ASPDAC, 2005, pp. 703–708.
[9] J. Seo, T. Kim, and J. Lee, "Optimal intratask dynamic voltage-scaling technique and its practical extensions," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 25, no. 1, pp. 47–57, Jan. 2006.
[10] S. Hong, S. Yoo, H. Jin, K.-M. Choi, J.-T. Kong, and S.-K. Eo, "Runtime distribution-aware dynamic voltage scaling," in Proc. ICCAD, 2006, pp. 587–594.
[11] S. Hong, S. Yoo, B. Bin, K.-M. Choi, S.-K. Eo, and T. Kim, "Dynamic voltage scaling of supply and body bias exploiting software runtime distribution," in Proc. DATE, 2008, pp. 242–247.
[12] J. R. Lorch and A. J. Smith, "Improving dynamic voltage scaling algorithm with PACE," ACM SIGMETRICS Perform. Eval. Rev., vol. 29, no. 1, pp. 50–61, Jun. 2001.
[13] C. Xian and Y.-H. Lu, "Dynamic voltage scaling for multitasking real-time systems with uncertain execution time," in Proc. GLSVLSI, 2006, pp. 392–397.
[14] T. Sherwood, E. Perelman, G. Hamerly, S. Sair, and B. Calder, "Discovering and exploiting program phases," IEEE Micro, vol. 23, no. 6, pp. 84–93, Nov. 2003.
[15] T. Sherwood, S. Sair, and B. Calder, "Phase tracking and prediction," in Proc. ISCA, 2003, pp. 336–347.
[16] Q. Wu, M. Martonosi, D. W. Clark, V. J. Reddi, D. Connors, Y. Wu, J. Lee, and D. Brooks, "A dynamic compilation framework for controlling microprocessor energy and performance," in Proc. IEEE MICRO, 2005, pp. 271–282.
[17] C. Isci, G. Contreras, and M. Martonosi, "Live, runtime phase monitoring and prediction on real systems with application to dynamic power management," in Proc. MICRO, 2006, pp. 359–370.
[18] S.-Y. Bang, K. Bang, S. Yoon, and E.-Y. Chung, "Run-time adaptive workload estimation for dynamic voltage scaling," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no. 9, pp. 1334–1347, Sep. 2009.
[19] J. Kim, S. Yoo, and C.-M. Kyung, "Program phase and runtime distribution-aware online DVFS for combined Vdd/Vbb scaling," in Proc. DATE, 2009, pp. 417–422.
[20] T. Mudge, K. Flautner, D. Blaauw, and S. M. Martin, "Combined dynamic voltage scaling and adaptive body biasing for lower power microprocessors under dynamic workloads," in Proc. ICCAD, 2002, pp. 721–725.
[21] W. Liao, L. He, and K. M. Lepak, "Temperature and supply voltage aware performance and power modeling at microarchitecture level," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 7, pp. 1042–1053, Jul. 2005.
[22] Cacti5.3 [Online]. Available: http://www.hpl.hp.com/research/cacti
[23] BPTM High-k/Metal Gate 32 nm High Performance Model [Online]. Available: http://www.eas.asu.edu/ptm
[24] K. Puttaswamy and G. H. Loh, "Thermal herding: Microarchitecture techniques for controlling hotspots in high-performance 3-D-integrated processors," in Proc. HPCA, 2007, pp. 193–204.
[25] N. Kavvadias, P. Neofotistos, S. Nikolaidis, C. A. Kosmatopoulos, and T. Laopoulos, "Measurement analysis of the software-related power consumption in microprocessors," IEEE Trans. Instrum. Meas., vol. 53, no. 4, pp. 1106–1112, Aug. 2004.
[26] S. Oh, J. Kim, S. Kim, and C.-M. Kyung, "Task partitioning algorithm for intra-task dynamic voltage scaling," in Proc. ISCAS, 2008, pp. 1228–1231.
[27] J. Kim, S. Oh, S. Yoo, and C.-M. Kyung, "An analytical dynamic scaling of supply voltage and body bias based on parallelism-aware workload and runtime distribution," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no. 4, pp. 568–581, Apr. 2009.
[28] J. Kim, Y. Lee, S. Yoo, and C.-M. Kyung, "An analytical dynamic scaling of supply voltage and body bias exploiting memory stall time variation," in Proc. ASPDAC, 2010, pp. 575–580.
[29] FFMPEG [Online]. Available: http://www.ffmpeg.org
[30] VQEG [Online]. Available: ftp://vqeg.its.bldrdoc.gov
[31] PAPI [Online]. Available: http://icl.cs.utk.edu/papi