
Task Partitioning Algorithm for Intra-Task Dynamic Voltage Scaling
Seungyong Oh, Jungsoo Kim, Seonpil Kim and Chong-Min Kyung
Dept. of Electrical Engineering & Computer Science
KAIST (Korea Advanced Institute of Science and Technology)
Email: {syoh, jskim, spkim} @vslab.kaist.ac.kr, kyung@ee.kaist.ac.kr

Abstract—Dynamic voltage scaling (DVS) is a very powerful technique for reducing the dynamic power consumption of CMOS circuits. Recent studies have shown that the intra-task DVS method, which adjusts the voltage level during program execution, can achieve significant energy reduction. However, the overhead of a large number of voltage switchings limits its practical implementation. In this paper, we propose a novel task partitioning algorithm for intra-task DVS which partitions a given task so that DVS can be applied more effectively with a minimum number of voltage switchings. Experimental results using H.264 decoder software show that the proposed algorithm reduces energy consumption by up to 25% over the conventional method.

I. INTRODUCTION

Power efficiency is one of the biggest concerns in modern embedded system design. Many techniques have been developed to reduce the energy consumption of the microprocessor, which is one of the major contributors to the energy consumption of an embedded system. Dynamic voltage scaling (DVS) reduces the power consumption of a processor by exploiting the quadratic dependence of active power consumption on supply voltage.

DVS can be classified into two methods according to its scaling granularity. The first is inter-task DVS, which adjusts the voltage/frequency level between tasks. A task in this context means an independent piece of software that can be scheduled by an operating system. The second is intra-task DVS, which adjusts the voltage/frequency level a number of times within a single task boundary. This method is based on the fact that the execution cycle count of software is not deterministic but shows a profile with large variation. Hence, intra-task DVS can efficiently utilize the time slack caused by run-time variation of the execution cycle count.
The effectiveness of the intra-task DVS method has been studied in many works. In most cases, intra-task DVS outperforms inter-task DVS. However, it also has a limitation: a large number of voltage switchings must be executed to achieve a large energy reduction. Because intra-task DVS has to adjust its supply voltage level dynamically during the task's execution, the overhead of voltage switching is unavoidable. In fact, it has been shown that intra-task DVS becomes more powerful as the number of voltage switchings increases [3], [7].

This voltage switching overhead becomes a main bottleneck in practical implementations. Even a current state-of-the-art DC-DC converter takes tens of μs for a voltage switch [1]. As the number of voltage switchings increases, the time consumed by voltage switching also increases, and the time left for executing the real application decreases. Hence, voltage switching has to be performed very efficiently to minimize this overhead in intra-task DVS.

978-1-4244-1684-4/08/$25.00 ©2008 IEEE
The goal of our research is to maximize the effectiveness of intra-task DVS by minimizing the voltage switching overhead for hard real-time applications. More specifically, we propose an algorithm which partitions a program into a number of code sections in such a way that voltage switching is performed only when necessary. Our experiments show that the number of voltage switchings can be reduced drastically while keeping the same amount of energy reduction. Furthermore, additional energy reduction is also achievable by the proposed method.

The rest of the paper is organized as follows. Section II presents related work on intra-task DVS. Section III describes the details of the proposed method, called the task partitioning algorithm. Section IV shows the experimental results of the proposed method. Finally, we draw our conclusions in Section V.
II. RELATED WORK

Lee and Sakurai [3] proposed the idea of the intra-task DVS method. Azevedo et al. [4] used program checkpoints to apply intra-task DVS. Their work utilized the time slack by analyzing the worst-case execution cycles of the remaining task. However, the variation of the remaining workload and execution path was not considered. Seo et al. [5] proposed the virtual execution path; in that work, they tried to find the optimal execution path among the many paths caused by branches. As in [4], they also took the worst-case execution cycles as the remaining workload of each path. Hong et al. [7] proposed a profile-based remaining workload prediction method, in which the statistically optimal remaining workload is determined using the profile information of each performance region.

All these previous works have focused on which level the voltage has to be adjusted to. However, no work has considered at which points of the program's execution DVS should be applied, taking the voltage switching overhead into account.

III. PROPOSED METHOD

A. Preliminary

Fig. 1. Intra-task DVS: Task 1 is divided into code sections 1 to N, with remaining workloads W_R1, ..., W_RN.

1) Intra-task dynamic voltage scaling: In intra-task dynamic voltage scaling, a task is divided into a number of code sections as shown in Fig. 1. The code section is the basic unit of voltage scaling: within a code section the voltage is maintained at the same level, and the voltage can be switched to a different level between code sections. If we could estimate the remaining workload (how many cycles are left) exactly, it would be possible to lower the frequency of each section as much as possible using the simple equation

f_i = w_{Ri} / t_i    (1)

where w_{Ri} is the remaining workload of the i-th code section and t_i is a run-time parameter denoting the remaining time from the start of the i-th code section to the task's deadline. Many methods to find the value of w_{Ri} have been proposed in the works introduced in the previous section.
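As an illustration, eq. (1) can be combined with the discrete-level selection rule described later in Section IV (choose the smallest available frequency not below f_i). This is only a sketch: the frequency list and the workload numbers below are assumptions for illustration, not the platform's actual values.

```python
# Illustrative sketch of eq. (1) plus discrete-level selection.
# The level set below is an assumption, not the paper's platform values.
AVAILABLE_FREQS_MHZ = [120, 160, 200, 240, 280, 320, 360, 400, 440, 480]

def ideal_frequency_hz(w_remaining_cycles, t_remaining_sec):
    """Eq. (1): f_i = w_Ri / t_i (continuous frequency, in Hz)."""
    return w_remaining_cycles / t_remaining_sec

def selected_level_mhz(w_remaining_cycles, t_remaining_sec):
    """Smallest available level not below the ideal frequency."""
    f_opt_mhz = ideal_frequency_hz(w_remaining_cycles, t_remaining_sec) / 1e6
    for level in AVAILABLE_FREQS_MHZ:
        if level >= f_opt_mhz:
            return level
    return AVAILABLE_FREQS_MHZ[-1]  # cannot meet the deadline: run at max

# 3M cycles left and 20 ms to the deadline -> ideal 150 MHz -> 160 MHz level
print(selected_level_mhz(3_000_000, 0.020))  # -> 160
```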
2) Remaining workload prediction: The prediction method for the remaining workload of each code section (w_{Ri}) significantly affects the performance of intra-task DVS. In this paper, we use the profile-based remaining workload prediction method of [7], in which the value of w_{Ri} that minimizes the average energy consumption is determined as an analytical solution of the energy equation. The w_{Ri} for each code section is stored in a table, and is referenced during the task's execution to calculate the proper frequency level using (1).
3) Energy model: We assume that the amount of energy spent executing a code section is expressed as

E ∝ f² · n_total    (2)

where f is the performance level (i.e., frequency) and n_total is the total execution cycle count of the code section. The validity of this equation is shown in [6], [7].
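A one-line consequence of eq. (2): if slack allows the frequency to be halved for the same cycle count, the energy drops to a quarter. The constant k below is an arbitrary stand-in for the technology-dependent proportionality factor.

```python
def energy(f_hz, n_cycles, k=1.0):
    """Eq. (2): E = k * f^2 * n_total, with k an arbitrary constant."""
    return k * f_hz ** 2 * n_cycles

# Halving the frequency for the same cycle count quarters the energy.
e_full = energy(100e6, 1_000_000)
e_half = energy(50e6, 1_000_000)
print(e_half / e_full)  # -> 0.25
```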
4) Hard real-time applications: Real-time applications can be classified into two types. The first is the soft real-time application, which has to keep a certain level of quality of service (QoS) although its execution does not necessarily meet the deadline; occasional deadline misses occur and are accompanied by quality degradation. The other is the hard real-time application, which has to meet the deadline under any circumstances. This type of application guarantees the quality of the output and is the target of this work.

B. Motivation

Our proposed method is based on two basic observations. The following two subsections explain them with examples; the details of the proposed algorithm are then presented.

1) Profile variation: The first observation is that DVS should be applied more frequently to a code section which has a large variation in its execution cycles.
Fig. 2. Example 1: Difference in energy consumption when the variance of the execution cycles is considered. (Case A: nodes n1(300), n2(300), n3(400); normalized energy 100.00 on average, 52.38 when n1 takes 1/3 of its average cycles, and 274.38 when it takes 2x. Case B: n1 split into two nodes of 150 cycles and n2, n3 merged into one node of 700 cycles; normalized energy 100.00, 51.93, and 241.26, respectively.)

In Fig. 2, each node represents a code section, and its average execution cycles are shown below the node's name. In case A, the task is partitioned so that each section has similar execution cycles; the large variation in the execution cycles of n1 is shown as a profile. In case B, however, n1 is divided into two nodes, and n2 and n3 are merged into one node, so that the variance of each node becomes smaller. For simplicity, the predicted remaining workload of each node is assumed to be the sum of the average cycles of the node and its subsequent nodes. For example, the predicted remaining workload of n1 in case A is 1k cycles.

When the actual execution cycles equal the predicted values, there is no difference between the two cases in energy consumption. However, when the execution cycles of n1 in case A become 1/3 and 2 times its average value, and the same happens to n1 and n2 in case B, the energy consumption of case B becomes smaller than that of case A by 0.8% and 12%, respectively.

This result shows that a code section with a large variance is better divided into smaller code sections so that the frequency can be adapted to the variation quickly. It also indicates that nodes with small variance can be merged into one node to reduce the voltage transition overhead.
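The effect in Example 1 can be reproduced with a small simulation under the stated assumptions: the remaining workload is predicted as the sum of average cycles, frequencies are set by eq. (1), and energy follows eq. (2). The node sizes below follow the Fig. 2 example; the deadline value is an arbitrary choice, since only ratios matter.

```python
def run_task(avg_cycles, actual_cycles, deadline):
    """Simulate one intra-task DVS run.

    At each section, the frequency is set by eq. (1) using the predicted
    remaining workload (sum of average cycles of this and later sections),
    and energy is accumulated by eq. (2) with the cycles actually taken.
    """
    elapsed, energy = 0.0, 0.0
    for i, n_actual in enumerate(actual_cycles):
        w_remaining = sum(avg_cycles[i:])       # predicted remaining workload
        f = w_remaining / (deadline - elapsed)  # eq. (1)
        elapsed += n_actual / f
        energy += f * f * n_actual              # eq. (2), constant omitted
    return energy

T = 10.0
# Case A: similar-sized sections; the first section runs at 2x its average.
e_a = run_task([300, 300, 400], [600, 300, 400], T)
# Case B: the high-variance section is split in two, the rest merged;
# the same 2x deviation is now spread over the first two sections.
e_b = run_task([150, 150, 700], [300, 300, 700], T)
print(e_a, e_b, 1 - e_b / e_a)  # case B saves roughly 12%, as in the text
```

The finer partitioning in case B lets the frequency react after only 150 average cycles instead of 300, which is where the saving comes from.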
2) Remaining workload: When the actual execution cycles of a code section differ from the value predicted by the method of [7], additional energy, called the misprediction penalty, is consumed. This misprediction penalty for the same variation grows as the remaining workload decreases.


Fig. 3. Example 2: Variation in the energy penalty by misprediction with different remaining workloads. (Case A: n1(500), n2(500); case B: n1(700), n2(300). With a misprediction of +100 cycles in n1, the normalized energy consumption rises from 100.00 to 138.13 in case A and to 147.50 in case B.)

3) Energy penalty for misprediction: We now derive the energy penalty equation.

Fig. 4. Energy penalty for misprediction: two nodes n0(X0) and n1(X1); in the mispredicted case, n0 takes X0 + ΔX cycles.

In Fig. 4, X0 and X1 denote the execution cycles of the nodes n0 and n1, respectively, ΔX represents the cycles added by misprediction, and T is the program deadline. The energy consumption for Fig. 4 can be expressed as follows:

E_opt = ( (x_0 + x_1)/T )² x_0 + [ x_1 / ( T − (x_0/(x_0 + x_1)) T ) ]² x_1

E_miss = ( (x_0 + x_1)/T )² (x_0 + Δx) + [ x_1 / ( T − ((x_0 + Δx)/(x_0 + x_1)) T ) ]² x_1

where E_opt and E_miss are the energy consumption when the remaining workload is correctly predicted and when it is mispredicted by ΔX, respectively. The energy misprediction penalty is expressed as the difference between these two values:

E_pnt = E_miss − E_opt = ( (x_0 + x_1)/T )² [ Δx + x_1 [ ( x_1/(x_1 − Δx) )² − 1 ] ]    (3)

This penalty increases as the execution cycle variation grows and as the remaining workload decreases, as expected.

Fig. 5. Energy penalty by misprediction with variation of the remaining workload and the execution cycle variation.

C. Task partitioning algorithm

In this section, we present the design flow for intra-task DVS with our task partitioning algorithm (TPA).

At the beginning, the original source code is divided into the maximum number of code sections. These become the unit code sections, which cannot be divided into smaller sections. Then, the execution cycle profile of each node is obtained by static simulation, and its misprediction penalty is calculated using eq. (3). This penalty is used as a measure to decide whether a node should be merged or not. If the penalty of a node is smaller than a threshold level, which is given as an input parameter of the algorithm, the node is merged with the next node to reduce the overhead of voltage switching.¹ Otherwise, the node remains unchanged to take best advantage of DVS.

Once a node is merged, the misprediction penalties of all nodes are re-calculated. This procedure is repeated until the misprediction penalty of every node becomes larger than the threshold level. The optimal number of nodes, which gives the maximum energy reduction, is obtained by sweeping the threshold level from the lowest to the maximum value of the nodes' energy penalties. The design flow with the code section partitioning algorithm is shown in Fig. 6.

Fig. 6. The design flow with the task partitioning algorithm: original software → static simulation (profiling info.) → task partitioning → check "E_pnt > E_th for all nodes?" → if not, merge nodes and repeat → complete.
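The merge loop of the task partitioning algorithm, together with eq. (3), can be sketched as follows. This is an interpretation, not the paper's exact implementation: how the per-node misprediction ΔX is chosen is not fixed by the text, so the standard deviation of each node's profile is used here as a plausible stand-in. Profiles are per-run cycle counts from simulation; summing them when merging keeps the covariance terms of footnote 1 implicitly.

```python
from statistics import mean, pstdev

def epnt(x0, x1, dx, T):
    """Eq. (3): penalty when a node of x0 cycles runs dx cycles long."""
    f = (x0 + x1) / T
    return f * f * (dx + x1 * ((x1 / (x1 - dx)) ** 2 - 1))

def node_penalty(nodes, i, T):
    """Penalty of node i; nodes[i] is its per-run cycle-count profile."""
    x0 = mean(nodes[i])
    x1 = sum(mean(p) for p in nodes[i + 1:])  # average remaining workload
    dx = pstdev(nodes[i])                     # stand-in for misprediction dx
    if x1 <= dx:                              # degenerate case: never merge
        return float("inf")
    return epnt(x0, x1, dx, T)

def partition(nodes, e_th, T):
    """Greedy TPA: merge any node whose penalty is below threshold e_th."""
    nodes = [list(p) for p in nodes]
    changed = True
    while changed and len(nodes) > 1:
        changed = False
        for i in range(len(nodes) - 1):
            if node_penalty(nodes, i, T) < e_th:
                # Summing per-run samples merges the profiles; the merged
                # variance then includes the covariances of footnote 1.
                nodes[i] = [a + b for a, b in zip(nodes[i], nodes[i + 1])]
                del nodes[i + 1]
                changed = True
                break
    return nodes
```

Sweeping e_th from the smallest to the largest node penalty and evaluating each resulting partitioning reproduces the threshold search described above.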

In Fig. 3, the total execution cycles of both cases are the same. In case A, however, the remaining workload after n1 is 200 cycles larger than that of case B. When a misprediction of 100 cycles occurs in both cases, the energy consumption in case B is much larger than that in case A, as shown in Fig. 3.

This example shows that for the same amount of misprediction, the node with the smaller remaining workload incurs the larger energy penalty. In other words, as the remaining workload decreases (i.e., as the program proceeds), the misprediction penalty for the same variation grows. This is because when the remaining workload is large, the penalty for a misprediction can be amortized over a large number of remaining cycles.

D. Feasibility check

The feasibility condition for the real-time constraint can be expressed as follows:

n_cur.x_wst / n_cur.f_n + n_remain.x_wst / f_max ≤ T_n    (4)

where f_n and T_n denote the desired frequency level and the remaining time for the current node, respectively, and x_wst denotes the worst-case execution cycles of a node. When this condition is not satisfied, the frequency of the node has to be increased until the condition is met.

¹ When N nodes with profiles X_i are merged, the profile and the variance of the merged node become X = Σ_{i=1}^{N} X_i and σ_X² = Σ_{i=1}^{N} Σ_{j=1}^{N} Cov(X_i, X_j).
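The check in eq. (4) can be sketched as follows, assuming an illustrative four-level frequency set (the paper's platform has 12 levels): if the current node's worst case at the desired level, plus the rest of the task at f_max, cannot fit in the remaining time T_n, the level is raised.

```python
# Sketch of the feasibility check (eq. 4). The level set is an
# illustrative assumption, not the paper's 12-level platform.
F_LEVELS_HZ = [120e6, 240e6, 360e6, 480e6]
F_MAX_HZ = F_LEVELS_HZ[-1]

def enforce_feasibility(f_n, x_wst_cur, x_wst_remain, t_n):
    """Return the lowest level >= f_n that satisfies eq. (4)."""
    for f in F_LEVELS_HZ:
        if f >= f_n and x_wst_cur / f + x_wst_remain / F_MAX_HZ <= t_n:
            return f
    return F_MAX_HZ  # no level satisfies eq. (4): fall back to maximum

# Desired 120 MHz violates eq. (4) here, so the check raises it to 240 MHz.
print(enforce_feasibility(120e6, 2.4e6, 9.6e6, 0.032) / 1e6)  # -> 240.0
```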
IV. EXPERIMENTAL RESULT

We performed our experiments on an ARM-based embedded system platform consisting of an ARM946 processor with caches enabled, an AHB bus, and a memory controller with SDRAM.

The maximum and minimum voltage levels for the processor were set to 1.2 V and 0.6 V, with 12 discrete voltage/frequency levels. The frequency can vary from 120 MHz to 480 MHz with its corresponding voltage level. At run-time, the frequency level is determined as the smallest value among the available frequency levels larger than f_i^opt [7].

We also took into account the delay overhead that accompanies DVS. The time spent executing the DVS function call was assumed to be a constant 1k cycles, and a maximum voltage transition time of 22 μs was also assumed, without loss of generality [1].

An industrial H.264 decoder was used as the hard real-time application. This target application decodes QCIF (176x144) images at 30 fps.

In our experiments, the target application was divided into 300 basic code sections after all the big loops in the original source code were unrolled to prevent re-execution of the same section. The task is assumed to run under non-preemptive scheduling. We used the SoC Designer [8] cycle-accurate model to profile the execution cycles of each code section.

We applied the proposed method with various energy threshold levels. As the energy threshold level increases, more nodes have to be merged, so the total number of code sections decreases. For comparison, we also performed the same experiment with the method in [7], in which code sections are merged with their adjacent code sections. The energy consumption measured in each case is shown in Fig. 7.


:

# code sections
1
30
60
120
180
210
270
300

E

Fig. 7.

Reduction(%)
0.00
6.69
14.70
23.27
23.72
27.61
13.83
0.00

ZWGG
w

n0

n78

n119 n150

n183 n207 n226 n

(78)

(41)

(31)

(24)

(33)

(21)

(19)

238

n n n n n n n n n n n n n n n n n n n n n n n n n n n n
j n0 n
10 20 30 40 50 60 70 80 90 100 120 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290

Fig. 8.

Task partitioning result ( with 30 code sections )

The number in parenthesis denotes the number of nodes


merged in. We can see that in proposed method, a large number
of nodes in the beginning are merged together because of its
small misprediction penalty.
V. C ONCLUSION
We presented a novel task partitioning algorithm for intratask dynamic voltage scaling. Proposed method partitions the
task into code sections by considering misprediction penalty
so that intra-task DVS can be applied more effectively. The
experimental results show that the proposed algorithm can
achieve substantial power reduction for industrial multimedia
application.
R EFERENCES

proposed
122.16
93.64
75.17
56.10
51.55
50.31 *
63.17
77.35

The energy consumed in each case is shown in Table I. The


minimum energy consumption (with asterisks) is 50.31mJ with
210 code sections when proposed method is applied while it
is 67.58mJ with 180 code sections when conventional method
is applied. This result shows that almost 25% of additional
energy reduction can be achievable when proposed algorithm
is used. The result of task partitioning using both methods is
shown in Fig. 8 when the total number of code section is 30.

Hong[7]
122.16
100.35
88.12
73.11
67.58 *
69.50
73.31
77.35

Energy consumption : H.264 software

In Fig.7, Inter represents the case that inter-task DVS


is used. Note that 60 times of voltage scaling with proposed method has almost the same energy reduction with
120 times of voltage scaling with conventional method. This
result proves that the voltage switching has performed more
efciently in proposed method.

[1] O. Trescases and W. Ng. Variable Output, Soft-Switching DC/DC Converter for VLSI Dynamic Voltage Scaling Power Supply Applications,
PESC, 2004
[2] T.Simunic, L.Benini and G. DeMicheli Cycle-accurate simulation of
energy consuption in embedded systems,DAC,1999
[3] S.Lee and T.Sakurai. Run-time Voltage Hopping for Low-power RealTime Systems, DAC, 2000
[4] A. Azevedo, I. Issenin, R. Cornea, R. Gupta, N. Dutt, A. Veidenbaum, and
A. Nicolau, Profile-Based Dynamic Voltage Scheduling Using Program
Checkpoints, DATE, 2002.
[5] J. Seo, T. Kim, and K. Chung, Profile-Based Optimal Intra-Task Voltage
Scheduling for Hard Real-Time Applications, DAC, 2004.
[6] D. Shin, S. Lee, J. Kim Intra-Task voltage scheduling for low-energy
hard real-time applications, IEEE Design & Test of computers.2001.
[7] S. Hong, S. Yoo, H. Jin, K. Choi, J. Kong, S. Eo, Runtime DistributionAware Dynamic Voltage Scaling, ICCAD, 2006.
[8] ARM SoC Designer, available at http://www.arm.com/product/DevTools
/MaxSim.html

