
IEEE TRANSACTION ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 1, JANUARY 2011

Program Phase-Aware Dynamic Voltage Scaling Under Variable Computational Workload and Memory Stall Environment

Jungsoo Kim, Student Member, IEEE, Sungjoo Yoo, Member, IEEE, and Chong-Min Kyung, Fellow, IEEE

by program phase behavior and runtime distribution. Dynamism

of the two characteristics often makes the design-time workload

prediction difficult and inefficient. Especially, memory stall time

whose variation is significant in memory-bound applications has

been mostly neglected or handled in a too simplistic manner

in previous works. In this paper, we present a novel online

dynamic voltage and frequency scaling (DVFS) method which

takes into account both program phase behavior and runtime

distribution of memory stall time, as well as computational

workload. The online DVFS problem is addressed in two ways:

intraphase workload prediction and program phase detection.

The intraphase workload prediction is to predict the workload

based on the runtime distribution of computational workload

and memory stall time in the current program phase. The

program phase detection is to identify to which program phase

the current instant belongs and then to obtain the predicted

workload corresponding to the detected program phase, which

is used to set voltage and frequency during the program phase.

The proposed method considers leakage power consumption as

well as dynamic power consumption by a temperature-aware

combined Vdd /Vbb scaling. Compared to a conventional method,

experimental results show that the proposed method provides

up to 34.6% and 17.3% energy reduction for two multimedia

applications, MPEG4 and H.264 decoder, respectively.

Index Terms: Dynamic voltage and frequency scaling (DVFS), energy optimization, memory stall, phase, runtime distribution.

I. Introduction

DYNAMIC voltage and frequency scaling (DVFS) is one of the most effective methods for lowering energy consumption. DVFS is used to suppress the leakage energy by a dynamic control of supply voltage ($V_{dd}$) and body bias voltage ($V_{bb}$). Accurate prediction of remaining workload (hereafter, workload prediction) plays a central role in DVFS where the

Manuscript received March 15, 2010; accepted July 27, 2010. Date of

current version December 17, 2010. This work was supported in part by

the National Research Foundation of Korea Grant funded by the Korean

Government, under Grant 2010-0000823, and the Brain Korea 21 Project,

the School of Information Technology, Korea Advanced Institute of Science

and Technology in 2010. This paper was recommended by Associate Editor

H.-H. S. Lee.

J. Kim and C.-M. Kyung are with the Korea Advanced Institute of

Science and Technology, Daejeon 305-701, South Korea (e-mail: jungsoo.kim83@gmail.com; kyung@ee.kaist.ac.kr).

S. Yoo is with the Pohang University of Science and Technology, Pohang

790-784, South Korea (e-mail: sungjoo.yoo@postech.ac.kr).

Color versions of one or more of the figures in this paper are available

online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCAD.2010.2068630

workload to time-to-deadline.

The workload of a software program varies due to data dependency (e.g., loop counts), control dependency (e.g., if/else, switch/case statement), and architectural dependency [e.g., cache hit/miss, translation lookaside buffer (TLB) hit/miss, and so on]. To tackle the workload variation, extensive works have been proposed [9]–[13], [19] assuming that workload (i.e., elapsed number of clock cycles seen by processor) is invariant to processor frequency scaling. However, the assumption is not appropriate for applications having significant memory accesses. Fig. 1(a) shows the distribution of per-frame workload of MPEG4 decoder at two different frequency levels, i.e., 1 and 2 GHz. It is obtained from decoding 3000 frames of a 1920 × 800 movie clip (an excerpt from Dark Knight) on an LG XNOTE LW25 laptop.¹ As shown in Fig. 1(a), the workload increases as processor frequency increases. This is due to the processor stall cycles spent while waiting for data from external memory (e.g., SDRAM, SSD, and so on). For example, when the memory access time is 100 ns, each off-chip memory access takes 100 and 200 processor clock cycles at 1 GHz and 2 GHz, respectively. Since the memory access time, called memory stall time, is invariant to processor clock frequency, the number of processor clock cycles spent for memory access grows as the clock frequency increases.

To consider memory stall time in clock frequency scaling, [4]–[6] present DVFS methods which set the clock frequency of the processor based on the decomposition of the whole workload into two clock frequency-invariant workloads: computational and memory stall workloads. Computational workload is the number of clock cycles spent for instruction execution, and memory stall workload corresponds to memory stall time. Based on the decomposed workloads, previous methods set the clock frequency, $f$, as $f = w^{comp}/(t_d^R - t^{stall})$, where $w^{comp}$ and $t^{stall}$ represent average (or worst-case) computational workload and memory stall time, respectively, and $t_d^R$ is the time-to-deadline.

Generally, computational workload and memory stall time have distributions as shown in Fig. 1(b) and (c). Fig. 1(b) shows the distribution of computational workload caused by data, control, and architectural dependency. Distribution of memory stall time results from L2 cache hit/miss, page hit/miss, and interference (e.g., memory access scheduling [1]) in accessing DRAM. As the distribution of memory stall workload becomes significant, previous DVFS methods based on average (or worst-case) memory stall workload become inappropriate in reducing energy consumption.

Long-running software programs are mostly characterized by nonstationary phase behavior [14], [15]. For example, multimedia programs (e.g., MPEG4 and H.264 CODEC) have distinct time durations whose workload characteristics (e.g., mean, standard deviation, and max value of runtime) are clearly different from other time durations. We call each such distinct time duration a program phase [14], [15]. A formal definition of program phase will be given later in Section VIII. Fig. 1(d) exemplifies the program phase behavior of MPEG4 decoder when decoding the first 1000 frames of the movie clip excerpted from Dark Knight. The x-axis and the left-hand side y-axis represent frame index and per-frame decoding cycles, respectively. The right-hand side y-axis represents program phase index. Note that the program phase index does not correspond to the required performance level of the corresponding program phase in this example. As shown in Fig. 1(d), the entire time for decoding 1000 frames is classified into nine program phases, and, within a program phase, the per-frame decoding cycle has a runtime distribution. Fig. 1(e) shows runtime distributions of three representative program phases out of the nine to illustrate that there can be a wide runtime distribution within each program phase.

Fig. 1. Workload of MPEG4 decoder for the Dark Knight movie clip. (a) Total workload at 1 and 2 GHz. (b) Computational workload. (c) Memory stall time. (d) Phase behavior in the per-frame workload. (e) Runtime distributions of three representative phases.

¹ The LG XNOTE LW25 laptop consists of a 2 GHz Intel Core2Duo T7200 processor with 128 KB L1 instruction and data cache, 4 MB shared L2 cache, and 667 MHz 2 GB DDR2 SDRAM.

0278-0070/$26.00 © 2010 IEEE

KIM et al.: PROGRAM PHASE-AWARE DYNAMIC VOLTAGE SCALING UNDER VARIABLE COMPUTATIONAL WORKLOAD

A. Our Approach

Our observation on the runtime characteristics of software

program suggests that, as shown in Fig. 1, the program workload has two characteristics: nonstationary program phase behavior and runtime distribution (even within a program phase)

of computational workload and memory stall time. Based on

the observations above, this paper presents an online DVFS

method that tackles the characteristics of program workload in

order to minimize the average energy consumption of software

program. We address the online DVFS problem in two ways:

intraphase workload prediction and program phase detection.

The intraphase workload prediction predicts workloads based

on the runtime distribution of computational workload and

memory stall time in the current program phase. The program

phase detection identifies to which program phase the current

instant belongs and then obtains the intraphase workload

prediction of the corresponding program phase, which is used

to set voltage and frequency during the program phase.

Leakage power consumption often dominates total power

consumption especially at high temperature. Our method tackles leakage power consumption with a temperature-aware combined Vdd /Vbb scaling. During runtime, based on temperature

readings as well as the runtime distribution, the online method

selects a set of appropriate Vdd and Vbb corresponding to

frequency level from the solution table (which was prepared

during design time).

This paper is organized as follows. Section II reviews related

works. Section III gives preliminaries on our energy model and

profiling method. Section IV presents the problem definition

and solution overview, followed by analytical formulation of

our problem in Section V. Sections VI and VII explain the proposed runtime distribution-aware DVFS. Section VIII presents

the program phase detection method. Section IX reports experimental results followed by the conclusion in Section X.

II. Related Works

There are a number of methods for workload prediction for online DVFS based on (weighted) average, maximum, or the most frequent workload, or on finding a repeated pattern among N recent workloads [2]. Recently, a control theory-based workload prediction method was proposed to accurately capture the transient behavior of workload [3]. To exploit memory stall time, [4] and [5] present memory stall time-aware DVFS for soft real-time intertask DVFS which lowers


ratio of external memory access per instruction to clock

cycle per instruction. However, these memory stall time-aware

DVFS methods are based on average memory stall time and

do not exploit the workload distribution and nonstationary

program phase behavior.

Runtime distribution in computational workload (in most cases, assuming a constant memory stall time) has been studied mostly in intratask DVFS methods where the performance level is set dynamically during the execution of a task. There are several intratask DVFS methods where workload is predicted based on the program execution paths, e.g., worst-case execution path [7], average-case execution path [8], and virtual execution path based on the probability of branch invocation [9]. An analytic workload prediction method which minimizes statistical average dynamic energy consumption is presented in [10]. A numerical solution for combined $V_{dd}/V_{bb}$ scaling to tackle leakage energy is presented in [11]. A DVFS method called accelerating frequency schedules, which considers the per-task runtime distribution for a set of independent tasks, is presented in [12] and [13]. All the works mentioned above assume constant memory stall time and a single program phase.

The program phase concept has been one of the hottest research issues because it allows new opportunities of performance optimization, e.g., program phase-aware dynamic adaptations of cache architecture [14], [15]. Various methods have been proposed to characterize program phase behavior. Among them, a vector of the average execution cycles of basic blocks, called basic block vector (BBV), is most widely used. By characterizing a program phase with BBV, one can apply the program phase concept to DVFS as in [16] and [17]. A new program phase is detected when two BBVs are significantly different, e.g., when the Hamming distance between two BBVs is larger than a pre-defined threshold value. Because there are a large number of basic blocks in typical software applications, program phase detection utilizing the BBV is usually impractical. Thus, the key issue is to reduce the dimensionality of the BBV by identifying a subset of basic blocks to represent the program phase behavior. A random linear projection method is described in [14] and [15] to reduce the effort of exploring all the combinations of basic blocks to identify the subset. In this paper, we present a program phase detection scheme suitable for DVFS purpose, based on the vectors of predicted workloads for coarse-grained code sections (instead of using BBV) as explained in Section VIII. In addition, unlike existing phase-based DVFS methods, our method exploits runtime distribution within each program phase to better predict the remaining workload.
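The BBV-based detection rule described above can be sketched as follows. This is only an illustration of the threshold test, not the detection scheme proposed in this paper; the function names and the tolerance parameter `tol` (used to decide whether two real-valued entries differ) are assumptions.

```python
from typing import Sequence


def hamming_distance(bbv_a: Sequence[float], bbv_b: Sequence[float],
                     tol: float = 0.0) -> int:
    """Count the basic blocks whose entries differ by more than tol."""
    if len(bbv_a) != len(bbv_b):
        raise ValueError("BBVs must have the same dimension")
    return sum(1 for a, b in zip(bbv_a, bbv_b) if abs(a - b) > tol)


def is_new_phase(prev_bbv: Sequence[float], cur_bbv: Sequence[float],
                 threshold: int, tol: float = 0.0) -> bool:
    """A new phase is flagged when the distance exceeds the threshold."""
    return hamming_distance(prev_bbv, cur_bbv, tol) > threshold
```

Because the distance is computed over every basic block, its cost grows with the number of blocks, which is the practicality concern raised in the text.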

Several online DVFS methods have been presented to utilize

the dynamic program behavior for further energy saving. [18]

presents a workload prediction method utilizing the Kalman

filter which captures time-varying workload characteristics

by adaptively reducing the prediction error via feedback.

We presented an online workload prediction method which

minimizes both dynamic and leakage energy consumption by

exploiting the program phase behavior and runtime distribution

of computational cycle within each program phase [19]. Based

on the assumption that memory stall time does not vary a

not considered. However, the memory stall time is simply

accounted for as an integral (nonseparable) part of the total

runtime of software program. However, in memory-bound applications where memory stall time becomes a significant portion of total program runtime, the distribution of memory stall

time needs to be exploited to achieve further energy reduction.

Compared to the method which sets voltage and frequency

based on average computational workload and memory stall

time during program runs [4], our method has three distinctive

features. First, our approach exploits runtime distribution of

both computational cycle and memory stall time, while only

the average values are assumed in [4]. Second, we exploited

program phase detection to achieve maximal reduction of

energy consumption, while [4] utilizes average workload of

whole program without the notion of program phase. Third, in our method, workload prediction is done in a temperature-adaptive manner to tackle the dependency of leakage energy on temperature, while the temperature dependence is ignored in [4].

III. Preliminary

A. Processor Energy Model

Energy consumption per cycle ($e$) consists of switching ($e_s$) and leakage ($e_l$) components. Additionally, in the deep submicron regime, $e_l$ is further divided into subthreshold ($e_l^{sub}$), gate ($e_l^{gate}$), and junction ($e_l^{junc}$) leakage energy. Putting them all together, we can express the total energy consumption per cycle as follows [20], [21]:

$$e = C_{eff} V_{dd}^2 + N_g f^{-1} \left( V_{dd} K_1 \exp(K_2 V_{dd}) \exp(K_3 V_{bb}) + V_{dd} K_4 \exp(K_5 V_{dd}) + |V_{bb}| I_j \right) \tag{1}$$

where $C_{eff}$ and $N_g$ are the effective capacitance and effective number of gates of the target processor, respectively. $K_1$, $K_2$, $K_3$, $K_4$, $K_5$, and $I_j$ are process-dependent curve-fitting parameter sets for $e_l^{sub}$, $e_l^{gate}$, and $e_l^{junc}$, respectively. In particular, the values of $K_1$, $K_2$, $K_3$ are functions of operating temperature ($T$), since $e_l^{sub}$ increases exponentially as the operating temperature increases. According to the BSIM4 model and [21], the temperature dependence of the parameters ($K_1$, $K_2$, and $K_3$) is modeled as follows:

$$K_1(T) = \left( \frac{T}{T_{ref}} \right)^2 \exp\left( \frac{K_6}{T_{ref}} \left( 1 - \frac{T_{ref}}{T} \right) \right) K_1(T_{ref}) \tag{2}$$

$$K_2(T) = \frac{T_{ref}}{T} K_2(T_{ref}) \tag{3}$$

$$K_3(T) = \frac{T_{ref}}{T} K_3(T_{ref}) \tag{4}$$

where $T_{ref}$ is the reference temperature and $K_6$ is a curve-fitting parameter. Thus, $K_1$, $K_2$, $K_3$ at temperature $T$ can be obtained from the values at $T_{ref}$ using the relationship in (2)–(4).

Since the temperature-aware energy model shown in (1)–(4) is too complicated to be used in our optimization, we adopted a simplified energy model of combined $V_{dd}/V_{bb}$ scaling to approximate the energy consumption per cycle at each temperature $T$ as follows:

$$e(f, T) \approx a_s(T) f^{b_s(T)} + a_l(T) f^{b_l(T)} + c(T) \tag{5}$$


TABLE I
Energy Fitting Parameters for Approximating the Processor Energy Consumption to the Accurate Estimation Obtained from PTscalar with BPTM High-k/Metal Gate 32 nm HP Model for Different Temperatures, Along with the Corresponding (Maximal and Average) Errors

| Temperature (°C) | $a_s$ | $b_s$ | $a_l$ | $b_l$ | $c$ | Maximum [Avg] Error (%) |
|---|---|---|---|---|---|---|
| 25 | $1.2 \times 10^{-1}$ | 1.3 | $4.6 \times 10^{-9}$ | 20.5 | 0.11 | 2.8 [0.9] |
| 50 | $1.2 \times 10^{-1}$ | 1.3 | $2.0 \times 10^{-7}$ | 16.6 | 0.12 | 1.4 [0.5] |
| 75 | $1.2 \times 10^{-1}$ | 1.3 | $2.2 \times 10^{-6}$ | 14.2 | 0.14 | 1.4 [0.4] |
| 100 | $1.2 \times 10^{-1}$ | 1.3 | $1.4 \times 10^{-5}$ | 12.4 | 0.15 | 1.7 [0.7] |


where $a_s(T)$, $b_s(T)$ and $a_l(T)$, $b_l(T)$ are sets of curve-fitting parameters which model the frequency-dependent portion in $e_s(T)$ and $e_l(T)$, respectively. $c(T)$ is a curve-fitting parameter corresponding to the amount of frequency-independent energy portion in $e(f, T)$. Table I shows examples of fitting parameters which approximate the processor energy consumption obtained from PTscalar [21] and Cacti5.3 [22] with the energy model, i.e., (1)–(4), for the Berkeley predictive technology model (BPTM) high-k/metal gate 32 nm HP model [23] at 25 °C, 50 °C, 75 °C, and 100 °C. In the modeling, we configured a target processor in PTscalar as the best-effort estimate of a Core 2-class microarchitecture using the parameters presented in [24]. As Table I shows, the simplified energy model tracks the original energy model within 2.8% of maximum error for all the operating temperatures. Note that the fitting parameters for modeling switching energy consumption, i.e., $a_s$ and $b_s$, are unchanged as temperature varies because switching energy consumption is temperature invariant.

Processor energy consumption depends on the type of instructions executed in the pipeline path [25]. To consider the energy dependence on instructions in a simple way, we classify processor operation into two states: a computational state for executing instructions and a memory stall state mostly spent waiting for data from memory. When a processor is in the memory stall state, switching energy consumption can be suppressed using clock gating while leakage energy consumption is almost the same as in the computational state. The reduction ratio of switching energy, called clock gating fraction denoting the fraction of the clock-gated circuit, is modeled as $\beta$ (0.1 in our experiments). Thus, energy consumption per clock cycle in each processor state can be calculated as follows:

$$e^{comp} = a_s f^{b_s} + a_l f^{b_l} + c \tag{6}$$

$$e^{stall} = \beta a_s f^{b_s} + a_l f^{b_l} + c \tag{7}$$

where $e^{comp}$ and $e^{stall}$ represent energy consumption per cycle in the computational and memory stall state, respectively. Given a desired frequency level ($f$), one can always find a pair of $V_{dd}$ and $V_{bb}$ that gives minimum energy consumption per cycle using the combined $V_{dd}/V_{bb}$ scaling [11].
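As a rough illustration of the simplified energy model, the following Python sketch evaluates the per-cycle energy in the two processor states. The class and function names are hypothetical, the stall-state scaling factor `beta` is an assumed name for the clock gating fraction, and the example fit values are only one possible reading of the 25 °C row of Table I; the frequency and energy units are not specified here and are left to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnergyFit:
    """Curve-fitting parameters of the simplified model at one temperature."""
    a_s: float
    b_s: float
    a_l: float
    b_l: float
    c: float


def energy_per_cycle(f: float, fit: EnergyFit, stalled: bool = False,
                     beta: float = 0.1) -> float:
    """Switching + leakage + base energy per cycle at frequency f.

    In the memory stall state the switching term is scaled by beta
    (assumed clock gating factor); leakage and base terms are unchanged.
    """
    switching = fit.a_s * f ** fit.b_s
    if stalled:
        switching *= beta
    leakage = fit.a_l * f ** fit.b_l
    return switching + leakage + fit.c


# Assumed reading of the 25 degC row of Table I (illustrative only).
FIT_25C = EnergyFit(a_s=0.12, b_s=1.3, a_l=4.6e-9, b_l=20.5, c=0.11)
```

With these assumed values, `energy_per_cycle(1.0, FIT_25C)` is dominated by the switching and base terms, while `energy_per_cycle(1.0, FIT_25C, stalled=True)` drops most of the switching contribution.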

B. Runtime Workload Profiling

The total number of processor execution cycles, $x$, can be expressed as a sum of the number of clock cycles for executing instructions in a processor, $x^{comp}$, and that of stall cycles for memory access, $x^{stall}$, which is a function of memory stall time, $t^{stall}$, and frequency, $f$, as follows:

$$x = x^{comp} + x^{stall} = x^{comp} + f \cdot t^{stall}. \tag{8}$$

To obtain the two clock frequency-invariant components, i.e., $x^{comp}$ and $t^{stall}$, during program runs, we adopt an online profiling method which uses performance counters in a processor as presented in [4]. We model $t^{stall}$ using only the number of last-level cache misses ($N_{L2\_miss}$; in our experiment, L2 is the last-level cache). The rationale of modeling $t^{stall}$ only with $N_{L2\_miss}$ is twofold. First, the effect of last-level cache misses dominates the others (TLB miss, interrupts, and so on) according to our experiment. Second, the number of events simultaneously monitored in a processor is usually limited (in our experimental platform, two events). In our model, $t^{stall}$ is expressed as follows:

$$t^{stall} = a_p N_{L2\_miss} + b_p \tag{9}$$

where $a_p$ and $b_p$ are fitting parameters. Equation (9) (solid line) tracks the measured memory stall time (dots) quite well when running the H.264 decoder program in FFMPEG [29].
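The profiling step can be sketched in Python as below. The ordinary least-squares fit is an assumption about how $a_p$ and $b_p$ might be obtained (the fitting procedure is not stated here), and the function names are hypothetical. The decomposition follows (8) and (9) directly.

```python
from typing import Sequence, Tuple


def fit_stall_model(miss_counts: Sequence[float],
                    stall_times: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of t_stall = a_p * N_L2_miss + b_p (assumed method)."""
    n = len(miss_counts)
    if n < 2 or n != len(stall_times):
        raise ValueError("need at least two paired samples")
    mean_x = sum(miss_counts) / n
    mean_y = sum(stall_times) / n
    sxx = sum((x - mean_x) ** 2 for x in miss_counts)
    if sxx == 0:
        raise ValueError("miss counts must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(miss_counts, stall_times))
    a_p = sxy / sxx
    b_p = mean_y - a_p * mean_x
    return a_p, b_p


def decompose(x_total: float, n_l2_miss: float, freq_hz: float,
              a_p: float, b_p: float) -> Tuple[float, float]:
    """Split measured cycles into (x_comp, t_stall) using (9) and then (8)."""
    t_stall = a_p * n_l2_miss + b_p
    x_comp = x_total - freq_hz * t_stall
    return x_comp, t_stall
```

Only two quantities are needed per code section (total cycles and last-level cache misses), which matches the two-event limit of the experimental platform mentioned above.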

In a typical software program, $x^{comp}$ and $t^{stall}$ obtained from running a code section are correlated with each other. This is because $t^{stall}$ of a code section is proportional to the number of external memory references, which is highly correlated with the number of executed memory instructions in a code section, e.g., load and store. $x^{comp}$ of a code section depends on the type and number of executed instructions including memory instructions. To consider the correlation between computational cycle ($x^{comp}$) and memory stall time ($t^{stall}$), we model the distribution of $x^{comp}$ and $t^{stall}$ of a code section using a joint probability density function (PDF) as shown in Fig. 3. During runtime, the joint PDF is obtained as follows. After the execution of a code section, $t^{stall}$ is obtained from (9). Then, from (8), $x^{comp}$ is calculated with $x$ and $t^{stall}$. The probability of occurrence of a pair of $x^{comp}$ and $t^{stall}$ is defined as the ratio of the number of occurrences of the pair to the total number of executions of the code section.
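The empirical joint PDF described above can be sketched as a two-dimensional histogram. Grouping the pairs into fixed-width bins is an assumption made here so that nearby measurements count as the same pair; the class name and bin widths are hypothetical.

```python
from collections import Counter
from typing import Tuple


class JointPdf:
    """Empirical joint distribution of (x_comp, t_stall) for one code section."""

    def __init__(self, comp_bin: float, stall_bin: float) -> None:
        self.comp_bin = comp_bin      # bin width for x_comp (cycles)
        self.stall_bin = stall_bin    # bin width for t_stall (seconds)
        self._counts: Counter = Counter()
        self._total = 0

    def _key(self, x_comp: float, t_stall: float) -> Tuple[int, int]:
        return int(x_comp // self.comp_bin), int(t_stall // self.stall_bin)

    def update(self, x_comp: float, t_stall: float) -> None:
        """Record one execution of the code section."""
        self._counts[self._key(x_comp, t_stall)] += 1
        self._total += 1

    def probability(self, x_comp: float, t_stall: float) -> float:
        """Occurrences of the pair's bin divided by the number of executions."""
        if self._total == 0:
            return 0.0
        return self._counts[self._key(x_comp, t_stall)] / self._total
```

A call to `update` after every execution of the code section keeps the distribution current at constant cost per update.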


Fig. 3. Joint PDF of computational cycle ($x^{comp}$) and memory stall time ($t^{stall}$).

Fig. 4. Solution inputs. (a) Software program (or source code) partitioned into program regions. (b) Energy model (in terms of energy-per-cycle) as a function of frequency. (c) f-v table storing the energy-optimal pairs, ($V_{dd}$, $V_{bb}$), for N frequency levels.

Fig. 4 illustrates three types of input required in the

proposed procedure. Fig. 4(a) shows a software program

partitioned into program regions each shown as a box. A

program region is defined as a code section with associated

voltage/frequency setting. The partition can be performed

manually by a designer or via an automatic tool [26] based

on execution cycles of code sections obtained by a priori

simulation of the software program. The ith program region is

denoted as ni while the first and the last program region are

called root (nroot ) and leaf (nleaf ) program region, respectively.

In this paper, we simply focused on a software program which

periodically runs from nroot to nleaf at every time interval. At

the start of a program region, voltage/frequency is set and

maintained until the end of the program region. At the end of

a program region, computational cycle and memory stall time

are profiled. Then, as explained in Section III-B, the joint PDF of computational cycle and memory stall time is updated as shown in Fig. 3. Fig. 4(b) shows an energy model (more specifically, energy-per-cycle vs. frequency). Fig. 4(c) shows a pre-characterized table called the f-v table in which the energy-optimal pair, ($V_{dd}$, $V_{bb}$), is stored for each frequency level ($f$). When the frequency is scaled, $V_{dd}$ and $V_{bb}$ are adjusted to the corresponding level stored in the table. Note that, due to the dependency of leakage energy on temperature, energy-optimal values of ($V_{dd}$, $V_{bb}$) corresponding to $f$ vary depending on the operating temperature. Therefore, we prepare an f-v table for each of a set of quantized temperature levels.
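A temperature-indexed f-v table lookup could look like the following sketch. Every number in `FV_TABLES` is a made-up placeholder, not a value from this paper, and rounding the temperature reading up to the next tabulated level is an assumption chosen here as the conservative option.

```python
import bisect
from typing import Dict, Iterable, Tuple

# Hypothetical tables: temperature level (degC) -> {frequency (GHz): (Vdd, Vbb)}.
# All voltages are placeholders for illustration only.
FV_TABLES: Dict[int, Dict[float, Tuple[float, float]]] = {
    25: {1.0: (0.80, -0.40), 2.0: (1.00, -0.20)},
    75: {1.0: (0.85, -0.50), 2.0: (1.05, -0.30)},
}


def quantize_temperature(temp_c: float, levels: Iterable[int]) -> int:
    """Pick the next tabulated level at or above temp_c, clamped to the hottest."""
    ordered = sorted(levels)
    idx = bisect.bisect_left(ordered, temp_c)
    return ordered[min(idx, len(ordered) - 1)]


def lookup_voltages(freq_ghz: float, temp_c: float,
                    tables: Dict[int, Dict[float, Tuple[float, float]]] = FV_TABLES
                    ) -> Tuple[float, float]:
    """Return the stored (Vdd, Vbb) pair for a frequency level and temperature."""
    level = quantize_temperature(temp_c, tables.keys())
    return tables[level][freq_ghz]
```

Because the tables are prepared at design time, the runtime cost is a single lookup per voltage/frequency change.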

Algorithm 1: Overall flow of the proposed method
 1: if (end of n_i) then
 2:   Online profiling and calculation of statistics (Section III-B)
 3:   if (n_i == n_leaf) then
 4:     iter++
 5:     if ((iter % PHASE_UNIT) == 0) then
 6:       for from n_leaf to n_root do
 7:         Workload prediction for each energy component (Section VI)
 8:       end for
 9:       Program phase detection (Section VIII)
10:     end if
11:   end if
12: else if (start of n_i) then
13:   Finding workload of n_i based on coordination (Section VII)
14:   Voltage/frequency scaling with feasibility check
15: end if

workload prediction, i.e., $w_i^{opt}$, of each program region during program execution. Algorithm 1 shows the overall flow of the proposed method. The proposed method is largely divided into a workload prediction step (lines 1–11) and a voltage/frequency (v/f) setting step (lines 12–15), which are invoked at the end and the start of every program region, respectively.

In the workload prediction step, we profile runtime information, i.e., $x_i^{comp}$ and $t_i^{stall}$, and update the statistical parameters of the runtime distributions, e.g., mean, standard deviation, and skewness of $x_i^{comp}$ and $t_i^{stall}$ (lines 1–2). After the completion of the leaf program region, the number of program runs, i.e., iter, is increased (line 4). At every PHASE_UNIT program runs (line 5), where PHASE_UNIT is the predefined number of program runs (e.g., 20-frame decoding in MPEG4), we perform the workload prediction and program phase detection by utilizing the profiled runtime information and its statistical parameters (lines 5–10). The periodic workload prediction is performed in the reverse order of program flow as presented in [10], [11], and [19], i.e., from the end ($n_{leaf}$) to the beginning ($n_{root}$) of a program (lines 6–8). As will be explained in Sections V and VI, in this step, we find local-optimal workload predictions of $n_i$, each of which minimizes one energy component, instead of total energy.² By utilizing the local-optimal workload predictions, the program phase detection is performed to identify which program phase the current instant belongs to (line 9).

In the v/f setting step (lines 12–15), which is performed at the start of each program region, a process called coordination determines the energy-optimal global workload prediction, $w_i^{opt}$, with the combination of the local-optimal workload predictions of the detected program phase (line 13). Based on $w_i^{opt}$, we set voltage/frequency while satisfying the hard real-time constraint (line 14).
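The control flow of Algorithm 1 can be sketched as a small Python class with the individual steps passed in as callbacks. The class name, method names, and callback signatures are hypothetical; the bodies of profiling, prediction, phase detection, coordination, and voltage/frequency setting are deliberately left to the caller.

```python
from typing import Callable


class DvfsController:
    """Skeleton of Algorithm 1; regions are indexed 0 (root) to num_regions-1 (leaf)."""

    def __init__(self, num_regions: int, phase_unit: int,
                 profile: Callable[[int], None],
                 predict_local: Callable[[int], None],
                 detect_phase: Callable[[], None],
                 coordinate: Callable[[int], float],
                 set_vf: Callable[[int, float], None]) -> None:
        self.num_regions = num_regions
        self.phase_unit = phase_unit
        self.profile = profile
        self.predict_local = predict_local
        self.detect_phase = detect_phase
        self.coordinate = coordinate
        self.set_vf = set_vf
        self.iter = 0

    def on_region_end(self, i: int) -> None:
        self.profile(i)                                 # line 2
        if i == self.num_regions - 1:                   # line 3: leaf region
            self.iter += 1                              # line 4
            if self.iter % self.phase_unit == 0:        # line 5
                for j in range(self.num_regions - 1, -1, -1):   # lines 6-8
                    self.predict_local(j)               # leaf to root
                self.detect_phase()                     # line 9

    def on_region_start(self, i: int) -> None:
        w_opt = self.coordinate(i)                      # line 13
        self.set_vf(i, w_opt)                           # line 14
```

Passing the steps as callbacks keeps the periodic structure (every `phase_unit` program runs) separate from the numerical details of Sections VI to VIII.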

V. Analytical Formulation of Memory Stall Time-Aware DVFS

Assume that a program is partitioned into two program

regions, i.e., ni and ni+1 , and that each program region has

2 In this paper, total energy consumption is calculated as the sum of the five

independent energy components as shown in (11).


energy model presented in Section III-A is used. The total

energy consumption for running the two program regions, Ei ,

is calculated as follows:

$$E_i = E_i^{comp} + E_i^{stall} \tag{10}$$

where $E_i^{comp}$ and $E_i^{stall}$ represent the energy consumption for running computational workload and memory stall workload, respectively.

$E_i^{comp}$ and $E_i^{stall}$, respectively, consist of three independent energy components: frequency-dependent switching energy ($Es_i^{comp}$ and $Es_i^{stall}$), frequency-dependent leakage energy ($El_i^{comp}$ and $El_i^{stall}$), and frequency-independent energy called base energy ($Eb_i^{comp}$ and $Eb_i^{stall}$, where $Eb_i = Eb_i^{comp} + Eb_i^{stall}$). Thus, $E_i$ is expressed as follows:

$$E_i = (Es_i^{comp} + El_i^{comp}) + (Es_i^{stall} + El_i^{stall}) + Eb_i. \tag{11}$$

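Using (6)–(8), the five energy components in (11) are expressed as follows:

$$Es_i^{comp} = a_s f_i^{b_s} x_i^{comp} + a_s f_{i+1}^{b_s} x_{i+1}^{comp} \tag{12}$$

$$El_i^{comp} = a_l f_i^{b_l} x_i^{comp} + a_l f_{i+1}^{b_l} x_{i+1}^{comp} \tag{13}$$

$$Es_i^{stall} = \beta \left( a_s f_i^{b_s} f_i t_i^{stall} + a_s f_{i+1}^{b_s} f_{i+1} t_{i+1}^{stall} \right) \tag{14}$$

$$El_i^{stall} = a_l f_i^{b_l} f_i t_i^{stall} + a_l f_{i+1}^{b_l} f_{i+1} t_{i+1}^{stall} \tag{15}$$

$$Eb_i = c \left( x_i^{comp} + x_{i+1}^{comp} + f_i t_i^{stall} + f_{i+1} t_{i+1}^{stall} \right) \tag{16}$$

where $\beta$ denotes the clock gating fraction introduced in Section III-A.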

The frequency of each program region, $f_i$ and $f_{i+1}$, can be expressed as the ratio of the remaining computational workload prediction ($w_i$ and $w_{i+1}$) to the remaining time-to-deadline prediction for running the computational workload, i.e., total remaining time-to-deadline ($t_i^R$ and $t_{i+1}^R$) minus remaining memory stall time prediction ($s_i$ and $s_{i+1}$), as shown in

$$f_i = \frac{w_i}{t_i^R - s_i} \tag{17}$$

$$f_{i+1} = \frac{w_{i+1}}{t_{i+1}^R - s_{i+1}}. \tag{18}$$

$t_{i+1}^R$ in (18) is expressed as follows:

$$t_{i+1}^R = t_i^R - \frac{x_i^{comp}}{f_i} - t_i^{stall}. \tag{19}$$

By replacing $f_i$ and $t_{i+1}^R$ with (17) and (19), $f_{i+1}$ in (18) is rearranged as follows:

$$f_{i+1} = \frac{w_{i+1}}{(t_i^R - s_i)\,\theta_i} \tag{20}$$

where

$$\theta_i = 1 - \frac{x_i^{comp}}{w_i} - \frac{\Delta t_i^{stall}}{t_i^R - s_i} \tag{21}$$

$$\Delta t_i^{stall} = (t_i^{stall} + s_{i+1}) - s_i. \tag{22}$$
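The step from (17)–(19) to (20) can be traced explicitly. In the sketch below, $\theta_i$ and $\Delta t_i^{stall}$ are names assumed for the auxiliary quantities of (21) and (22); the algebra itself uses only (17), (18), and (19).

```latex
\begin{aligned}
t_{i+1}^{R} - s_{i+1}
  &= t_i^{R} - \frac{x_i^{comp}}{f_i} - t_i^{stall} - s_{i+1}
     && \text{by (19)} \\
  &= (t_i^{R} - s_i) - \frac{x_i^{comp}}{w_i}\,(t_i^{R} - s_i)
     - \bigl(t_i^{stall} + s_{i+1} - s_i\bigr)
     && \text{by (17)} \\
  &= (t_i^{R} - s_i)\left(1 - \frac{x_i^{comp}}{w_i}
     - \frac{\Delta t_i^{stall}}{t_i^{R} - s_i}\right)
   = (t_i^{R} - s_i)\,\theta_i
\end{aligned}
```

Substituting this denominator into (18) gives (20). When $\Delta t_i^{stall} = 0$, the factor reduces to $\theta_i = 1 - x_i^{comp}/w_i$, which no longer depends on $t_i^R$.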

When the memory stall times of $n_i$ and $n_{i+1}$, i.e., $t_i^{stall}$ and $t_{i+1}^{stall}$, are unit functions, the remaining memory stall time prediction is set to the sum of the memory stall times of the remaining program regions, i.e., $s_i = t_i^{stall} + t_{i+1}^{stall}$. In the same manner, $s_{i+1}$ is set to $t_{i+1}^{stall}$ because $n_{i+1}$ is the leaf, i.e., last, program region in this case. Therefore, $\Delta t_i^{stall}$ in (22) becomes zero, thereby, $\theta_i$

from leaf to root program region as presented in [10], $w_{i+1}$ is already known as $w_{i+1}^{opt}$ when calculating $w_i$. With (17)–(20), (12)–(16) can be expressed as functions of $w_i$ and $t_i^R$.

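Since $E_i$ is continuous and convex with respect to $w_i$, the energy-optimal workload prediction of computational workload, i.e., $w_i^{opt}$, can be obtained by finding a point which satisfies the following relation:

$$\frac{\partial E_i}{\partial w_i} = \frac{\partial Es_i^{comp}}{\partial w_i} + \frac{\partial El_i^{comp}}{\partial w_i} + \frac{\partial Es_i^{stall}}{\partial w_i} + \frac{\partial El_i^{stall}}{\partial w_i} + \frac{\partial Eb_i}{\partial w_i} = 0. \tag{23}$$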

Since (23) is a function of $w_i$ as well as $t_i^R$, $w_i^{opt}$ satisfying (23) varies with respect to $t_i^R$. In other words, $w_i^{opt}$ has to be found for every $t_i^R$. Because $t_i^R$ has a wide range of values, performing a workload prediction for every value of $t_i^R$ is unrealistic. Therefore, we proposed a solution which performs a workload prediction for a set of quantized levels of $t_i^R$ [28]. However, it also requires a lot of workload predictions since more energy savings can be obtained as $t_i^R$ is quantized into a larger number of quantization levels. Thus, the method causes a large runtime overhead if it is applied as the online solution while maintaining its effectiveness (according to our experiment, the runtime overhead is 3.4 times larger than the pure runtime for H.264 decoder when $t_i^R$ is quantized into 30 levels).

To reduce the runtime overhead of finding an energy-optimal workload prediction, we propose a workload prediction method which finds $w_i^{opt}$ in two steps: 1) workload prediction which minimizes each energy component, called local-optimal workload prediction (in Section VI), and 2) coordination of the local-optimal workload predictions to obtain the global workload prediction $w_i^{opt}$ (in Section VII).

A local-optimal workload prediction is to find the workload prediction which minimizes each of the five energy components in (11) by adjusting voltage/frequency based on the workload prediction. For example, voltage/frequency scaling based on the local-optimal workload prediction of $Es_i^{comp}$ minimizes only the energy consumption of $Es_i^{comp}$. It can be obtained by finding the point which equates the single derivative of (23) to zero, i.e., $\partial Es_i^{comp}/\partial w_i = 0$. Note that a local-optimal workload prediction can be calculated independently of $t_i^R$, because $\theta_i$ in (21) is independent of $t_i^R$ ($\Delta t_i^{stall} = 0$).

A coordination of the local-optimal workload predictions is to find the workload prediction which minimizes $E_i$ by utilizing the five local-optimal workload predictions. When the derivative of one energy component with respect to $w_i$ dominates the others in (23), the workload prediction which satisfies (23) can be obtained by finding a point where the derivative of the dominant energy component becomes zero. For instance, when $\partial Es_i^{comp}/\partial w_i$ dominates the others, $w_i^{opt}$ is simply set to $ws_i^{comp}$. When there are multiple dominant energy components, we need to coordinate them so as to find the workload prediction with lower total energy consumption. Finding the workload prediction [satisfying (23)] requires a numerical solution whose complexity is too high to be applied during runtime, as presented in [28]. In this paper, we present an efficient approach to coordinate local-optimal workload


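where $\bar{x}_i^{comp}$ and $\bar{x}_{i+1}^{comp}$ are the means of $x_i^{comp}$ and $x_{i+1}^{comp}$, respectively. Note that $x_i^{comp}$ is fixed as $\bar{x}_i^{comp}$ since $J_i$ is a unit function in this case. $w_{i+1}^{opt}$ is replaced by $ws_{i+1}^{comp}$ since we perform the local-optimal workload prediction of $Es_i^{comp}$. Since $Es_i^{comp}$ is continuous and convex on $w_i$, $ws_i^{comp}$ can be obtained by finding a point which satisfies

$$\frac{\partial \overline{Es}_i^{comp}}{\partial w_i} = \frac{a_s b_s w_i^{b_s-1}}{(t_i^R - s_i)^{b_s}} \left( \bar{x}_i^{comp} - \frac{(ws_{i+1}^{comp})^{b_s}\,\bar{x}_{i+1}^{comp}\,\bar{x}_i^{comp}}{(w_i - \bar{x}_i^{comp})^{b_s+1}} \right) = 0. \qquad (25)$$

Solving (25) gives $ws_i^{comp}$ in a closed-form expression as follows:

$$ws_i^{comp} = \bar{x}_i^{comp} + \left( (ws_{i+1}^{comp})^{b_s}\,\bar{x}_{i+1}^{comp} \right)^{\frac{1}{b_s+1}} = \bar{x}_i^{comp} + \tilde{w}s_{i+1}^{comp}. \qquad (26)$$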
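The closed form in (26) can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names and the numeric values are hypothetical, the positive factor $a_s/(t_i^R - s_i)^{b_s}$ of (24) is dropped because it does not move the minimum, and the prediction of the last region is assumed to equal its own workload.

```python
def effective_remaining(ws_next, x_next, b_s):
    """Effective remaining workload of n_{i+1}: ((ws_{i+1})^b_s * x_{i+1})^(1/(b_s+1))."""
    return (ws_next ** b_s * x_next) ** (1.0 / (b_s + 1.0))


def ws_case1(x_mean, ws_next, x_next, b_s):
    """Closed-form local-optimal prediction of (26) for Case 1."""
    return x_mean + effective_remaining(ws_next, x_next, b_s)


def energy_shape(w, x_mean, ws_next, x_next, b_s):
    """Bracketed term of (24); the constant positive factor is omitted."""
    return w ** b_s * x_mean + (ws_next / (1.0 - x_mean / w)) ** b_s * x_next


# Hypothetical values (cycles); b_s = 2 is an illustrative exponent only.
b_s = 2.0
x_mean, x_next, ws_next = 1.0e6, 2.0e6, 2.0e6
w_star = ws_case1(x_mean, ws_next, x_next, b_s)
print(w_star)  # close to 3.0e6 cycles for these inputs
```

Evaluating `energy_shape` slightly below and above `w_star` gives larger values on both sides, which is what the zero-derivative condition in (25) predicts for a convex function.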

Fig. 5. Calculation of $ws_i^{comp}$. (a) Case 1: unit functions for both $x_i^{comp}$ and $t_i^{stall}$. (b) Case 2: runtime distribution for $x_i^{comp}$ and unit function of $t_i^{stall}$. (c) Case 3: runtime distributions for both $x_i^{comp}$ and $t_i^{stall}$.

opt

start of ni through the coordination of local-optimal workload

predictions.

Consumption of Single Energy Component

In this section, we assume that a program is partitioned into two consecutive program regions, $n_i$ and $n_{i+1}$, and present a method which finds a local-optimal workload prediction while exploiting the runtime distribution of both computational workload and memory stall time. As Fig. 5 shows, we will explain the local-optimal workload prediction method in three different cases of $J_i$, the joint PDF of $x_i^{comp}$ and $t_i^{stall}$. Case 1: when $J_i$ is given as a unit function while $J_{i+1}$ is a general function as shown in Fig. 5(a). Case 2: when $x_i^{comp}$ alone has a runtime distribution while $t_i^{stall}$ is a unit function as shown in Fig. 5(b). Case 3: when both $x_i^{comp}$ and $t_i^{stall}$ have runtime distributions as shown in Fig. 5(c).

A. Case 1: Both $x_i^{comp}$ and $t_i^{stall}$ Are Unit Functions

In this case, the joint PDF of $n_i$ is given as a unit function as shown in Fig. 5(a). We define $ws_i^{comp}$, $wl_i^{comp}$, $ws_i^{stall}$, $wl_i^{stall}$, and $wb_i$ as the local-optimal workload predictions for minimizing $Es_i^{comp}$, $El_i^{comp}$, $Es_i^{stall}$, $El_i^{stall}$, and $Eb_i$, respectively. Given the joint PDFs, $J_i$ and $J_{i+1}$, average switching energy consumption for running computational workload, i.e., $\overline{Es}_i^{comp}$, is calculated as the sum of $Es_i^{comp}$ with respect to $J_i$ and $J_{i+1}$ as follows:

$$\overline{Es}_i^{comp} = \sum\sum Es_i^{comp} J_i J_{i+1} = \frac{a_s}{(t_i^R - s_i)^{b_s}} \left[ w_i^{b_s}\,\bar{x}_i^{comp} + \left( \frac{ws_{i+1}^{comp}}{1 - \bar{x}_i^{comp}/w_i} \right)^{b_s} \bar{x}_{i+1}^{comp} \right] \qquad (24)$$

Equation (26) shows that $ws_i^{comp}$ is the sum of 1) the workload of the ith program region, i.e., $\bar{x}_i^{comp}$, and 2) $\tilde{w}s_{i+1}^{comp} = ((ws_{i+1}^{comp})^{b_s}\,\bar{x}_{i+1}^{comp})^{1/(b_s+1)}$, called effective remaining workload of $n_{i+1}$ with respect to $Es_i^{comp}$, corresponding to the portion of remaining workload after program region $n_i$. Fig. 5(a) illustrates the calculation of $ws_i^{comp}$ presented in (26), where $J_i$ and $J_{i+1}$ are replaced by their representative workloads, i.e., $\bar{x}_i^{comp}$ and $\tilde{w}s_{i+1}^{comp}$, respectively. In the same way, $wl_i^{comp}$, $ws_i^{stall}$, $wl_i^{stall}$, and $wb_i$ can be expressed as follows:

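$$wl_i^{comp} = \bar{x}_i^{comp} + \left( (wl_{i+1}^{comp})^{b_l}\,\bar{x}_{i+1}^{comp} \right)^{\frac{1}{b_l+1}} = \bar{x}_i^{comp} + \tilde{w}l_{i+1}^{comp} \qquad (27)$$

$$ws_i^{stall} = \bar{x}_i^{comp} + \left( \frac{\bar{x}_i^{comp}}{\bar{t}_i^{stall}}\,(ws_{i+1}^{stall})^{b_s+1}\,\bar{t}_{i+1}^{stall} \right)^{\frac{1}{b_s+2}} = \bar{x}_i^{comp} + \tilde{w}s_{i+1}^{stall} \qquad (28)$$

$$wl_i^{stall} = \bar{x}_i^{comp} + \left( \frac{\bar{x}_i^{comp}}{\bar{t}_i^{stall}}\,(wl_{i+1}^{stall})^{b_l+1}\,\bar{t}_{i+1}^{stall} \right)^{\frac{1}{b_l+2}} = \bar{x}_i^{comp} + \tilde{w}l_{i+1}^{stall} \qquad (29)$$

$$wb_i = \bar{x}_i^{comp} + \left( \frac{\bar{x}_i^{comp}}{\bar{t}_i^{stall}}\,wb_{i+1}\,\bar{t}_{i+1}^{stall} \right)^{\frac{1}{2}} = \bar{x}_i^{comp} + \tilde{w}b_{i+1} \qquad (30)$$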

where $\tilde{w}l_{i+1}^{comp}$, $\tilde{w}s_{i+1}^{stall}$, $\tilde{w}l_{i+1}^{stall}$, and $\tilde{w}b_{i+1}$ are the effective remaining workloads of $n_{i+1}$ with respect to $El_i^{comp}$, $Es_i^{stall}$, $El_i^{stall}$, and $Eb_i$, respectively. Since the local-optimal workload can be calculated by simply summing effective remaining workloads of program regions as shown in (26)–(30), it can be obtained during program runs with negligible runtime overhead.3

If the software program consists of a cascade of program

regions with conditional branches, we can still calculate the

effective remaining workload of program region in a similar

manner to [10].

3 The runtime overhead of the local-optimal workload prediction is presented

in Table VI.

KIM et al.: PROGRAM PHASE-AWARE DYNAMIC VOLTAGE SCALING UNDER VARIABLE COMPUTATIONAL WORKLOAD

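B. Case 2: Runtime Distribution for $x_i^{comp}$ and Unit Function for $t_i^{stall}$

Here, $x_i^{comp}$ has a runtime distribution with $t_i^{stall}$ still assumed as unit function as shown in Fig. 5(b). In this case, average energy consumption of computational workload is expressed as follows:

$$\overline{Es}_i^{comp} = \sum\sum Es_i^{comp} J_i J_{i+1} = \frac{a_s}{(t_i^R - s_i)^{b_s}} \left[ w_i^{b_s}\,\bar{x}_i^{comp} + (ws_{i+1}^{comp})^{b_s}\,\bar{x}_{i+1}^{comp} \sum_{j=1}^{N_c} \frac{p_i^{comp}(j)}{(1 - x_i^{comp}(j)/w_i)^{b_s}} \right] \qquad (31)$$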

Note that, in this case, $ws_i^{comp}$ can also be calculated independently of $t_i^R$ without loss of quality. However, contrary to (26), no explicit form exists for $ws_i^{comp}$. Thus, $ws_i^{comp}$ can be obtained only through a numerical solution approach as presented in [11], which is too time-consuming for runtime application. Instead, being inspired by (26), we can model the solution, $ws_i^{comp}$, as follows:

$$ws_i^{comp} = \tilde{x}s_i^{comp} + \tilde{w}s_{i+1}^{comp} \qquad (33)$$

where $\tilde{x}s_i^{comp}$ is the effective workload of program region $n_i$ for $Es_i^{comp}$. $\tilde{w}s_{i+1}^{comp}$ is obtained in the same way as presented in (26). From our observation that the energy-optimal workload prediction tends to have a value near the average and depends on the runtime distribution, we model $\tilde{x}s_i^{comp}$ as follows:

comp

comp

comp

xs

= (1 + si ) xi

i

comp

comp

and Index 3:

follows:

where $N_c$ is the number of quantized levels of $x_i^{comp}$ in its PDF. $p_i^{comp}(j)$ represents the probability of $x_i^{comp}$ falling into the jth quantized level. Note that, in this case where $t_i^{stall}$ is given as a unit function, the joint PDF ($J_i$) of $x_i^{comp}$ and $t_i^{stall}$ is the same as the PDF of $x_i^{comp}$ at the given $t_i^{stall}$, i.e., $p_i^{comp}$. $ws_i^{comp}$ can be obtained by finding $w_i$ which satisfies the following relation:

$$\frac{\partial \overline{Es}_i^{comp}}{\partial w_i} = \frac{a_s b_s w_i^{b_s-1}}{(t_i^R - s_i)^{b_s}} \left( \bar{x}_i^{comp} - (ws_{i+1}^{comp})^{b_s}\,\bar{x}_{i+1}^{comp} \sum_{j=1}^{N_c} \frac{x_i^{comp}(j)\,p_i^{comp}(j)}{(w_i - x_i^{comp}(j))^{b_s+1}} \right) = 0. \qquad (32)$$
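Relation (32) has no closed-form solution, but its root is easy to bracket, which is how a design-time characterization could be done. The sketch below is an illustration under stated assumptions, not the numerical method of [11]: it uses plain bisection, the function name and inputs are hypothetical, and it relies on the fact that the bracketed term of (32) grows monotonically with $w_i$ once $w_i$ exceeds the largest quantized workload.

```python
def solve_ws_numeric(levels, probs, ws_next, x_next, b_s, tol=1e-9):
    """Find w_i satisfying (32) by bisection.

    levels/probs: quantized workload levels x_i(j) and probabilities p_i(j).
    ws_next, x_next: prediction and mean workload of the next region.
    """
    x_mean = sum(x * p for x, p in zip(levels, probs))
    scale = ws_next ** b_s * x_next

    def g(w):
        # Bracketed term of (32); the positive prefactor does not affect the root.
        return x_mean - scale * sum(
            x * p / (w - x) ** (b_s + 1) for x, p in zip(levels, probs)
        )

    lo = max(levels)          # g tends to -infinity just above the largest level
    hi = 2.0 * lo + 1.0
    while g(hi) <= 0.0:       # grow the bracket until g changes sign
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# Hypothetical inputs: a single level reproduces Case 1; a spread PDF does not.
w_unit = solve_ws_numeric([1.0e6], [1.0], 2.0e6, 2.0e6, 2.0)
w_spread = solve_ws_numeric([0.5e6, 1.0e6, 1.5e6], [0.25, 0.5, 0.25], 2.0e6, 2.0e6, 2.0)
```

With a single quantized level the result coincides with the closed form of (26). With the spread distribution of the same mean, the root moves upward, so a wider distribution calls for a more conservative prediction.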

wsi

comp

/xi

comp

comp

comp

xi

/

wsi+1 , and (b) Index 2: skewness (gi

), at 75 C.

(34)

where si

is a parameter which represents the ratio of

comp

comp

comp

the distance between xs

and xi

to xi . We calculate

i

comp

i

by exploiting the pre-characterization of solutions. First,

xs

comp

we prepare a lookup table LUTscomp for si

during design

comp

during runtime.

time and perform table lookup to obtain si

comp

si

depends on the shape of runtime distribution. Thus, we

derived the indexes of LUTscomp as follows:

1) Index 1: $\sigma_i^{comp}/\bar{x}_i^{comp}$, normalized standard deviation ($\sigma_i^{comp}$) with respect to the mean of $n_i$ ($\bar{x}_i^{comp}$);
2) Index 2: $g_i^{comp}$, skewness of $x_i^{comp}$;
3) Index 3: $\bar{x}_i^{comp}/\tilde{w}s_{i+1}^{comp}$, ratio of the mean of $n_i$ ($\bar{x}_i^{comp}$) to the effective remaining workload of $n_{i+1}$ ($\tilde{w}s_{i+1}^{comp}$).
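The three indexes can be computed directly from profiled cycle counts. The sketch below is illustrative only: the function name is hypothetical, and skewness is assumed to be the standardized third central moment, since the text does not give its exact definition.

```python
from statistics import fmean, pstdev


def lut_indexes(samples, w_eff_next):
    """Return (Index 1, Index 2, Index 3) for one program region.

    samples: profiled computational cycles of region n_i.
    w_eff_next: effective remaining workload of region n_{i+1}.
    """
    mean = fmean(samples)
    std = pstdev(samples, mu=mean)
    # Assumed skewness definition: mean of cubed standardized deviations.
    if std == 0.0:
        skew = 0.0
    else:
        skew = fmean([((x - mean) / std) ** 3 for x in samples])
    return std / mean, skew, mean / w_eff_next


# Hypothetical samples: a symmetric set and a right-skewed set.
idx_sym = lut_indexes([1.0, 2.0, 3.0], 4.0)
idx_right = lut_indexes([1.0, 1.0, 1.0, 5.0], 4.0)
```

In a runtime setting the returned values would then be rounded to the nearest step of each LUT index before the lookup.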

The rationale of choosing the three indexes is as follows. By

comp

substituting wsi

with (33) and (34), (32) is rearranged as

comp

xi

comp

+ (

wsi+1 )bs +1

Nc

(35)

j=1

comp

xi

comp

((1 + si

comp

) xi

comp

(j)pi

(j)

comp

comp

si+1 xi

+w

(j))bs +1

= 0.

comp

can be obtained by finding a

comp

point which satisfies (35). As shown in (35), si

depends

comp

comp

scomp

on xi , w

(Index

3),

and

the

of

x

, i.e.,

i

i+1

comp

comp

xi (j), pi (j), which is modeled as a skewed normal

distribution in this paper, since the PDF usually does not have

a nice normal distribution.4 The skewed normal distribution is

comp

comp

comp

characterized with three parameters: xi , i , and gi

comp

(Index 1 and Index 2). Fig. 6 illustrates si

as the indexes

change.

comp

Fig. 6 shows si

as a function of three indexes above. As

comp

shown in Fig. 6(a), si

increases with the wider distribution

comp

comp comp

of xi , i.e., i /xi

increases, and increases as the

workload of ni (relative to the effective remaining workload

comp

comp

of ni+1 ), i.e., xi /

wsi+1 , increases. It also increases as the

comp

gi , skewness of PDF, moves to the right (gi > 0) as

Fig. 6(b) shows.

comp

In the same way, wli , wsistall , wlistall , and wbi can also

comp

be calculated by finding li , sistall , listall , and bi from

LUTlcomp , LUTsstall , LUTlstall , and LUTb , respectively. Note

comp

bi can be obtained by performing table lookup

that si

with the statistical parameters (e.g., mean, standard deviation,

and skewness) and effective workload of ni+1 . Thus, it can

be performed with negligible runtime overhead to find a

local-optimal workload prediction while exploiting the runtime

distribution of computational workload.

comp

C. Case 3: Both xi

comp

xi

When both

as shown in Fig. 5(c), average switching energy consumption

comp

for running computational workload, i.e., Esi , can be

comp

calculated as the sum of Esi

with respect to the joint PDFs

4 Note that more accurate workload prediction can be performed with an additional effort, as presented in [19], where the PDF of $x_i^{comp}$ is modeled as a multimodal distribution with each mode given as a skewed normal distribution. Although more energy savings can be obtained from the multimodal modeling, in this paper, we simply approximated the PDF as a single-modal skewed normal distribution in order to reduce the runtime overhead. However, it can be easily extended to the multimodal case [19].


comp

comp

Esi

=

Esi Ji Ji+1

bs comp

as

comp

=

wi xi

+ Zsi

R

b

s

(ti si )

(36)

where

comp

Zsi

comp

comp

Ns

Nc

j=1 k=1

J(j, k)

(i (j, k))bs

(37)

comp

when (xi , tistall ) falls into the (j, k)th quantized level, respectively. Since we set the predicted remaining memory stall time

stall

(si ) to the sum of average of tistall and ti+1

, tistall [defined in

(22)] in i is not zero any longer. Due to the nonzero tistall ,

the local-optimal workload prediction is a function of tiR . To

reduce the solution complexity, we approximate the calculation

comp

of Zsi

in (37) as follows:

comp

comp

comp

comp

si

(wsi+1 )bs xi+1

Zsi

Nc

j=1

(1

comp

pi (j)

comp

xi (j)/wi )bs

TABLE II

Threshold Parameters Used in Coordination

Coordination

Step

C1

Threshold

Parameter

Condition

fscomp

b

(as f bs )

lf l )

c (af

f

b

b

s

(al f l )

c (asff )

f

b

+1

l

(al f

)

(cf )

f c

f

(al f bl +1 )

(cf )

c f

f

bl +1

(as f bs +1 )

c (al f f +cf )

f

b

+1

bs +1

l

(al f

+cf )

c (asff )

f

flcomp

C2

fbstall

flstall

(38)

C3

fsstall

fLstall

where

comp

si

Ns

Nc

j=1 k=1

comp

1 xi /wi

i (j, k)

bs

:

c

Ji (j, k).

As shown in (39),

(21) is a function of wi and si . Note that (wi , si ) will be

calculated at the end of the current program phase using the

joint PDFs (Ji and Ji+1 ) profiled during the time period of

the current program phase. To simplify the interdependence

comp

between si

and (wi , si ), we approximate the calculation

comp

comp

of si

by replacing (wi , si ) with (wsi , si ) of the current

program phase. By substituting (38) with the approximated

comp

si , we can rearrange (36) as follows:

as

comp

comp

comp

Esi

wbi s xi

+ si

(40)

(tiR si )bs

Nc

comp

comp b comp

p

(j)

i

.

(wsi+1 ) s xi+1

comp

(1 xi /wi )bs

j=1

comp

comp

in a similar way as (32) and (33), we can express wsi ,

comp

which minimizes Esi , as follows:

comp

wsi

comp

i

= xs

comp

si+1

+w

1/(bs +1)

comp

comp bs comp

scomp

w

=

s

(ws

)

x

.

i

i+1

i+1

i+1

(39)

comp

si

where

opt

where C1C4 represent coordination steps.

(41)

(42)

comp

Compared to the calculation of wsi

comp

2 in Fig. 5(a) and (b), the only difference is that (si )1/(bs +1)

is multiplied in the calculation of effective remaining workload

comp

5

scomp

of ni+1 , i.e., w

, wsistall , wlistall , and wbi can also

i+1 . wli

be calculated in the same way.

comp

i

i

comp

si+1 becomes the same as Case I and Case II.

becomes 1, thereby w

comp

workload prediction, i.e., finding si

bi with respect

to the runtime distribution, in a design-time step, and then,

we store the parameters into LUTs. Thus, we can drastically

reduce the runtime overhead of finding workload prediction

while accurately considering the influence of the runtime

distribution in workload predictions because we only access the LUTs to find workload prediction during runtime.

However, it requires additional memory space to store the

pre-characterized data. The runtime and area overhead are

presented in Section IX-C.

In this section, we present a method called coordination to find the global workload prediction of $n_i$ ($w_i^{opt}$) based on the local-optimal workload predictions, i.e., $ws_i^{comp}$, $wl_i^{comp}$, $ws_i^{stall}$, $wl_i^{stall}$, and $wb_i$. As (23) shows, the workload prediction which minimizes average total energy consumption at given $t_i^R$ varies according to the sensitivity of each energy component with respect to $w_i$, i.e., $\partial Es_i^{comp}/\partial w_i$, $\partial El_i^{comp}/\partial w_i$, $\partial Es_i^{stall}/\partial w_i$, $\partial El_i^{stall}/\partial w_i$, and $\partial Eb_i/\partial w_i$ in (23).

Since the coordination of workload predictions is performed online, it needs to be done with low overhead. To achieve this goal, we present a simple hierarchical method which finds $w_i^{opt}$ from local-optimal workload predictions (independent of $t_i^R$), as shown in Fig. 7. As Fig. 7 shows, first, we obtain the workload prediction for each workload type, i.e., computational workload ($w_i^{comp}$) through a coordination step called C1 and memory stall workload ($w_i^{stall}$) through coordination steps called C2 and C3. Then, we find $w_i^{opt}$ from $w_i^{comp}$ and $w_i^{stall}$ through a coordination step called C4.

Fig. 8. Linear coordination. (a) C1: $ws_i^{comp}$ and $wl_i^{comp}$ to find $w_i^{comp}$. (b) C4: $w_i^{comp}$ and $w_i^{stall}$ to find $w_i^{opt}$.

1) Coordination for $w_i^{comp}$ (C1): A workload prediction for computational workload, $w_i^{comp}$, represents the prediction which minimizes $E_i^{comp}$, i.e., sum of $Es_i^{comp}$ and $El_i^{comp}$. Therefore, $w_i^{comp}$ depends on $ws_i^{comp}$ and $wl_i^{comp}$. In this coordination, we utilize the fact that $El_i^{comp}$ has exponential dependency on frequency in combined $V_{dd}/V_{bb}$ scaling. The rationale is explained as follows. In the low frequency region, a high reverse body bias voltage can be applied, suppressing the leakage energy consumption due to high $V_{th}$. As frequency increases, $|V_{bb}|$ is decreased to enable higher clock frequency operation by reducing $V_{th}$, which drastically increases leakage energy consumption.

In combined $V_{dd}/V_{bb}$ scaling, increase of switching energy consumption (with respect to frequency increase), i.e., $\partial e_s/\partial f$, dominates that of leakage energy consumption in the lower frequency region, while increase of leakage energy consumption, i.e., $\partial e_l/\partial f$, dominates in the relatively high frequency region [27]. Therefore, when most operating frequency falls into the frequency range where the sensitivity of switching energy consumption is much larger than that of leakage energy consumption, i.e., $\partial e_s/\partial f \gg \partial e_l/\partial f$, $w_i^{comp}$ approaches $ws_i^{comp}$ because switching energy consumption is the major contributor in this frequency region. On the other hand, when the operating frequency is within the frequency region where $\partial e_s/\partial f \ll \partial e_l/\partial f$, $w_i^{comp}$ approaches $wl_i^{comp}$.

We partition the frequency range into three regions: switching energy-dominant, leakage energy-dominant, and intermediate regions. The partition is done with two threshold frequencies, $f_s^{comp}$ and $f_l^{comp}$. The frequency range below $f_s^{comp}$ (above $f_l^{comp}$) is called switching (leakage) energy-dominant region while the frequency range between the two threshold frequencies is called intermediate region. Each energy component has two threshold frequencies as shown in Table II.

In order to identify which frequency partition the current program region belongs to, we introduce a simple evaluation metric, $f_i^{eval}$, as the upper bound of the operating frequency in the remaining program regions from $n_i$ to $n_{leaf}$:

$$f_i^{eval} = \frac{WCEC_i^{comp(k)}}{t_i^R - WCET_i^{stall(k)}}. \qquad (43)$$

In (43), $WCEC_i^{comp(k)}$ and $WCET_i^{stall(k)}$ represent the remaining worst-case execution cycle of computational workload and remaining worst-case memory stall time from $n_i$ to $n_{leaf}$ when the current program phase is the kth program phase, respectively. The solid line in Fig. 8(a) illustrates a linear coordination method to find $w_i^{comp}$ by utilizing $f_i^{eval}$. When $f_i^{eval}$ is lower than the threshold value, $f_s^{comp}$ (in the second row in Table II where c is set to 5.0 in our experiment), we set $w_i^{comp}$ to $ws_i^{comp}$ because the remaining program regions will be operated within the switching energy-dominant frequency region. When $f_i^{eval}$ is higher than the threshold value, $f_l^{comp}$ (in the third row in Table II), we set $w_i^{comp}$ to $wl_i^{comp}$. As the last case, i.e., $f_s^{comp} < f_i^{eval} < f_l^{comp}$, we set $w_i^{comp}$ in proportion to the ratio of ($f_i^{eval} - f_s^{comp}$) to ($f_l^{comp} - f_s^{comp}$) using a linear interpolation function L() defined as follows:

$$L(X_{lower}, X_{upper}, Y_{lower}, Y_{upper}, X_{eval}) = \frac{X_{eval} - X_{lower}}{X_{upper} - X_{lower}}\,(Y_{upper} - Y_{lower}) + Y_{lower}. \qquad (44)$$

By setting $X_{lower} = f_s^{comp}$, $X_{upper} = f_l^{comp}$, $Y_{lower} = ws_i^{comp}$, $Y_{upper} = wl_i^{comp}$, and $X_{eval} = f_i^{eval}$, we can obtain $w_i^{comp}$ as the output of the function L().
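The C1 step described above amounts to a clamped linear interpolation. A minimal sketch follows, with hypothetical function names and made-up threshold values (in the paper the thresholds come from Table II, which is not reproduced here):

```python
def linear_interp(x_lower, x_upper, y_lower, y_upper, x_eval):
    """Linear interpolation function L() of (44)."""
    return (x_eval - x_lower) / (x_upper - x_lower) * (y_upper - y_lower) + y_lower


def eval_frequency(wcec_comp, t_remaining, wcet_stall):
    """Evaluation metric of (43): worst-case cycles over worst-case compute time."""
    return wcec_comp / (t_remaining - wcet_stall)


def coordinate_c1(ws_comp, wl_comp, f_s, f_l, f_eval):
    """Pick w_i^comp from the two local-optimal predictions."""
    if f_eval <= f_s:      # switching energy-dominant region
        return ws_comp
    if f_eval >= f_l:      # leakage energy-dominant region
        return wl_comp
    return linear_interp(f_s, f_l, ws_comp, wl_comp, f_eval)


# Hypothetical numbers: 1e9 worst-case cycles, 1.0 s left, 0.5 s worst-case stall.
f_eval = eval_frequency(1.0e9, 1.0, 0.5)
w_comp = coordinate_c1(3.0e6, 4.0e6, 1.0e9, 3.0e9, f_eval)
```

The same `coordinate_c1` shape would serve C2 and C3 by passing the corresponding pair of predictions and thresholds.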

2) Coordination for $w_i^{stall}$ (C2 and C3): A workload prediction for memory stall, $w_i^{stall}$, represents the prediction which minimizes $E_i^{stall}$. Since $E_i^{stall}$ depends on $Eb_i$ as well as $Es_i^{stall}$ and $El_i^{stall}$, $w_i^{stall}$ can be derived from $wb_i$ as well as $ws_i^{stall}$ and $wl_i^{stall}$. To obtain $w_i^{stall}$ by coordinating the three local-optimal workload predictions, we perform the coordination in two steps as shown in Fig. 7. First, we find $wL_i^{stall}$ by coordinating $wl_i^{stall}$ and $wb_i$, i.e., C2, both of which are related to leakage energy consumption. Then, we find $w_i^{stall}$ by coordinating $ws_i^{stall}$ and $wL_i^{stall}$, i.e., C3. Note that the coordination for $wL_i^{stall}$ can be done in the same way as $w_i^{comp}$, which is shown in Fig. 8(a), by simply substituting ($ws_i^{comp}$, $wl_i^{comp}$) by ($wl_i^{stall}$, $wb_i$) and ($f_s^{comp}$, $f_l^{comp}$) by ($f_l^{stall}$, $f_b^{stall}$), where $f_l^{stall}$ and $f_b^{stall}$ are threshold values defined in the fourth and fifth rows in Table II, respectively. In the same way, the coordination for $w_i^{stall}$ can also be done by the substitution of corresponding workload predictions and threshold values, i.e., $f_s^{stall}$ and $f_L^{stall}$ in Table II.

opt

3) Coordination for wi (C4): The last step of the coordiopt

comp

nation is to obtain wi from wi

and wstall

i . In CPU-bound

opt

comp

comp

since Ei

applications, wi approaches wi

dominates

Eistall . On the contrary, in case of memory-bound applications,

opt

wstall

contributes more to wi . We calculate the maximum

i

memory-boundedness of the remaining program region from

ni to nleaf , denoted by i , as the ratio of the worst-case remaining memory stall cycles from ni at fieval (43) to that of the comcomp(k)

.

putational cycles, i.e., i = fieval WCETistall(k) /WCECi

Fig. 8(b) illustrates the linear coordination method to find

opt

wi by utilizing i . As i becomes larger (smaller), the

remaining work is characterized to be more memory-bound

(CPU-bound). When i is smaller than a certain threshold

value, called comp (0.5, in our experiment), we regard that

opt

the remaining workload is CPU-bound, thereby, we set wi

comp

to wi . On the other hand, if i is larger than a certain

threshold value, called stall (=1/ comp , in our experiment),

opt

we set wi to wstall

since the remaining work is memory

i

bound. In the intermediate case, i.e., comp < < stall ,

opt

we set wi in proportion to the ratio of ( i comp ) to

stall

( comp ) using (44).

After $w_i^{opt}$ is obtained, voltage/frequency is set to $f_i = w_i^{opt}/(t_i^R - s_i)$, where $t_i^R$ is measured at the start of each program region. We then check whether the performance level satisfies the given deadline constraint even if the worst-case execution time occurs after the frequency is set, which is called feasibility check. More details are explained in [10] and [27].

VIII. Program Phase Detection

Program phase, especially, in terms of computational cycles

and memory stall time, during PHASE UNIT (as defined

in Algorithm 1) is characterized by a salient difference in

computational cycle and memory stall time. Conventionally,

the program phase is characterized by utilizing only average

execution cycle of basic blocks without exploiting the runtime

distributions of computational cycle and memory stall time

[14], [15]. To exploit the runtime distributions in characterizing a program phase, we define a new program phase vector

consisting of five local-optimal workload predictions for each

program region. Note that local-optimal workload predictions

reflect the correlation as well as the runtime distributions of

both computational cycle and memory stall time. Thus, a set

of local-workload predictions becomes a good indicator which

represents the joint PDF of each program region. The program

phase vector of the kth program phase is defined as follows:

$$W^{(k)} = [Ws^{comp(k)}, Wl^{comp(k)}, Ws^{stall(k)}, Wl^{stall(k)}, Wb^{(k)}]^T \qquad (45)$$

where

$$Ws^{comp(k)} = [ws_{root}^{comp(k)}, \ldots, ws_i^{comp(k)}, \ldots, ws_{leaf}^{comp(k)}] \qquad (46)$$
$$Wl^{comp(k)} = [wl_{root}^{comp(k)}, \ldots, wl_i^{comp(k)}, \ldots, wl_{leaf}^{comp(k)}] \qquad (47)$$
$$Ws^{stall(k)} = [ws_{root}^{stall(k)}, \ldots, ws_i^{stall(k)}, \ldots, ws_{leaf}^{stall(k)}] \qquad (48)$$
$$Wl^{stall(k)} = [wl_{root}^{stall(k)}, \ldots, wl_i^{stall(k)}, \ldots, wl_{leaf}^{stall(k)}] \qquad (49)$$
$$Wb^{(k)} = [wb_{root}^{(k)}, \ldots, wb_i^{(k)}, \ldots, wb_{leaf}^{(k)}]. \qquad (50)$$

At every PHASE UNIT (set as the time for decoding 20 frames in our experiments), we check to see whether a program phase is changed. It

is evaluated by calculating Hamming distance between program phase vector of the current period and that of current

program phase. When the Hamming distance is greater than

the threshold called p (set to 10% of the magnitude of

the current program phase vector in our experiments), we

evaluate that the program phase is changed, and then, check

to see if there is any previous program phase whose Hamming

distance with the program phase vector of the current period is

within the threshold p . If so, we reuse local-optimal workload

predictions of the matched previous phase as that of the new

phase to set voltage/frequency. If there is no previous phase

satisfying the condition, we store the newly detected program

phase and use the local-optimal workload predictions of a

newly detected program phase to set voltage/frequency until

the next program phase detection.
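The detection procedure above can be sketched as follows. This is an illustration under explicit assumptions rather than the authors' code: the phase vector is flattened into a plain list, the distance is taken as the sum of absolute element-wise differences (one reading of the distance used in the text for real-valued vectors), the magnitude is the sum of absolute elements, and the function names are hypothetical.

```python
def distance(u, v):
    """Assumed distance: sum of absolute element-wise differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def magnitude(v):
    """Assumed magnitude: sum of absolute elements."""
    return sum(abs(a) for a in v)


def detect_phase(period_vec, current_id, phases, ratio=0.10):
    """Return the index of the phase to use for the next period.

    phases: list of stored phase vectors; it is extended when a new phase appears.
    ratio: threshold as a fraction of the current phase vector's magnitude.
    """
    threshold = ratio * magnitude(phases[current_id])
    if distance(period_vec, phases[current_id]) <= threshold:
        return current_id                      # phase unchanged
    for k, vec in enumerate(phases):
        if distance(period_vec, vec) <= threshold:
            return k                           # reuse a previous phase
    phases.append(list(period_vec))            # store the newly detected phase
    return len(phases) - 1


# Hypothetical two-element phase vectors.
phases = [[100.0, 100.0]]
same = detect_phase([105.0, 95.0], 0, phases)
new = detect_phase([200.0, 200.0], 0, phases)
back = detect_phase([101.0, 99.0], new, phases)
```

In this run the first call keeps phase 0, the second stores a new phase, and the third returns to the stored phase 0, mirroring the reuse of previously detected phases.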

IX. Experimental Results

A. Setup

In our experiments, we used two real-life multimedia programs, MPEG4 and H.264 decoder in FFMPEG [29]. First, we used, in total, 4200 frames of 1920 × 1080 video clip consisting of eight test pictures, including Rush Hour (500 frames), Station2 (300 frames), Sunflower (500 frames), Tractor (690 frames), SnowMnt (570 frames), InToTree (500 frames), ControlledBurn (570 frames), and TouchdownPass (500 frames) in [30]. Second, we used 3000 frames of 1920 × 800 movie clip (as excerpted from Dark Knight). We inserted nine voltage/frequency setting points in each program: seven for macroblock decoding and two for file write operation for decoded image. We performed profiling with PAPI [31] running on LG XNOTE with Linux 2.6.3.

We performed experiments at 25, 50, 75, and 100 °C. We calculated the energy consumption using the processor energy model with combined $V_{dd}/V_{bb}$ shown in Section III-A. The parameters in (1)–(4) of the processor energy model were obtained from PTscalar [21] and Cacti5.3 with BPTM high-k/metal gate 32 nm HP model. We used seven discrete frequency levels from 333 MHz to 2.333 GHz with 333 MHz

step size. We set 20 s as the time overhead for switching

voltage/frequency levels and calculate the energy overhead

using the model presented in [7].

We compared the following four methods.

1) RT-CM-AVG [4]: runtime DVFS method based on the

average ratio of memory stall time and computational

cycle (baseline).

2) RT-C-DIST [19]: runtime DVFS method which only

exploits the PDF of computational cycle.

3) DT-CM-DIST [28]: design-time DVFS method which

exploits the joint PDF of computational cycle and memory stall time.

4) RT-CM-DIST : runtime version of DT-CM-DIST (proposed).

We modified the original RT-CM-AVG [4], which runs intertask DVFS without real-time constraint, such that it supports intratask DVFS with a real-time constraint. In running DT-CM-DIST [28], we performed a workload prediction with respect to 20 quantized levels of remaining time, i.e., bins, using the joint PDF of the first 100 frames in design time.

B. Energy Savings

Table III(a) and (b) shows the comparisons of energy consumption for MPEG4 and H.264 decoder, respectively, at 75 °C. The first column shows the name of test pictures. Columns 2, 3, and 4 represent the energy consumption of each DVFS method normalized with respect to that of RT-CM-AVG. Compared with RT-CM-AVG [4], our method, RT-CM-DIST, offers 5.1–34.6% and 4.5–17.3% energy savings for MPEG4 and H.264 decoder, respectively. Fig. 9 shows the statistics of used frequency levels when running SnowMnt in MPEG4 decoder. As Fig. 9 shows, RT-CM-AVG uses the lowest frequency level, i.e., 333 MHz, more frequently than the other two methods. It also leads to the frequent use of high frequency levels, i.e., frequency levels above 2.00 GHz where energy consumption drastically increases as frequency rises, in order to meet the real-time constraint. However, by considering the runtime distribution in RT-CM-DIST, high frequency

TABLE III
Comparison of Energy Consumption for Test Pictures at 75 °C: (a) MPEG4 (20 Frames/s) and (b) H.264 Decoder (12 Frames/s)

(a)
Image            RT-C-DIST [19]   DT-CM-DIST [28]   RT-CM-DIST (Proposed)
Rush Hour        1.08             0.83              0.79
Station2         1.34             0.97              0.95
Sunflower        0.99             0.76              0.74
Tractor          1.01             0.78              0.75
SnowMnt          1.02             0.90              0.81
InToTree         0.97             0.79              0.71
ControlledBurn   0.88             0.67              0.65
TouchdownPass    1.15             0.91              0.86
Average          1.05             0.83              0.78

(b)
Image            RT-C-DIST [19]   DT-CM-DIST [28]   RT-CM-DIST (Proposed)
Rush Hour        1.11             0.94              0.93
Station2         1.05             0.90              0.83
Sunflower        1.09             0.97              0.88
Tractor          1.18             1.00              0.96
SnowMnt          1.03             1.03              0.84
InToTree         1.14             1.00              0.93
ControlledBurn   1.07             0.94              0.88
TouchdownPass    1.10             0.99              0.94
Average          1.10             0.97              0.90

Fig. 9. Statistics of used frequency levels when running SnowMnt in MPEG4 decoder.

because the workload prediction with distribution awareness

is more conservative than average-based method.

Table IV shows energy savings results for one of the test pictures, i.e., SnowMnt, at four temperatures, 25 °C, 50 °C, 75 °C, and 100 °C. As the table shows, more energy savings can be achieved as temperature increases. This is because the energy penalty caused by frequent use of high frequency levels becomes more pronounced as temperature increases, since leakage energy consumption increases exponentially with temperature. By considering the temperature dependency of leakage energy consumption, RT-CM-DIST sets voltage/frequency so as to use high frequency levels less frequently as temperature increases, while RT-CM-AVG does not consider the temperature increase. Note that, in most cases, MPEG4 decoder gives more energy savings than H.264 case. This is because, as Fig. 10 shows, the memory boundedness (defined as the ratio of memory stall time to computational cycle) of MPEG4 has a wider distribution than that of H.264 in terms of Max/Avg and Max/Min ratios.

Compared with RT-C-DIST [19], which exploits only the distribution of computational cycle in runtime, RT-CM-DIST provides 20.8–28.9% and 15.1–21.0% further energy savings for MPEG4 and H.264 decoder, respectively. The amount of further energy savings represents the effectiveness of considering the distribution of memory stall time as well as the correlation between computational cycle and memory stall time, i.e., the joint PDF of computational cycle and memory stall time. RT-C-DIST regards the whole number of clock cycles, which is profiled at the end of every program region,

Fig. 10. Distribution of memory boundedness in (a) MPEG4 and (b) H.264 decoder.

TABLE IV
Comparison of Energy Consumption for SnowMnt at Four Temperature Levels

Decoder      Temp (°C)   RT-C-DIST [19]   DT-CM-DIST [28]   RT-CM-DIST (Proposed)
MPEG4 dec.   25          1.15             0.94              0.86
MPEG4 dec.   50          1.10             0.92              0.84
MPEG4 dec.   75          1.02             0.90              0.81
MPEG4 dec.   100         0.96             0.89              0.79
H.264 dec.   25          1.06             1.02              0.91
H.264 dec.   50          1.05             1.02              0.88
H.264 dec.   75          1.03             1.03              0.84
H.264 dec.   100         1.01             1.03              0.80

the joint PDF distribution of computational cycle and memory stall time. As a consequence, it sets frequency levels higher than required levels, as shown in Fig. 9.

In Table III, compared with DT-CM-DIST, which exploits runtime distributions of both computational and memory stall workload in design time, RT-CM-DIST provides 2.1–10.2% and 1.2–18.1% further energy savings for MPEG4 and H.264

TABLE V
Comparison of Energy Savings for DarkKnight at 75 °C

Decoder      RT-C-DIST [19]   DT-CM-DIST [28]   RT-CM-DIST (Proposed)
MPEG4 dec.   1.26             1.20              0.89
H.264 dec.   1.16             1.16              0.89

TABLE VI
Summary of Runtime Overhead

Source of Runtime Overhead          Amount
Local-optimal workload prediction   40 400–52 400 cycles
Coordination                        2720–4780 cycles
Feasibility check                   407–3560 cycles

decoder, respectively. The largest energy savings can be obtained at SnowMnt for both MPEG4 and H.264 decoder, which

has distinctive program phase behavior. Since DT-CM-DIST

finds the optimal workload using the first 100 frames (designtime fixed training input), which is totally different from that

of the remaining frames (runtime-varying input), it cannot

provide proper voltage and frequency setting.

To further investigate the effectiveness of considering complex program phase behavior, we performed another experiment using 3000 frames of the movie clip. Program phase behavior is more obviously observed in the movie clip, whose scenes are fast moving. Table V shows normalized energy consumption at 75 °C when decoding the movie clip from Dark Knight in MPEG4 and H.264 decoder, respectively. RT-CM-DIST outperforms DT-CM-DIST by up to 26.3% and 23.3% for MPEG4 and H.264 decoder, respectively. This is because, in the movie clip, complex program phase behavior exists due to frequent scene changes, as Fig. 1(d) shows.

C. Overhead

1) Runtime Overhead: We measured the runtime overhead

of the proposed online method, i.e., RT-CM-DIST, using

PAPI [31]. The proposed method consists of three parts:

local-optimal workload prediction, coordination, and feasibility check. Table VI shows the runtime overhead of the

proposed method. The local-optimal workload prediction of a program region consumes 40 400–52 400 clock cycles when PHASE UNIT is set to 20 frames. Note that the local-optimal workload prediction is performed at every PHASE UNIT. The coordination and feasibility check, which are performed at every start of program region, take 2720–4780 and 407–3560 clock cycles, respectively. The total runtime overhead in Table VI amounts to 0.38% and 0.25% of the average execution cycles in the case of MPEG4 and H.264 decoder, respectively.

2) Memory Overhead of LUTs: As explained in Section VI-B, the presented method requires three temperature-independent LUTs, i.e., LUTscomp, LUTsstall, and LUTb, and two temperature-dependent LUTs, i.e., LUTlcomp and LUTlstall.

The LUTs incur memory overhead. The memory overhead

largely depends on the number of steps (scales) in the indexes

of the LUTs. The more steps are used, the more accurate the workload prediction, at the cost of a higher memory overhead. The LUTs are indexed with the ratio of standard deviation to mean (Index 1) ranging from 0.05 to 0.30 with 0.05 step size, with skewness (Index 2) ranging from −1.00 to 1.00 with 0.10 step size, and with the ratio of mean to the effective remaining workload of the remaining program regions (Index 3) ranging from 0.10 to 1.00 with 0.10 step size. Therefore, 1140 (= 19 × 6 × 10) entries are required for each LUT, where 8 bits are assigned to each entry. Thus, about 1 kB of memory space is required for each LUT. The area overhead can be further reduced by trimming and compressing entries. LUTs are built for the four temperatures used in our experiment, i.e., 25, 50, 75, and 100 °C. The total area overhead amounts to 11 kB [= (3 × 1 kB) + 4 × (2 × 1 kB)].
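The memory overhead figures above can be rechecked with a few lines of arithmetic. The following is a minimal sketch, not part of the proposed method: the step counts per index (6, 19, and 10) and the 8-bit entry width are taken from the text, while the function and variable names are hypothetical and chosen only for illustration.

```python
# Step counts per LUT index as stated in the text (assumed names, not from the paper).
STEPS = {
    "index1_std_over_mean": 6,         # 0.05 to 0.30, step 0.05
    "index2_skewness": 19,             # count stated in the text
    "index3_mean_over_remaining": 10,  # 0.10 to 1.00, step 0.10
}
BITS_PER_ENTRY = 8


def lut_entries(steps):
    """Number of entries in one LUT: product of the step counts of its indexes."""
    n = 1
    for count in steps.values():
        n *= count
    return n


def lut_bytes(steps, bits_per_entry=BITS_PER_ENTRY):
    """Size of one LUT in bytes."""
    return lut_entries(steps) * bits_per_entry // 8


def total_luts(n_temp_independent=3, n_temp_dependent=2, n_temperatures=4):
    """Temperature-dependent LUTs are replicated once per temperature."""
    return n_temp_independent + n_temp_dependent * n_temperatures


entries = lut_entries(STEPS)
per_lut = lut_bytes(STEPS)
n_luts = total_luts()
print(f"{entries} entries per LUT, {per_lut} bytes per LUT")
print(f"{n_luts} LUTs in total, {n_luts * per_lut} bytes overall")
```

Each LUT holds 1140 one-byte entries, which the text rounds to about 1 kB, and 3 + 4 × 2 = 11 LUTs are needed in total, which matches the 11 kB estimate under that rounding.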

X. Conclusion

In this paper, we presented a novel online DVFS method which exploits the distribution of both computational workload and memory stall workload during program runs in combined Vdd/Vbb scaling. To reduce the complexity of our previous design-time solution [28], we presented a DVFS method consisting of two steps: local-optimal workload prediction and coordination. In the local-optimal workload prediction step, we periodically calculated five local-optimal workload predictions, each of which minimizes a single energy component under the joint PDF of computational cycles and memory stall time, which is profiled during runtime. To further reduce the runtime overhead, we prepared tables which are pre-characterized at design time based on the analytical formulation. During runtime, we utilized them to find the local-optimal workloads. In the coordination step, we found the global workload prediction by coordinating the five local-optimal workload predictions. Experimental results show that the proposed method offers up to 34.6% and 17.3% energy savings for the MPEG4 and H.264 decoders, respectively, compared with the existing method [4].

References

[1] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, "Memory access scheduling," in Proc. ISCA, 2000, pp. 128–138.

[2] K. Govil, E. Chan, and H. Wasserman, "Comparing algorithms for dynamic speed-setting of a low-power CPU," in Proc. MOBICOM, 1995, pp. 13–25.

[3] Y. Gu and S. Chakraborty, "Control theory-based DVS for interactive 3-D games," in Proc. DAC, 2008, pp. 740–745.

[4] K. Choi, R. Soma, and M. Pedram, "Fine-grained dynamic voltage and frequency scaling for precise energy and performance tradeoff based on the ratio of off-chip access to on-chip computation times," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 1, pp. 18–28, Jan. 2005.

[5] W.-Y. Liang, S.-C. Chen, Y.-L. Chang, and J.-P. Fang, "Memory-aware dynamic voltage and frequency prediction for portable devices," in Proc. RTCSA, 2008, pp. 229–236.

[6] G. Dhiman and T. S. Rosing, "System-level power management using online learning," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no. 5, pp. 676–689, May 2009.

[7] A. Azevedo, I. Issenin, R. Cornea, R. Gupta, N. Dutt, A. Veidenbaum, and A. Nicolau, "Profile-based dynamic voltage scheduling using program checkpoints," in Proc. DATE, 2002, pp. 168–175.

[8] D. Shin and J. Kim, "Optimizing intra-task voltage scheduling using data flow analysis," in Proc. ASPDAC, 2005, pp. 703–708.

[9] J. Seo, T. Kim, and J. Lee, "Optimal intratask dynamic voltage-scaling technique and its practical extensions," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 25, no. 1, pp. 47–57, Jan. 2006.

[10] S. Hong, S. Yoo, H. Jin, K.-M. Choi, J.-T. Kong, and S.-K. Eo, "Runtime distribution-aware dynamic voltage scaling," in Proc. ICCAD, 2006, pp. 587–594.

[11] S. Hong, S. Yoo, B. Bin, K.-M. Choi, S.-K. Eo, and T. Kim, "Dynamic voltage scaling of supply and body bias exploiting software runtime distribution," in Proc. DATE, 2008, pp. 242–247.

[12] J. R. Lorch and A. J. Smith, "Improving dynamic voltage scaling algorithm with PACE," ACM SIGMETRICS Perform. Eval. Rev., vol. 29, no. 1, pp. 50–61, Jun. 2001.

[13] C. Xian and Y.-H. Lu, "Dynamic voltage scaling for multitasking real-time systems with uncertain execution time," in Proc. GLSVLSI, 2006, pp. 392–397.

[14] T. Sherwood, E. Perelman, G. Hamerly, S. Sair, and B. Calder, "Discovering and exploiting program phases," IEEE Micro, vol. 23, no. 6, pp. 84–93, Nov. 2003.

[15] T. Sherwood, S. Sair, and B. Calder, "Phase tracking and prediction," in Proc. ISCA, 2003, pp. 336–347.

[16] Q. Wu, M. Martonosi, D. W. Clark, V. J. Reddi, D. Connors, Y. Wu, J. Lee, and D. Brooks, "A dynamic compilation framework for controlling microprocessor energy and performance," in Proc. IEEE MICRO, 2005, pp. 271–282.

[17] C. Isci, G. Contreras, and M. Martonosi, "Live, runtime phase monitoring and prediction on real systems with application to dynamic power management," in Proc. MICRO, 2006, pp. 359–370.

[18] S.-Y. Bang, K. Bang, S. Yoon, and E.-Y. Chung, "Run-time adaptive workload estimation for dynamic voltage scaling," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no. 9, pp. 1334–1347, Sep. 2009.

[19] J. Kim, S. Yoo, and C.-M. Kyung, "Program phase and runtime distribution-aware online DVFS for combined Vdd/Vbb scaling," in Proc. DATE, 2009, pp. 417–422.

[20] T. Mudge, K. Flautner, D. Blaauw, and S. M. Martin, "Combined dynamic voltage scaling and adaptive body biasing for lower power microprocessors under dynamic workloads," in Proc. ICCAD, 2002, pp. 721–725.

[21] W. Liao, L. He, and K. M. Lepak, "Temperature and supply voltage aware performance and power modeling at microarchitecture level," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 7, pp. 1042–1053, Jul. 2005.

[22] Cacti5.3 [Online]. Available: http://www.hpl.hp.com/research/cacti

[23] BPTM High-k/Metal Gate 32 nm High Performance Model [Online]. Available: http://www.eas.asu.edu/ptm

[24] K. Puttaswamy and G. H. Loh, "Thermal herding: Microarchitecture techniques for controlling hotspots in high-performance 3-D-integrated processors," in Proc. HPCA, 2007, pp. 193–204.

[25] N. Kavvadias, P. Neofotistos, S. Nikolaidis, C. A. Kosmatopoulos, and T. Laopoulos, "Measurement analysis of the software-related power consumption in microprocessors," IEEE Trans. Instrum. Meas., vol. 53, no. 4, pp. 1106–1112, Aug. 2004.

[26] S. Oh, J. Kim, S. Kim, and C.-M. Kyung, "Task partitioning algorithm for intra-task dynamic voltage scaling," in Proc. ISCAS, 2008, pp. 1228–1231.

[27] J. Kim, S. Oh, S. Yoo, and C.-M. Kyung, "An analytical dynamic scaling of supply voltage and body bias based on parallelism-aware workload and runtime distribution," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no. 4, pp. 568–581, Apr. 2009.

[28] J. Kim, Y. Lee, S. Yoo, and C.-M. Kyung, "An analytical dynamic scaling of supply voltage and body bias exploiting memory stall time variation," in Proc. ASPDAC, 2010, pp. 575–580.

[29] FFMPEG [Online]. Available: http://www.ffmpeg.org

[30] VQEG [Online]. Available: ftp://vqeg.its.bldrdoc.gov

[31] PAPI [Online]. Available: http://icl.cs.utk.edu/papi

Jungsoo Kim received the B.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2005, and completed the unified M.S. and Ph.D. course at the Department of Electrical Engineering and Computer Science, KAIST, in 2010.

Since 2010, he has been in a post-doctoral position with KAIST. His current research interests include dynamic power and thermal management, multiprocessor system-on-a-chip design, and low-power wireless surveillance system design.

Sungjoo Yoo received the B.S., M.S., and Ph.D. degrees in electronics engineering from Seoul National University, Seoul, South Korea, in 1992, 1995, and 2000, respectively.

He was a Researcher with the TIMA Laboratory, Grenoble, France, from 2000 to 2004, and was a Senior and Principal Engineer with Samsung Electronics, Seoul, from 2004 to 2008. Since 2008, he has been with the Pohang University of Science and Technology, Pohang, South Korea. His current research interests include dynamic power and thermal management, on-chip networks, multithreaded software and architecture, and fault tolerance of solid-state disks.

Chong-Min Kyung received the B.S. degree from Seoul National University, Seoul, South Korea,

in 1975, and the M.S. and Ph.D. degrees in electrical

engineering from the Korea Advanced Institute of

Science and Technology (KAIST), Daejeon, South

Korea, in 1977 and 1981, respectively.

From April 1981 to January 1983, he was with Bell

Telephone Laboratories, Murray Hill, NJ, in a postdoctoral position. Since he joined KAIST in 1983,

he has been working on system-on-a-chip design

and verification methodology, processor, and graphics architectures for high-speed and/or low-power applications, including mobile video codec. He was

a Visiting Professor with the University of Karlsruhe, Karlsruhe, Germany,

in 1989, as an Alexander von Humboldt Fellow, a Visiting Professor with

the University of Tokyo, Tokyo, Japan, from January 1985 to February

1985, a Visiting Professor with the Technical University of Munich, Munich,

Germany, from July 1994 to August 1994, with Waseda University, Tokyo,

from 2002 to 2005, with the University of Auckland, Auckland, New Zealand,

from February 2004 to February 2005, and with Chuo University, Tokyo, from

July 2005 to August 2005.

Dr. Kyung is the Director of the Integrated Circuit Design Education Center,

Daejeon, established in 1995 to promote the integrated circuit (IC) design

education in Korean universities through computer-aided design environment

setup, and chip fabrication services. He is the Director of the SoC Initiative for Ubiquity and Mobility Research Center established to promote

academia/industry collaboration in the SoC design-related area. From 1993 to

1994, he served as an Asian Representative in the International Conference

on Computer-Aided Design Executive Committee. He received the Most

Excellent Design Award, and the Special Feature Award from the University

Design Contest in the ASP-DAC 1997 and 1998, respectively. He received the

Best Paper Awards at the 36th DAC, New Orleans, LA, the 10th International

Conference on Signal Processing Application and Technology, Orlando, FL, in

September 1999, and the 1999 International Conference on Computer Design,

Austin, TX. He was the General Chair of the Asian Solid-State Circuits

Conference 2007, and ASP-DAC 2008. In 2000, he received the National

Medal from the Korean Government for his contribution to research and

education in the IC design. He is a member of the National Academy of

Engineering Korea and the Korean Academy of Science and Technology. He

is a Hynix Chair Professor with KAIST.
