2006-504
Scheduling Algorithms for Grid Computing:
State of the Art and Open Problems
Fangpeng Dong and Selim G. Akl
School of Computing,
Queen's University
Kingston, Ontario
January 2006
Abstract:
Thanks to advances in wide-area network technologies and the low cost of computing
resources, Grid computing came into being and is currently an active research area. One
motivation of Grid computing is to aggregate the power of widely distributed resources,
and provide non-trivial services to users. To achieve this goal, an efficient Grid scheduling
system is an essential part of the Grid. Rather than covering the whole Grid scheduling
area, this survey provides a review of the subject mainly from the perspective of
scheduling algorithms. In this review, the challenges for Grid scheduling are identified.
First, the architecture of components involved in scheduling is briefly introduced to
provide an intuitive image of the Grid scheduling process. Then various Grid scheduling
algorithms are discussed from different points of view, such as static vs. dynamic policies,
objective functions, applications models, adaptation, QoS constraints, strategies dealing
with dynamic behavior of resources, and so on. Based on a comprehensive understanding
of the challenges and the state of the art of current research, some general issues worthy of
further exploration are proposed.
1. Introduction
The popularity of the Internet and the availability of powerful computers and
high-speed networks as low-cost commodity components are changing the way we use
computers today. These technical opportunities have led to the possibility of using
geographically distributed and multi-owner resources to solve large-scale problems in
science, engineering, and commerce. Recent research on these topics has led to the
emergence of a new paradigm known as Grid computing [19].
To achieve the promising potentials of tremendous distributed resources, effective and
efficient scheduling algorithms are fundamentally important. Unfortunately, scheduling
algorithms in traditional parallel and distributed systems, which usually run on
homogeneous and dedicated resources, e.g. computer clusters, cannot work well in the new
circumstances [12]. In this paper, the state of current research on scheduling algorithms for
the new generation of computational environments will be surveyed and open problems
will be discussed.
The remainder of this paper is organized as follows. An overview of the Grid
scheduling problem is presented in Section 2 with a generalized scheduling architecture. In
Section 3, the progress made to date in the design and analysis of scheduling algorithms
for Grid computing is reviewed. A summary and some research opportunities are offered in
Section 4.
2. Overview of the Grid Scheduling Problem
A computational Grid is a hardware and software infrastructure that provides
dependable, consistent, pervasive, and inexpensive access to high-end computational
capabilities [45]. It is a shared environment implemented via the deployment of a
persistent, standards-based service infrastructure that supports the creation of, and resource
sharing within, distributed communities. Resources can be computers, storage space,
instruments, software applications, and data, all connected through the Internet and a
middleware software layer that provides basic services for security, monitoring, resource
management, and so forth. Resources owned by various administrative organizations are
shared under locally defined policies that specify what is shared, who is allowed to access
what, and under what conditions [48]. The real and specific problem that underlies the Grid
concept is coordinated resource sharing and problem solving in dynamic,
multi-institutional virtual organizations [44].
From the point of view of scheduling systems, a higher level abstraction for the Grid
can be applied by ignoring some infrastructure components such as authentication,
authorization, resource discovery and access control. Thus, in this paper, the following
definition for the term Grid is adopted: A type of parallel and distributed system that
enables the sharing, selection, and aggregation of geographically distributed autonomous
and heterogeneous resources dynamically at runtime depending on their availability,
capability, performance, cost, and users' quality-of-service requirements [10].
To facilitate the discussion, the following frequently used terms are defined:
A task is an atomic unit to be scheduled by the scheduler and assigned to a
resource.
The properties of a task are parameters like CPU/memory requirement, deadline,
priority, etc.
A job (or metatask, or application) is a set of atomic tasks that will be carried out
on a set of resources. Jobs can have a recursive structure, meaning that jobs are
composed of sub-jobs and/or tasks, and sub-jobs can themselves be decomposed
further into atomic tasks. In this paper, the terms job, application, and metatask are
used interchangeably.
A resource is something that is required to carry out an operation, for example: a
processor for data processing, a data storage device, or a network link for data
transporting.
A site (or node) is an autonomous entity composed of one or multiple resources.
Task scheduling is the mapping of tasks to a selected group of resources, which
may be distributed in multiple administrative domains.
2.1 The Grid Scheduling Process and Components
A Grid is a system of high diversity, created by various applications,
middleware components, and resources. From the point of view of functionality, however,
we can still identify a logical architecture of the task scheduling subsystem in a Grid. For example,
Zhu [123] proposes a common Grid scheduling architecture. We can also generalize the
scheduling process in the Grid into three stages: resource discovering and filtering,
resource selecting and scheduling according to certain objectives, and job submission [94].
As a study of scheduling algorithms is our primary concern here, we focus on the second
step. Based on these observations, Fig. 1 depicts a model of a Grid scheduling system in
which functional components are connected by two types of data flow: resource or
application information flow and task or task scheduling command flow.
Fig. 1: A logical Grid scheduling architecture: broken lines show resource or application
information flows and solid lines show task or task scheduling command flows.
Basically, a Grid scheduler (GS) receives applications from Grid users, selects feasible
resources for these applications according to acquired information from the Grid
Information Service module, and finally generates application-to-resource mappings, based
on certain objective functions and predicted resource performance. Unlike their
counterparts in traditional parallel and distributed systems, Grid schedulers usually cannot
control Grid resources directly, but work like brokers or agents [13], or are even tightly
coupled with the applications as the application-level scheduling scheme proposes [11],
[105]. They are not necessarily located in the same domain as the resources that are
visible to them. Fig. 1 only shows one Grid scheduler, but in reality multiple such
schedulers might be deployed, and organized to form different structures (centralized,
hierarchical and decentralized [55]) according to different concerns, such as performance
or scalability. Although a Grid-level scheduler (or Metascheduler as it is sometimes referred
to in the literature, e.g., in [77]) is not an indispensable component in the Grid
infrastructure (e.g., it is not included in the Globus Toolkit [125], the de facto standard in
the Grid computing community), there is no doubt that such a scheduling component is
crucial for harnessing the potential of Grids as they are expanding quickly, incorporating
resources from supercomputers to desktops. Our discussion on scheduling algorithms is
based on the assumption that there are such schedulers in a Grid.
Information about the status of available resources is very important for a Grid
scheduler to make a proper schedule, especially when the heterogeneous and dynamic
nature of the Grid is taken into account. The role of the Grid Information Service (GIS) is
to provide such information to Grid schedulers. The GIS is responsible for collecting and
predicting the resource state information, such as CPU capacities, memory size, network
bandwidth, software availabilities and load of a site in a particular period. GIS can answer
queries for resource information or push information to subscribers. The Globus
Monitoring and Discovery System (MDS) [33] is an example of GIS.
Besides raw resource information from GIS, application properties (e.g., approximate
instruction quantity, memory and storage requirements, subtask dependency in a job and
communication volumes) and performance of a resource for different application species
are also necessary for making a feasible schedule. Application profiling (AP) is used to
extract properties of applications, while analogical benchmarking (AB) provides a measure
of how well a resource can perform a given type of job [67] [96]. On the basis of
knowledge from AP and AB, and following a certain performance model [14], cost
estimation computes the cost of candidate schedules, from which the scheduler chooses
those that can optimize the objective functions.
The Launching and Monitoring (LM) module (also known as the binder [31])
implements a finally-determined schedule by submitting applications to selected resources,
staging input data and executables if necessary, and monitoring the execution of the
applications. A good example of an LM is the Globus GRAM (Grid Resource Allocation
and Management) [32].
A Local Resource Manager (LRM) is mainly responsible for two jobs: local scheduling
inside a resource domain, where not only jobs from exterior Grid users, but also jobs from
the domain's local users are executed, and reporting resource information to GIS. Within a
domain, one or multiple local schedulers run with locally specified resource management
policies. Examples of such local schedulers include OpenPBS [126] and Condor [127].
An LRM also collects local resource information with tools such as the Network Weather
Service [111], Hawkeye [127] and Ganglia [89], and reports the resource status information
to GIS.
2.2 Challenges of Scheduling Algorithms in Grid Computing
Scheduling algorithms have been intensively studied as a basic problem in traditional
parallel and distributed systems, such as symmetric multiprocessor machines (SMP).
Fig. 2: A taxonomy relating typical scheduling algorithms and Grid scheduling algorithms.
Although we can look for inspiration in previous research, traditional scheduling
models generally produce poor Grid schedules in practice. The reason can be found by
going through the assumptions underlying traditional systems [14]:
All resources reside within a single administrative domain.
To provide a single system image, the scheduler controls all of the resources.
The resource pool is invariant.
Contention caused by incoming applications can be managed by the scheduler
according to some policies, so that its impact on the performance that the site can
provide to each application can be well predicted.
Computations and their data reside in the same site or data staging is a highly
predictable process, usually from a predetermined source to a predetermined
destination, which can be viewed as a constant overhead.
Unfortunately, none of these assumptions holds in Grid circumstances. In Grid
computing, many unique characteristics make the design of scheduling algorithms more
challenging [123], as explained in what follows.
Heterogeneity and Autonomy
Although heterogeneity was a challenge for scheduling algorithms even before the emergence
of Grid computing, it is still far from fully addressed and remains a big challenge for scheduling
algorithm design and analysis. In Grid computing, because resources are distributed in
multiple domains in the Internet, not only the computational and storage nodes but also the
underlying networks connecting them are heterogeneous. This heterogeneity results in
different capabilities for job processing and data access.
In traditional parallel and distributed systems, the computational resources are usually
managed by a single control point. The scheduler not only has full information about all
running/pending tasks and resource utilization, but also manages the task queue and
resource pool. Thus it can easily predict the behaviours of resources, and is able to assign
tasks to resources according to certain performance requirements. In a Grid, however,
resources are usually autonomous and the Grid scheduler does not have full control of the
resources. It cannot violate local policies of resources, which makes it hard for the Grid
scheduler to estimate the exact cost of executing a task on different sites. The autonomy
also results in the diversity in local resource management and access control policies, such
as, for example, the priority settings for different applications and the resource reservation
methods. Thus, a Grid scheduler is required to be adaptive to different local policies. The
heterogeneity and autonomy on the Grid user side are represented by various parameters,
including application types, resource requirements, performance models, and optimization
objectives. In this situation, new concepts such as application-level scheduling and Grid
economy [20] are proposed and applied for Grid scheduling.
Performance Dynamism
Making a feasible schedule usually depends on estimates of the performance that
candidate resources can provide, especially when the algorithms are static. Grid schedulers
work in a dynamic environment where the performance of available resources is constantly
changing. The change comes from site autonomy and the competition by applications for
resources. Because of resource autonomy, usually Grid resources are not dedicated to a
Grid application. For example, a Grid job submitted remotely to a computer cluster might
be interrupted by a cluster's internal job which has a higher priority; new resources may
join which can provide better services; or some other resources may become unavailable.
The same problem happens to networks connecting Grid resources: the available
bandwidth can be heavily affected by Internet traffic flows that are unrelated to Grid
jobs. For a Grid application, this kind of contention results in performance fluctuation,
which makes it a hard job to evaluate the Grid scheduling performance under classic
performance models. From the point of view of job scheduling, performance fluctuation
might be the most important characteristic of Grid computing compared with traditional
systems. A feasible scheduling algorithm should be able to adapt to such dynamic
behaviors. Some other measures are also provided to mitigate the impact of this problem,
such as QoS negotiation, resource reservation (provided by the underlying resource
management system) and rescheduling. We discuss algorithms related to these mechanisms
in Section 3.
Local vs. Global
At the highest level, a distinction is drawn between local and global scheduling. The
local scheduling discipline determines how the processes resident on a single CPU are
allocated and executed; a global scheduling policy uses information about the system to
allocate processes to multiple processors to optimize a system-wide performance objective.
Obviously, Grid scheduling falls into the global scheduling branch.
In dynamic scheduling, decisions to assign a task to a selected resource are made at run
time. Since the cost for an assignment
is not available, a natural way to keep the whole system healthy is to balance the loads of all
resources. The advantage of dynamic load balancing over static scheduling is that the
system need not be aware of the run-time behavior of the application before execution. It is
particularly useful in a system where the primary performance goal is maximizing resource
utilization, rather than minimizing runtime for individual jobs [64]. If a resource is
assigned too many tasks, it may invoke a balancing policy to decide whether to transfer
some tasks to other resources, and which tasks to transfer. According to who will initiate
the balancing process, there are two different approaches: sender-initiated where a node
that receives a new task but doesn't want to run the task initiates the task transfer, and
receiver-initiated where a node that is willing to receive a new task initiates the process
[95]. According to how the dynamic load balancing is achieved, there are four basic
approaches [42]:
Unconstrained First-In-First-Out (FIFO, also known as First-Come-First-Served)
Balance-constrained techniques
Cost-constrained techniques
Hybrids of static and dynamic techniques
Unconstrained FIFO: In the unconstrained FIFO approach, the resource with the
currently shortest waiting queue or the smallest waiting queue time is selected for the
incoming task. This policy is also called opportunistic load balancing (OLB) [76] or
myopic algorithm. The major advantage of this approach is its simplicity, but it is often far
from optimal.
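The OLB policy described above can be sketched in a few lines. The task identifiers and runtime estimates below are illustrative assumptions; OLB itself only looks at which queue finishes earliest, never at per-resource execution-time differences.

```python
import heapq

def olb_schedule(tasks, num_resources):
    """Opportunistic Load Balancing (unconstrained FIFO): each incoming
    task is sent to the resource whose waiting queue currently finishes
    earliest. tasks is a list of (task_id, estimated_runtime) pairs,
    processed in arrival (FIFO) order."""
    # Min-heap of (queue_finish_time, resource_id).
    queues = [(0.0, r) for r in range(num_resources)]
    heapq.heapify(queues)
    assignment = {}
    for task_id, runtime in tasks:
        finish, r = heapq.heappop(queues)   # shortest waiting queue
        assignment[task_id] = r
        heapq.heappush(queues, (finish + runtime, r))
    return assignment

# Example: the long task lands on resource 0; the two short ones share resource 1.
# olb_schedule([("t1", 4), ("t2", 2), ("t3", 1)], 2) -> {"t1": 0, "t2": 1, "t3": 1}
```

Note how t3 is queued behind t2 even though resource 0 would execute it sooner once free; this is exactly the myopia the text refers to.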
Balance-constrained: The balance-constrained approach attempts to rebalance the loads
on all resources by periodically shifting waiting tasks from one waiting queue to another.
In a massive system such as the Grid, this could be very costly due to the considerable
communication delay. So some adaptive local rebalancing heuristic can be applied. For
example, tasks are initially distributed to all resources, and then, instead of computing the
global rebalance, the rebalancing only happens inside a neighborhood where all
resources are better interconnected. This approach has several advantages: the initial loads
can be distributed to all resources and started quickly; the rebalancing process is
distributed and scalable; and the communication delay of rebalancing can be reduced since
task shifting only happens among the resources that are close to each other. A similar
algorithm is proposed and evaluated in [28].
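One round of such local rebalancing might be sketched as follows. The abstract load units and the neighbor topology are hypothetical illustrations, not taken from [28]: each resource compares itself with its neighborhood average and sheds any surplus to its least loaded neighbor.

```python
def neighborhood_rebalance(loads, neighbors):
    """One round of local (neighborhood) rebalancing. loads maps a
    resource id to its current load; neighbors maps a resource id to
    the list of well-connected resources it may shift tasks to."""
    new_loads = dict(loads)
    for r, nbrs in neighbors.items():
        group = [r] + list(nbrs)
        avg = sum(new_loads[g] for g in group) / len(group)
        if new_loads[r] > avg:
            # Shift the surplus to the least loaded neighbor only,
            # keeping communication local.
            target = min(nbrs, key=lambda n: new_loads[n])
            surplus = new_loads[r] - avg
            new_loads[r] -= surplus
            new_loads[target] += surplus
    return new_loads

# Example on a 3-node line topology 0 - 1 - 2:
# neighborhood_rebalance({0: 10, 1: 2, 2: 0},
#                        {0: [1], 1: [0, 2], 2: [1]})
# -> {0: 6.0, 1: 4.0, 2: 2.0}
```

Repeated rounds diffuse load across the system without any global exchange, which is the scalability argument made above.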
Cost-constrained: An improved approach to balance-constrained scheduling is
cost-constrained scheduling. This approach not only considers the balance among
resources but also the communication cost between tasks. Instead of doing a task exchange
periodically, tasks will be checked before their move. If the communication cost brought
by a task shift is greater than the decrease in execution time, the task will not move;
otherwise, it will be moved. This approach is more efficient than the previous one when
the communication costs among resources are heterogeneous and the communication cost
to execute the application is the main consideration. It is also flexible, and can be used
with other cost factors such as seeking lowest memory size or lowest disc drive activity,
and so on. In [68], the authors propose a similar scheduling policy, but the objective of the
rescheduling phase in their case is not for load balancing and cost optimizing, but rather to
make running jobs release required resources for pending jobs.
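The move test at the heart of the cost-constrained approach can be sketched as below. The execution-time and transfer-cost estimates are assumed inputs (e.g. from profiling); the function itself just applies the rule stated above: move only if the saving exceeds the communication cost.

```python
def best_move(task_exec, current, transfer_cost):
    """Cost-constrained shift check for one waiting task.
    task_exec[r]      : estimated execution time of the task on resource r
    current           : resource the task currently waits on
    transfer_cost[r]  : estimated cost of moving the task to resource r
    Returns the destination with the largest positive net gain, or None
    if every candidate move costs more than it saves."""
    best, best_gain = None, 0.0
    for r, exec_r in enumerate(task_exec):
        if r == current:
            continue
        gain = (task_exec[current] - exec_r) - transfer_cost[r]
        if gain > best_gain:
            best, best_gain = r, gain
    return best

# Example: moving to resource 1 saves 4 time units for a transfer cost
# of 3 (net +1), beating resource 2 (net +0.5):
# best_move([10, 6, 9], 0, [0, 3, 0.5]) -> 1
```

The same structure accommodates the other cost factors mentioned above (memory footprint, disk activity) by changing what `gain` measures.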
Hybrid: A further improvement is the static-dynamic hybrid scheduling. The main idea
behind hybrid techniques is to take the advantages of static schedule and at the same time
capture uncertain behaviors of applications and resources. For the scenario of an
application with uncertain behavior, static scheduling is applied to those parts that always
execute. At run time, scheduling is done using statically computed estimates that reduce
run-time overhead. That is, static scheduling is done on the always-executed-tasks, and
dynamic scheduling on others. For example, in those cases where there are special QoS
requirements in some tasks, the static phase can be used to map those tasks with QoS
requirements, and dynamic scheduling can be used for the remaining tasks. For the
scenario of low predictable resource behaviors, static scheduling is used to initiate task
assignment at the beginning and dynamic balancing is activated when the performance
estimate on which the static scheduling is based fails. Spring et al. show an example of this
scenario in [100].
Some other dynamic online scheduling algorithms, such as those in [2] and [77],
consider the case of resource reservation which is popular in Grid computing as a way to
get a degree of certainty in resource performance. Algorithms in these two examples aim to
minimize the makespan of incoming jobs which consist of a set of tasks. Mateescu [77]
uses a resource selector to find a co-reservation for jobs requiring multiple resources.
The job queue is managed in a FIFO fashion with dynamic priority correction. If
co-reservation fails for a job in a scheduling cycle, the job's priority will be promoted in
the queue for the next scheduling round. The resource selector ranks candidate resources
by their number of processors and memory size. Aggarwal's method [2] is introduced in
subsection 3.3.
Grid users typically care about the performance of their particular application, while
resource providers usually pay more attention to the
performance of their resources, for example the resource utilization in a particular period.
Thus objective functions can be classified into two categories: application-centric and
resource-centric [123]. Fig. 3 shows the objective functions we will meet in our following
discussion.
Application-Centric
Scheduling algorithms adopting an application-centric scheduling objective function
aim to optimize the performance of each individual application, as application-level
schedulers do. Most current Grid applications' concerns are about time, for example the
makespan, which is the time spent from the beginning of the first task in a job to the end of
the last task of the job. Makespan is one of the most popular measures of
scheduling algorithms and many examples given in the following discussion adopt it. As
economic models [18] [19] [43] [124] are introduced into Grid computing, the economic
cost that an application needs to pay for resource utilization becomes a concern of some
Grid users. This objective function is widely adopted by Grid economic models, which
are mainly discussed in Subsection 3.6. Besides these simple functions, many applications
use compound objective functions, for example, some want both shorter execution time
and lower economic costs. The primary difficulty facing the adoption of this kind of
objective function lies in the normalization of two different measurements: time and
money. Such situations make scheduling in the Grid much more complicated. It is required
that Grid schedulers be adaptive enough to deal with such compound missions. At the
same time, the development of the Grid infrastructure has shown a service-oriented
tendency [49], so the quality of services (QoS) becomes a big concern of many Grid
applications in such a non-dedicated dynamic environment. The meaning of QoS is highly
dependent on particular applications, from hardware capacity to software existence.
Usually, QoS is a constraint imposed on the scheduling process instead of the final
objective function. The involvement of QoS usually affects the resource selection
step in the scheduling process, and then influences the final objective optimization. Such
scenarios will be discussed in 3.7.
Resource-Centric
Scheduling algorithms adopting resource-centric scheduling objective functions aim to
optimize the performance of the resources. Resource-centric objectives are usually related
to resource utilization, for example, throughput which is the ability of a resource to process
a certain number of jobs in a given period; utilization, which is the percentage of time a
resource is busy. Low utilization means a resource is idle and wasted. For a multiprocessor
resource, utilization differences among processors also describe the load imbalance of the
system, which decreases the throughput. Condor is a well-known system adopting throughput
as the scheduling objective [112]. As economic models are introduced into Grid computing,
economic profit (which is the economic benefits resource providers can get by attracting
Grid users to submit applications to their resources) also comes under the purview of
resource management polices.
In the Grid computing environments, due to autonomy both in Grid users and resource
providers, application-centric and resource-centric objectives are often at odds.
Legion [26] provides a methodology allowing each group to express their desires, and acts
as a mediator to find a resource allocation that is acceptable to both parties through a
flexible, modular approach to scheduling support.
The objective functions mentioned above were widely adopted before the emergence of
Grid computing, and many efforts have been made to find approximation algorithms [27]
[30] [59] [70] or to obtain good results heuristically. But these works make a common
assumption, namely, that the resources are dedicated so that they can provide constant
performance to an application. But as we have emphasized in Section 2, this assumption
does not hold in Grid computing. This violation weakens the previous results. For example,
assume an optimal schedule with makespan OPT can be found, if the resources involved
are stable. If the Grid resources are suddenly slowed down at time OPT for some reason
(interrupted by resources' local jobs, network contention, or the like) and the slow speed
situation continues for a long period, then the makespan of the actual schedule is far from
OPT and cannot be bounded by scheduling algorithms that cannot predict the performance
changing. So, if the objective function of a schedule is makespan, and there is no bound
either for resource performance or for the time period of the change, in other words, if we
cannot predict the performance fluctuations, there will be no makespan approximation
algorithm in general that applies to a Grid [51].
To escape from this predicament, a novel scheduling criterion for Grid computing is
proposed in [51]: total processor cycle consumption (TPCC), which is the total number of
instructions the Grid could compute from the starting time of executing the schedule to the
completion time. TPCC represents the total computing power consumed by an application.
In this new criterion, the length of a task is the number of instructions in the task; the speed
of a processor is the number of instructions computed per unit time; and processors in a
Grid are heterogeneous so they have various speeds. In addition, the speed of each
processor varies over time due to the contention expected in an open system. Let s_{p,t} be
the speed of processor p during the time interval [t, t+1), where t is a non-negative integer.
Without loss of generality, it can be assumed that the speed of each processor does not
vary during the interval [t, t+1), for every t, by adopting an interval as short as the unit
time. It is also assumed that the value of any s_{p,t} cannot be known in advance. Fig. 4(a)
shows an example of a set of tasks. Fig. 4(b) shows an example of processor speed
distribution.
Fig. 4: A new Grid scheduling criterion: TPCC [50].
Let T be a set of n independent tasks with the same length L. Let S be a schedule of T
in a Grid with m processors. Let M be the makespan of S. The speed of processor p during
the time interval [t, t+1) is s_{p,t}. Then, the TPCC of S is defined as:

TPCC(S) = \sum_{p=1}^{m} \sum_{t=0}^{\lfloor M \rfloor - 1} s_{p,t} + \sum_{p=1}^{m} (M - \lfloor M \rfloor) \, s_{p,\lfloor M \rfloor}.
So the TPCC of the schedule in Fig. 4(c) is 21 + 45 + 38 + 7×(2/5) + 6×(2/5) + 5×(2/5) =
111.2.
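Given the per-interval speeds s_{p,t} and the makespan M, TPCC is straightforward to compute. The speed matrix in the example below is made up for illustration; it is not the instance of Fig. 4.

```python
import math

def tpcc(speeds, makespan):
    """Total Processor Cycle Consumption of a schedule with makespan M.
    speeds[p][t] is s_{p,t}, the speed (instructions per unit time) of
    processor p during [t, t+1). Implements
        sum_p sum_{t=0}^{floor(M)-1} s_{p,t}
      + sum_p (M - floor(M)) * s_{p,floor(M)}."""
    whole = math.floor(makespan)
    frac = makespan - whole
    total = 0.0
    for row in speeds:
        total += sum(row[t] for t in range(whole))   # full unit intervals
        if frac > 0:
            total += frac * row[whole]               # trailing partial interval
    return total

# Two processors, makespan 1.5: processor 0 contributes 2 + 0.5*2 = 3,
# processor 1 contributes 3 + 0.5*1 = 3.5, so TPCC = 6.5.
# tpcc([[2, 2], [3, 1]], 1.5) -> 6.5
```

Since the instruction count of a job is fixed, comparing schedules by TPCC is insensitive to when each processor happened to be fast or slow, which is the point made above.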
The advantage of TPCC is that it is little affected by the variance of resource
performance, yet it remains related to the makespan. Since the total number of instructions
needed to run a job is constant, approximation algorithms based on this criterion can be
developed. In [50], a (1 + m(log_e(m − 1) + 1)/n)-approximation algorithm is given for
scheduling n coarse-grained independent tasks in the Grid, with a similar bound for related
settings. Such criterion-based algorithms offer an alternative to
approaches which are sensitive to performance prediction errors, for Grid environments in
which resource availabilities change over time.
Fig. 6: The parallel workflow of a divide-and-conquer application.
Application Adaptation: To achieve high performance, application-level schedulers in the
Grid (e.g. AppLeS [13]) are usually tightly integrated with the application itself and are not
easily applied to other applications. As a result, each scheduler is application-specific.
Noticing this limitation, Dail et al. [34] explicitly decouple the scheduler core (the
searching procedure introduced at the beginning of this subsection) from
application-specific (e.g. performance models) and platform-specific (e.g. resource
information collection) components used by the core. The key feature to implement the
decoupling (while still keeping awareness of application characteristics) is that application
characteristics are recorded and/or discovered by components such as a specialized
compiler and Grid-enabled libraries. These application characteristics are communicated
via well-defined interfaces to schedulers so that schedulers can be general-purpose, while
still providing services that are appropriate to the application at hand. Aggarwal et al. [2]
consider another case that applications in Grid computing often meet, namely, resource
reservation, and develop a generalized Grid scheduling algorithm that can efficiently
assign jobs having arbitrary inter-dependency constraints and arbitrary processing
durations to resources having prior reservations. Their algorithm also takes into account
arbitrary delays in transfer of data from the parent tasks to the child tasks. In fact, this is a
heuristic list algorithm which we will discuss in the next subsection. In [113], Wu et al. give
a very good example of how a self-adaptive scheduling algorithm cooperates with
long-term resource performance prediction [54] [105]. Their algorithm is adaptive to
indivisible single sequential jobs, jobs that can be partitioned into independent parallel
tasks, and jobs that have a set of indivisible tasks. When the prediction error of the system
utilization reaches a threshold, the scheduler tries to reallocate tasks.
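The trigger behind such self-adaptive rescheduling can be sketched as a simple relative-error check. The 20% threshold and the utilization figures below are illustrative assumptions, not values from [113]:

```python
def should_reschedule(predicted_util, observed_util, threshold=0.2):
    """Return True when the relative error of the system-utilization
    prediction reaches the threshold, signalling that tasks should be
    reallocated. predicted_util must be positive."""
    error = abs(observed_util - predicted_util) / max(predicted_util, 1e-9)
    return error >= threshold

# A 40% relative error triggers reallocation; a 10% error does not:
# should_reschedule(0.5, 0.7)  -> True
# should_reschedule(0.5, 0.55) -> False
```

Keeping the trigger cheap matters, since it is evaluated continuously against long-term performance predictions [54] [105].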
3.4 Task Dependency of an Application
When the relations among tasks in a Grid application are considered, a common
dichotomy used is dependency vs. independence. Usually, dependency means there are
precedence orders among tasks, that is, a task cannot start until all its parents are done.
Dependency has a crucial impact on the design of scheduling algorithms, so in this
subsection, algorithms are discussed by following the same dichotomy as shown in Fig. 7.
Fig. 7: Task dependency taxonomy of Grid scheduling algorithms.
3.4.1 Independent Task Scheduling
As a set of independent tasks arrives, from the system's point of view, a common strategy
is to assign them according to the load of resources in order to achieve high system
throughput. This approach was discussed under the dynamic branch in Subsection 3.1.
From the point of view of applications, some static heuristic algorithms based on execution
cost estimate can be applied [17].
MET (Minimum Execution Time): MET assigns each task, in an arbitrary order, to the
resource with the minimum expected execution time for that task, regardless of that
resource's availability. This can cause a severe load imbalance among machines. Even worse, this heuristic is not
applicable to heterogeneous computing environments where resources and tasks are
characterized as consistent, which means a machine that can run a task faster will run all
the other tasks faster.
MCT (Minimum Completion Time): MCT assigns each task, in an arbitrary order, to
the resource with the minimum expected completion time for that task. This causes some
tasks to be assigned to machines that do not have the minimum execution time for them.
The intuition behind MCT is to combine the benefits of opportunistic load balancing (OLB)
and MET, while avoiding the circumstances in which OLB and MET perform poorly.
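MCT can be sketched directly from the description above. The execution-time matrix is an assumed estimate (e.g. produced by AP/AB-style profiling), and arrival order is taken as given:

```python
def mct_schedule(exec_time, num_resources):
    """Minimum Completion Time: each task, in arrival order, goes to
    the resource where it would finish earliest, i.e. the resource's
    current ready time plus the task's execution time there.
    exec_time[i][r] is an estimated run time of task i on resource r.
    Returns the chosen resource index for each task."""
    ready = [0.0] * num_resources
    assignment = []
    for times in exec_time:
        r = min(range(num_resources), key=lambda j: ready[j] + times[j])
        assignment.append(r)
        ready[r] += times[r]
    return assignment

# Example: task 2 runs far faster on resource 1 (1 vs 10), and MCT
# sends it there even though resource 1 is already busy:
# mct_schedule([[3, 5], [3, 5], [10, 1]], 2) -> [0, 1, 1]
```

In the example, task 1 is placed on its slower machine (5 vs 3) because resource 0 is occupied, which is exactly the behavior the text notes: MCT trades per-task execution time for earlier completion.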
Min-min: The Min-min heuristic begins with the set U of all unmapped tasks. Then,
the set M of minimum completion times, one for each task in U, is found. Next, the task with the
overall minimum completion time in M is selected and assigned to the corresponding
machine (hence the name Min-min). Last, the newly mapped task is removed from U, and
the process repeats until all tasks are mapped (i.e., U is empty). Min-min is based on the
minimum completion time, as is MCT. However, Min-min considers all unmapped tasks
during each mapping decision and MCT only considers one task at a time. Min-min maps
the tasks in the order that changes the machine availability status by the least amount that
any assignment could. Let ti be the first task mapped by Min-min onto an empty system.
The machine that finishes ti the earliest, say mj, is also the machine that executes ti the
fastest. For every task that Min-min maps after ti, the Min-min heuristic changes the
availability status of mj by the least possible amount for every assignment. Therefore, the
percentage of tasks assigned to their first choice (on the basis of execution time) is likely to
be higher for Min-min than for Max-min (see below). The expectation is that a smaller
makespan can be obtained if more tasks are assigned to the machines that complete them
the earliest and also execute them the fastest.
Max-min: The Max-min heuristic is very similar to Min-min. It also begins with the
set U of all unmapped tasks. Then, the set M of minimum completion times is found. Next,
the task with the overall maximum completion time in M is selected and assigned to the
corresponding machine (hence the name Max-min). Last, the newly mapped task is removed from U, and
the process repeats until all tasks are mapped (i.e., U is empty). Intuitively, Max-min
attempts to minimize the penalties incurred from performing tasks with longer execution
times. Assume, for example, that the job being mapped has many tasks with very short
execution times and one task with a very long execution time. Mapping the task with the
longer execution time to its best machine first allows this task to be executed concurrently
with the remaining tasks (with shorter execution times). For this case, this would be a
better mapping than a Min-min mapping, where all of the shorter tasks would execute first,
and then the longer running task would be executed while several machines sit idle. Thus,
in cases similar to this example, the Max-min heuristic may give a mapping with a more
balanced load across machines and a better makespan.
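Since the two heuristics differ only in the selection rule, one sketch can capture both; this is a hypothetical Python illustration (the ETC-matrix layout is our assumption), and the test scenario below mirrors the long-task example just described, where Max-min yields the smaller makespan.

```python
def minmin_maxmin(etc, ready, flavor="min"):
    """Min-min / Max-min sketch (the two differ only in the selection rule).
    etc[t][m]: expected execution time of task t on machine m.
    ready[m]: time at which machine m becomes free.
    Returns the assignment order and the resulting makespan."""
    unmapped = set(range(len(etc)))
    order = []
    while unmapped:
        # best machine (minimum completion time) for every unmapped task
        best = {t: min(range(len(ready)), key=lambda m: ready[m] + etc[t][m])
                for t in unmapped}
        # Min-min picks the task with the overall minimum completion time,
        # Max-min the one with the overall maximum
        select = min if flavor == "min" else max
        t = select(unmapped, key=lambda t: ready[best[t]] + etc[t][best[t]])
        m = best[t]
        ready[m] += etc[t][m]
        unmapped.remove(t)
        order.append((t, m))
    return order, max(ready)
```

With one long task (cost 10) and two short ones on two machines, Min-min schedules the short tasks first and ends with makespan 12, while Max-min places the long task first and finishes in 10.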
Min-min and Max-min algorithms are simple and can be easily amended to adapt to
different scenarios. For example, in [57], a QoS Guided Min-min heuristic is presented
which can guarantee the QoS requirements of particular tasks and minimize the makespan
at the same time. Wu, Shu and Zhang [115] gave a Segmented Min-min algorithm, in
which tasks are first ordered by the expected completion time (it could be the maximum
ECT, minimum ECT or average ECT on all of the resources), then the ordered sequence is
segmented, and finally Min-min is applied to each segment. The segmentation improves
the performance of typical Min-min when task lengths differ dramatically, by giving
longer tasks a chance to execute earlier than they would under typical Min-min.
XSuffrage: Another popular heuristic for independent scheduling is called Suffrage
[76]. The rationale behind Suffrage is that a task should be assigned to the resource on
which it would suffer the most if it were not assigned there. For each task, its suffrage
value is defined as the difference between its best MCT and its second-best MCT. Tasks
with high suffrage values take precedence. But when tasks have input and output data, and
resources are clustered, conventional suffrage algorithms may have problems. In this case,
intuitively, tasks should be assigned to the resources as near as possible to the data source
to reduce the makespan. But if the resources are clustered and nodes in the same cluster
have nearly identical performance, then the best and second-best MCTs are also nearly
identical, which makes the suffrage value close to zero and gives the task low priority.
Other tasks might then be assigned to these nodes, pushing the task away from its data
source. To fix this problem, Casanova et al propose an improvement called XSuffrage in [23]
which gives a cluster level suffrage value to each task. Experiments show that XSuffrage
outperforms the conventional Suffrage not only in the case where large data files are
needed, but also when the resource information cannot be predicted very accurately.
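A sketch of the plain Suffrage heuristic may help clarify the suffrage value; this is a hypothetical Python illustration (the ETC-matrix layout is our assumption), and the cluster-level refinement of XSuffrage is not shown.

```python
def suffrage_schedule(etc, ready):
    """Suffrage sketch: repeatedly map the task that would lose the most,
    i.e. whose gap between second-best and best completion time is largest.
    etc[t][m]: expected execution time; ready[m]: machine availability.
    Returns the makespan.  Needs at least two machines."""
    unmapped = set(range(len(etc)))
    while unmapped:
        best_t, best_m, best_sv = None, None, -1
        for t in unmapped:
            cts = sorted(ready[m] + etc[t][m] for m in range(len(ready)))
            sv = cts[1] - cts[0]  # suffrage value
            if sv > best_sv:
                best_sv, best_t = sv, t
                best_m = min(range(len(ready)),
                             key=lambda m: ready[m] + etc[t][m])
        ready[best_m] += etc[best_t][best_m]
        unmapped.remove(best_t)
    return max(ready)
```

Note how the cluster problem described above shows up here: on nodes with identical `etc` rows the two smallest completion times coincide, so `sv` is zero and the task loses its precedence.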
Task Grouping: The above algorithms are usually used to schedule applications that
consist of a set of independent coarse-grained compute-intensive tasks. This is the ideal
case for which the computational Grid was designed. But there are other cases in which
applications consist of a large number of lightweight jobs. The overall processing of
these applications involves a high overhead cost in terms of scheduling and transmission to
or from Grid resources. Muthuvelu et al [79] propose a dynamic task grouping scheduling
algorithm to deal with these cases. Once a set of fine grained tasks are received, the
scheduler groups them according to their requirements for computation (measured in
number of instructions) and the processing capability that a Grid resource can provide in a
certain time period. All tasks in the same group are submitted to the same resource which
can finish them all in the given time. By this means, the overhead of scheduling and job
launching is reduced and resource utilization is increased.
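The grouping step can be sketched as simple greedy bin-filling; this is a hypothetical Python illustration, and the parameter names (`mips`, `granularity`) are our assumptions, not terminology from [79].

```python
def group_tasks(task_lengths, mips, granularity):
    """Greedy task grouping: pack fine-grained tasks into groups whose total
    instruction count fits what one resource can process in `granularity`
    seconds.  task_lengths: instructions per task; mips: resource speed
    in instructions per second."""
    capacity = mips * granularity  # instructions one group may hold
    groups, current, used = [], [], 0
    for t in task_lengths:
        if current and used + t > capacity:
            groups.append(current)  # group full: submit it, start a new one
            current, used = [], 0
        current.append(t)
        used += t
    if current:
        groups.append(current)
    return groups
```

Each resulting group is then submitted to a single resource, so the per-task scheduling and launching overhead is paid once per group instead of once per task.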
The Problem of Heterogeneity: In heterogeneous environments, the performance of
the above algorithms is also affected by the degree of heterogeneity of the tasks and the
resources, as well as the consistency of the tasks' estimated completion times on different
machines. The study in [76] shows that no single algorithm has a permanent advantage in
all scenarios. This result clearly leads to the conclusion that if the highest possible
performance is wanted in a computational Grid, the scheduler should be able to adapt to
different application/resource heterogeneities.
Task Replication: Another strategy used to cope with dynamic resource performance is
task replication: when resources are available, running tasks are replicated until a
predefined maximum number of replicas is reached. When a task replica finishes, the other
replicas are cancelled. In this approach, performance is improved in situations where tasks
are assigned to slow or busy hosts, because when a task is replicated there is a greater
chance that some replica is assigned to a fast or idle host. Another advantage of this
scheme is that it increases immunity to performance fluctuation, since the probability that
all sites change at once is much smaller than that of a single site.
3.4.2 Dependent Task Scheduling
When tasks composing a job have precedence orders, a popular model applied is the
directed acyclic graph (DAG), in which a node represents a task and a directed edge
denotes the precedence orders between its two vertices. In some cases, weights can be
added to nodes and edges to express computational costs and communicating costs
respectively. As Grid computing infrastructures become more mature and powerful,
support for complicated workflow applications, which can usually be modeled by DAGs, is
provided. Examples of such tools include Condor DAGMan [127], CoG [74],
Pegasus [37], GridFlow [21] and ASKALON [110]. A comprehensive survey of these
systems is given in [121], while we continue our focus on their scheduling algorithm
components in what follows.
In Pegasus [37], scheduling algorithms are pluggable; among the candidate schedules
generated, the one selected is the one that has the minimal makespan. Considering the
dynamism of the Grid, instead of submitting the whole task graph at once, Pegasus applies
a workflow partition method that submits layer-partitioned subgraphs iteratively. But, as
shown in the discussion below, layered partitioning may fail to exploit the locality of
independent tasks and, as a result, produce bad schedules, especially when the DAG is
unbalanced. This weakness is also demonstrated in [110].
Similar to Pegasus, ICENI [78] also adopts pluggable algorithms for abstract
workflow to concrete workflow mapping, and in [119], random, best of n random,
simulated annealing and game theory algorithms are tested. The latter two algorithms will
be discussed in Subsection 3.6.
In GridFlow [21], workflow scheduling is conducted hierarchically by a global Grid
workflow manager and a local Grid sub-workflow scheduler. The global workflow
manager receives requests from users with the workflow description in XML, and then
simulates workflow execution to find a near-optimal schedule in terms of makespan. The
simulation is done by polling local Grid schedulers which can estimate the finish time of
sub-workflows on their local sites. A fuzzy timing technique is used to get the estimate,
and the possibility of conflict on a shared resource among tasks from different
sub-workflows is considered. The advantage of fuzzy functions is that they can be
computed very fast and are suitable for scheduling of time-critical Grid applications,
although they do not necessarily provide the best scheduling solution. GridFlow also
provides rescheduling functionality when the real execution falls too far behind the
estimate.
We see from the above discussion that most current efforts have been directed towards
supporting workflows at the programming level, thus providing potential opportunities for
algorithms designers (as they allow scheduling algorithms to be plugged in). As Grid
computing inherits problems from traditional systems, a natural question to ask is what can
be learned from the extensive studies on DAG scheduling algorithms in heterogeneous
computing? A complete survey of these algorithms is beyond the scope of this paper, but
some ideas and common examples are discussed below to show the problems we are still
confronted with in the Grid.
o List Heuristics
The basic idea of list scheduling is to give tasks priorities, sort them into a list, and
schedule them in that order; the main concerns are how to parallelize those tasks having no
precedence orders in the graph and how to make the time cost along the critical path in the
DAG as small as possible. Many list heuristics have been proposed, and some recent
examples can be found in [107], [91] and [83], along with comparisons of these algorithms
with older ones.
Two Classic Examples
HEFT: Topcuoglu et al [107] present a heuristic called Heterogeneous
Earliest-Finish-Time (HEFT) algorithm. The HEFT algorithm selects the task with the
highest upward rank (the upward rank of a node is defined as the maximum distance from
that node to the exit node, including computational and communication costs) at
each step. The selected task is then assigned to the processor which minimizes its earliest
finish time with an insertion-based approach which considers the possible insertion of a
task in an earliest idle time slot between two already-scheduled tasks on the same resource.
The time complexity of HEFT is O(ep), where e is the number of edges and p is the
number of resources.
HEFT is perhaps one of the most frequently cited list algorithms aiming to reduce the
makespan of a DAG. For example, it is tested in ASKALON and compared
with a genetic algorithm and a myopic algorithm [110], and the results show its
effectiveness in the Grid scenarios, especially when the task graph is unbalanced.
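The upward rank that drives HEFT's ordering can be computed by a short memoized recursion; this is a Python sketch under the assumption that mean computation and communication costs have already been estimated (data layout is ours).

```python
def upward_rank(succ, w_mean, c_mean):
    """Upward ranks as used by HEFT:
    rank(i) = w_mean[i] + max over successors j of (c_mean[(i, j)] + rank(j)),
    with rank = w_mean at exit nodes (empty max defaults to 0).
    succ: task -> list of successors; w_mean: mean computation cost per task;
    c_mean: mean communication cost per edge."""
    memo = {}
    def rank(i):
        if i not in memo:
            memo[i] = w_mean[i] + max(
                (c_mean[(i, j)] + rank(j) for j in succ.get(i, [])),
                default=0)
        return memo[i]
    return {i: rank(i) for i in w_mean}
```

HEFT then considers tasks in decreasing rank order, which guarantees every task is scheduled after all of its predecessors.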
FCP: Radulescu et al [83] present another list heuristic called Fast Critical Path (FCP),
intending to reduce the complexity of the list heuristics while maintaining the scheduling
performance at the same time. The motivation of FCP is based on the following
observation regarding the complexity of list heuristics. Basically, a list heuristic has the
following procedures: the O(e + v) time ranking phase, the O(v log v) time ordering phase,
and finally the O((e + v) p) time resource selecting phase, where e is the number of edges,
v is the number of tasks and p is the number of resources. Usually the third term is larger
than the second term. The FCP algorithm does not sort all the tasks at the beginning but
maintains only a limited number of tasks sorted at any given time. Instead of considering
all processors as possible targets for a given task, the choice is restricted to either the
processor from which the last message to the given task arrives or the processor that
becomes idle the earliest. As a result, the time complexity is reduced to O(v log p + e).
The Problem of Heterogeneity:
A critical issue in list heuristics for DAGs is how to compute a node's rank. In a
heterogeneous environment, the execution time of the same task will differ on different
resources as well as the communication cost via different network links. So for a particular
node, its rank will also be different if it is assigned to different resources. The problem is
how to choose the proper value used to make the ordering decision. These values could be
the mean value (like the original HEFT in [107]), the median value [62], the worst value,
the best value and so on. But Zhao et al [122] have shown that different choices can affect
the performance of list heuristics such as HEFT dramatically (the makespan can change by
47.2% for certain graphs). Motivated by this observation, Sakellariou et al [91] give a
Hybrid algorithm which is less sensitive to the choice of ranking approach. In this
algorithm, tasks are ranked upward and sorted in decreasing order. The sorted tasks are
then grouped along the sorted sequence so that the tasks in each group are mutually
independent. Finally, each group can be assigned to resources using heuristics for
independent tasks.
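The grouping step of the Hybrid heuristic can be sketched as a single scan over the rank-sorted tasks; this is a hypothetical Python illustration (data layout is our assumption). Because tasks are visited in decreasing upward-rank order, any predecessor of a task appears earlier in the scan, so checking predecessors against the current group is sufficient.

```python
def independent_groups(sorted_tasks, preds):
    """Grouping step of the Hybrid heuristic (sketch).
    sorted_tasks: tasks in decreasing upward-rank order.
    preds: task -> set of immediate predecessors.
    A new group is cut whenever a task has a predecessor in the current
    group, so each group contains only mutually independent tasks."""
    groups, current = [], set()
    for t in sorted_tasks:
        if preds.get(t, set()) & current:
            groups.append(current)  # cut here: t depends on the current group
            current = set()
        current.add(t)
    if current:
        groups.append(current)
    return groups
```

Each returned group can then be handed to any independent-task heuristic (Min-min, Suffrage, etc.) from Subsection 3.4.1.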
The above algorithms have explored how the heterogeneity of resources and tasks
impacts the scheduling algorithm, but they only consider the heterogeneity of
computational resources and ignore the heterogeneity of communication links.
Instances of List Heuristics in Grid Computing
Previous research in DAG scheduling algorithms is very helpful when we consider the
same problem in the Grid scenario. For example, a list scheduling algorithm is proposed
in [118], which is similar to the HEFT algorithm but changes the method of computing the
level of a task node, not only including its longest path to an exit node but also taking the
incoming communication cost from its parents into account. In [75], Ma et al propose a
new list algorithm called Extended Dynamic Critical Path (xDCP) which is a Grid-enabled
version of the Dynamic Critical Path (DCP) algorithm, applied in a homogenous
environment. The idea behind DCP is continuously shortening the critical path in the task
graph, by scheduling tasks on the current critical path to a resource where a task on the
critical path can start earlier. The xDCP algorithm was proposed for scheduling
parameter-sweep applications in a heterogeneous Grid. The improvements include: (i)
initially distributing tasks across multiple resources when scheduling begins instead of
keeping them on one node, (ii) using the finish time instead of the start time to rank task
nodes, to adapt to heterogeneous resources, and (iii) multiple rounds of scheduling to
improve the current schedule instead of scheduling only once. The complexity of xDCP is
O(v^3), the same as that of DCP.
o Duplication Based Algorithms
An alternative way to shorten the makespan is to duplicate tasks on different resources.
The main idea behind duplication based scheduling is utilizing resource idle time to
duplicate predecessor tasks. This may avoid the transfer of results from a predecessor to a
successor, thus reducing the communication cost; duplication effectively trades idle
computing capacity for less communication.
Duplication based algorithms differ according to the task selection strategies for
duplication. Originally, algorithms in this group were usually designed for an unbounded
number of identical processors, such as distributed-memory multiprocessor systems, and
they have higher complexity than the algorithms discussed above. For example, Darbha et
al [35] present such an algorithm named TDS (Task Duplication-based Scheduling
Algorithm) for distributed-memory machines with a complexity of O(v^2). (Note that this
complexity is for homogeneous environments.)
TDS: In [35], for each node in a DAG, its earliest start time (est), earliest completion
time (ect), latest allowable start time (last), latest allowable completion time (lact), and
favorite predecessor (fpred) should be computed. The last of a task is the latest time at
which it can start; otherwise, successors of this task will be delayed (that is, their est will
be violated). The favorite predecessors of a node i are those predecessors such that, if i is
assigned to the same processor on which they run, est(i) is minimized. The level value of a
node (which denotes the length of the longest path from
that node to an exit node (also known as sink node), ignoring the communicating cost
along that path) is used as the priority to determine the processing order of each task. To
compute these values, the whole DAG of the job will be traversed, and the complexity
needed for this step is O(e + v). Based on these values, task clusters are created iteratively.
The clustering step is like a depth-first search from an unassigned node having the lowest
level value to an entry node. Once an entry node is reached, a cluster is generated and tasks
in the same cluster will be assigned to the same resource. In this step, the last and lact
values are used to determine whether duplication is needed. For example, if j is a favorite
predecessor of i and (last(i) - lact(j)) < c(j,i), where c(j,i) is the communication cost
between j and i, then i will be assigned to the same processor as j, and if j has been
assigned to another processor, it will be duplicated onto i's processor. In the clustering
step, the DAG is
traversed similarly to a depth-first search from the exit node, and the complexity of this
step is the same as that of a general graph search, which is also O(v + e). So the overall
complexity is O(v + e). In a dense DAG, the number of edges is proportional to v^2, which
gives the worst-case complexity O(v^2) of the duplication algorithm. Note that in the
clustering step, the number of resources available is assumed never to be smaller than
required; that is, the number of resources is unbounded.
TANH: To exploit the duplication idea in heterogeneous environments, a new
duplication-based algorithm called TANH (Task duplication-based scheduling Algorithm
for Network of Heterogeneous systems) is proposed in [72].
o Clustering Based Algorithms
DSC: Yang et al [117] propose a clustering heuristic called Dominant Sequence Clustering
(DSC) algorithm. The critical path of a scheduled DAG is called Dominant Sequence (DS)
to distinguish it from the critical path of a clustered DAG. The critical path of a clustered
graph is the longest path in that graph, including both non-zero communication edge cost
and task weights in that path. The makespan in executing a clustered DAG is determined
by the Dominant Sequence, not by the critical path of the clustered DAG. Fig. 9(a) shows
the critical path of a clustered graph, which consists of {n1, n2, n7} with a length of 9. Fig.
9(b) is a schedule of this clustered graph, and Fig. 9(c) gives the DS of the scheduled task
graph, which consists of {n1, n3, n4, n5, n6, n7} with a length of 10 [53].
Fig. 9: (a) The clustered DAG and its CP shown in thick arrows. (b) The Gantt chart of a schedule.
(c) The scheduled DAG and the DS shown in thick arrows [53].
In the DSC algorithm, task priorities are dynamically computed as the sum of their top
level and bottom level. The top level and bottom level are the sum of the computation and
communication costs along the longest path from the given task to an entry task and an exit
task, respectively. While the bottom level is statically computed at the beginning, the top
level is computed incrementally during the scheduling process. Tasks are scheduled in the
order of their priorities. The current node is an unassigned node with the highest priority.
Since the entry node always has the longest path to the exit node, clustering always begins
with the entry node. The current node is merged with the cluster of one of its predecessors
so that the top level value of this node can be minimized. If all possible merging increases
the top level value, the current node will remain in its own cluster. After the current node is
clustered, priorities of all its successors will be updated. The time complexity of DSC is
O((e + v) log v), in which the O(log v) factor comes from priority updating at each step
using a binary heap, and (e + v) is for graph traversal in the clustering iterations. So for a
dense task graph, the complexity is roughly O(v^2 log v).
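The two levels that define DSC's priorities can be computed by short memoized traversals; this is a hypothetical Python sketch assuming precomputed node weights and edge costs. Note that on a single chain every node lies on the Dominant Sequence, so all priorities tlevel + blevel come out equal.

```python
import functools

def ds_levels(succ, preds, w, c):
    """Top and bottom levels as used by DSC (sketch).
    tlevel(i): longest-path cost from an entry task to i (excluding w[i]);
    blevel(i): longest-path cost from i to an exit task (including w[i]).
    A task's priority is tlevel(i) + blevel(i); Dominant Sequence nodes
    share the maximum priority."""
    @functools.lru_cache(maxsize=None)
    def blevel(i):
        return w[i] + max((c[(i, j)] + blevel(j) for j in succ.get(i, [])),
                          default=0)
    @functools.lru_cache(maxsize=None)
    def tlevel(i):
        return max((tlevel(p) + w[p] + c[(p, i)] for p in preds.get(i, [])),
                   default=0)
    return {i: tlevel(i) for i in w}, {i: blevel(i) for i in w}
```

In the real algorithm only blevel is computed statically up front; tlevel is maintained incrementally as clusters are merged, which this sketch does not model.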
CASS-II: Liou et al [71] present a cluster algorithm called CASS-II which employs a
two-step approach. In the first step, CASS-II computes for each node v a value s(v), which
is the length of the longest path from an entry node to v (excluding the execution time of v).
Thus, s(v) is the start time of v before clustering, and s(v) is 0 if v is an entry node. The
second step is the clustering step. Just like DSC, it consists of a sequence of refinement
steps, where each refinement step creates a new cluster or grows an existing cluster.
Unlike DSC, CASS-II constructs the clusters bottom-up, i.e., starting from the exit nodes.
To construct the clusters, the algorithm computes for each node v a value f(v), which is the
longest path from v to an exit node in the current partially clustered DAG. Let l(v) = f(v) +
s(v). The algorithm uses l(v) to determine whether the node v can be considered for
clustering at the current refinement step. The clustering begins by placing every exit node
in its own cluster, and then goes through a sequence of iterations. In each iteration, it
considers for clustering every node v whose immediate successors have all been clustered.
Node v is merged into the cluster of the successor that gives it the minimum l(v) value,
provided the merge does not increase l(v). CASS-II does not re-compute the critical path in
each refinement step. Therefore, the algorithm has a complexity of O(e + v log v), and
experiments in [71] show it to be 3 to 5 times faster than DSC.
Algorithms for the Post-clustering Phase
In [72], steps after task clustering are studied, which include the cluster merging,
processor assignment and task ordering in local processors. For the cluster merging, three
strategies are compared, namely: load balancing (LB), communication traffic minimization
(CTM), and random (RAND).
o LB: Define the (computational) workload of a cluster as the sum of execution times
of the tasks in the cluster. At each merging step, choose a cluster, C1, that has a
minimum workload among all clusters, and find a cluster, C2, that has a minimum
workload among those clusters which have a communication edge between
themselves and C1. Then the pair of clusters C1 and C2 are merged.
o CTM: Define the (communication) traffic of a pair of clusters (C1, C2) as the sum
of communication times of the edges from C1 to C2 and from C2 to C1. At each
merging step, merge the pair of clusters which have the most traffic.
o RAND: At each refinement step, merge a random pair of clusters.
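Of the three strategies, LB is the simplest to sketch; one merging step might look as follows in Python (the cluster-graph representation is a hypothetical assumption of ours).

```python
def lb_merge_step(workload, edges):
    """One LB merging step: pick the cluster C1 with minimum workload, then
    the lightest cluster C2 that shares a communication edge with C1.
    workload: cluster -> total execution time of its tasks.
    edges: set of cluster pairs connected by a communication edge."""
    c1 = min(workload, key=workload.get)
    neighbours = [c for pair in edges if c1 in pair
                  for c in pair if c != c1]
    c2 = min(neighbours, key=workload.get)
    return c1, c2  # this pair of clusters would be merged
```

CTM differs only in the selection rule (merge the pair with the largest mutual traffic), and RAND picks an arbitrary pair.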
For the processor assignment, a simple heuristic is applied to find a one-to-one
mapping between a cluster and a processor: (1) assign the cluster with the largest total
communication traffic with all other clusters to a processor; (2) choose an unassigned
cluster having the largest communication traffic with an assigned cluster and place it in a
processor closest to its communicating partner; (3) repeat (2) until all clusters have been
assigned to processors.
Experimental results in [71] and [72] indicate that the performance of clustering
heuristics evaluated by makespan depends on the granularity of the tasks of a graph. The
granularity of a task is the ratio of its execution time vs. the overhead incurred when
communicating with other tasks. This result suggests that adaptivity helps the scheduler
provide higher scheduling quality when jobs are highly diverse.
The Problem of Heterogeneity
According to the basic idea of task clustering, cluster heuristics need not consider
heterogeneity of the resources in the clustering phase. But in the following cluster merging
and resource assigning phases, heterogeneity will definitely affect the final performance.
Obviously, the research in [72] does not consider this problem, and to our knowledge no
other research on it has been performed. Cluster heuristics have not yet been adapted to
Grid computing either, where communication is usually costly and the performance of
resources varies over time. Therefore this remains an interesting topic for research in the
Grid computing environment. Another value of the cluster heuristic for Grid scheduling is
its multi-phase nature, which provides more flexibility to the Grid scheduler to employ
different strategies according to the configuration and organization of the underlying
resources.
Another approach to deal with the dynamic problem is applying dynamic algorithms.
In [75], the authors propose a pM-S algorithm which extends a traditional dynamic
Master-Slave scheduling model. In the pM-S algorithm, two queues are used by the master:
the unscheduled queue and the ready queue. Only tasks in the ready queue can be directly
dispatched to slave nodes, and a task in the unscheduled queue can be moved into the
ready queue only when all its parents are already in the ready queue or dispatched. The
dispatching order in the ready queue is based on tasks' priorities. When a task finishes, the
priorities of all its children's ancestors are dynamically promoted.
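A simplified sketch of the two-queue mechanism may help; this is a hypothetical Python illustration (data layout is our assumption), and the dynamic promotion of priorities on task completion is omitted.

```python
import heapq

def pms_dispatch(tasks, preds, priority):
    """Two-queue pM-S sketch: a task leaves the unscheduled queue for the
    ready queue once none of its parents remain unscheduled; the ready
    queue then dispatches tasks in decreasing priority order."""
    unscheduled = set(tasks)
    ready, order = [], []

    def promote():  # move every task whose parents have all left unscheduled
        changed = True
        while changed:
            changed = False
            for t in list(unscheduled):
                if all(p not in unscheduled for p in preds.get(t, [])):
                    unscheduled.remove(t)
                    heapq.heappush(ready, (-priority[t], t))  # max-heap trick
                    changed = True

    promote()
    while ready:
        _, t = heapq.heappop(ready)
        order.append(t)  # dispatch to a slave node
        promote()
    return order
```

Because a parent only needs to be in the ready queue (not finished) for its children to become ready, tasks can flow into the ready queue well before their parents complete.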
In [62], another dynamic algorithm is proposed for scheduling DAGs in a shared
heterogeneous distributed system. Unlike the previous works in which a unique global
scheduler exists, in this work, the authors consider multiple independent schedulers which
have no knowledge about other jobs. So there is the danger of conflicts on resources. This
algorithm is derived from a static list algorithm: the Dynamic Level Scheduling (DLS). In
the original DLS, the dynamic level of a task in a DAG is used to adapt to the
heterogeneity in resources, while in the newly proposed algorithm, the dynamic length of
the queue on each resource is also taken into account when computing a task's level. To
estimate the length of the queue, it is assumed that jobs arrive following a Poisson process.
The research also discusses how to choose the time when the scheduling decision is made
and the time when a task should be put into a resource's local queue in a dynamic system.
Ready-task queuing probabilities, computed from the Poisson assumption, are used to
make these decisions.
All DAG scheduling algorithms we have discussed so far have the same goal:
minimizing the makespan of a task graph. In [120], Yu et al. consider the scheduling
problem in a pay-per-use service Grid. Their objective is to minimize the total cost for
executing a workflow on such a Grid, and at the same time, QoS which is interpreted as a
deadline, is provided. The algorithm first partitions the original DAG into sub-workflows,
each consisting of a sequential set of tasks between two synchronization tasks (nodes at
which at least one sub-workflow starts and/or ends), and assigns a sub-deadline to each
partition using a combination of breadth-first and depth-first search with critical path
analysis. Then for each partition a planning process is applied to
find the optimal mapping for which the cost is the lowest and the deadline is met. The
optimal search is modeled by a Markov Decision Process and is implemented by using a
dynamic programming algorithm. Rescheduling is also provided in this work when a
sub-workflow misses its sub-deadline, but only the sub-workflow's children are
rescheduled, to reduce the rescheduling cost. A similar problem is also considered by
Sample et al in [92]. They adopt scheduling procedures similar to Yu's. The
scheduling begins with the selection of an initial schedule based on service providers'
estimates of completion time and fee. If the certainty of the completion time and cost
drops below a threshold, which is usually caused by performance fluctuation, rescheduling
is carried out. To find an initial schedule, the scheduler requests bids from resource
providers for the services they can provide. The bid request is based on the service needed,
the expected start time for the service, and information about the size and complexity of
the input parameters to the service. The scheduler's decision as to which bid is finally
accepted is based on the Pareto optimality of the best schedules. For a bid to be Pareto
optimal, there can be no other bid with an absolute advantage in terms of price, time and
certainty of provided performance.
The uncertainty introduced by the dynamism also brings opportunities for the
application of another method in the Grid: data-and-control dependency task graph
(DCG) scheduling [41]. Unlike a DAG, a DCG has two types of edges, denoting data
dependency and control dependency among task nodes respectively. A control-dependency
edge usually represents some condition relying on the results of its starting node, so in a
DCG users can predefine adaptive rules for dynamic adaptation in their task graph. For
example, a rule could duplicate all the successors of a node if its real execution time fails
to match its predicted one. Research in [41] gives an example of how to schedule such
conditional task graphs so as to reduce the final possible makespan. However, no research
that adapts this idea to the Grid environment exists at this time.
3.5 Data Scheduling
In high energy physics, bioinformatics, and other disciplines, there are applications
involving numerous, parallel tasks that both access and generate large data sets, sometimes
in the petabyte range. Data sets in this scale require specialized storage and management
systems and data Grid projects are carried out to harness geographically distributed
resources for such data-intensive problems by providing remote data set storage, access
management, replication services [15], and data transfer protocols [5].
Fig. 10: Taxonomy of algorithms considering data scheduling.
As mentioned in Section 1, one important difference between Grid scheduling and its
traditional counterpart is that, in the latter, the data staging problem usually need not be
considered by scheduling algorithms. This is because the resources on which applications
will run are determined before scheduling so that the data staging cost is a constant. The
only cost relating to data transmission that should be considered is from the data produced
at the run time, e.g., the data dependency in DAGs. In the Grid environment, by contrast,
the location where an application finally is processed is usually selected in real time, so
that the cost to transfer the input data from the storage sites to the processing sites might
vary according to which processing sites are selected as well as which storage sites are
used when the data have multiple replicas. Further, assigning a task to the machine that
gives its best execution time may result in poor performance due to the cost of retrieving
the required input data from data repositories. In [82], Park et al classify models of cost
measured in makespan into five categories, namely 1) Local Data and Local Execution
(here, local means where a job is submitted), 2) Local Data and Remote Execution, 3)
Remote Data and Local Execution, 4) Remote Data and Same Remote Execution and 5)
Remote Data and Different Remote Execution. The algorithms we have discussed in
previous subsections cannot be directly used to solve these problems, although they are
helpful in finding feasible answers. In Fig. 10, the taxonomy of algorithms which consider the
data scheduling problem is given.
On Data Replication
When the scheduling problem with data movement is considered, two cases arise:
data replication is either allowed or it is not. In Pegasus, the CWG assumes that
accessing an existing dataset is always preferable to generating a new one [36]
when it maps an abstract workflow to a concrete one. By contrast, Ranganathan et al [85]
view data sets in the Grid as a tiered system and use dynamic replication strategies to
improve data access. These data replication strategies work at the system level instead of
the application level, aiming to reduce the bandwidth consumption for data transfer and the
workload of data hotspots.
In [86], computation scheduling and data replication strategies are treated as
independent of each other. Four simple dynamic computation scheduling algorithms
(namely, JobRandom, which selects a random site for computation; JobLeastLoaded,
which selects the least loaded site; JobDataPresent, which always selects a site where
the required data is already present; and JobLocal, which always runs a job at the local
site) are combined with three data scheduling algorithms (namely, DataDoNothing, in
which no active replication takes place; DataRandom, which replicates a popular data set
to a random site; and DataLeastLoaded, which selects the least loaded site as the
destination for a new data set replica), giving 4 × 3 = 12 scheduling algorithms in total.
Since they are not tightly coupled with the requirements of applications, these
algorithms clearly serve system-centric purposes.
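The 4 × 3 policy space can be sketched as follows; the site state and the policy bodies are deliberate simplifications of the algorithms named in [86], with all names and numbers invented for illustration:

```python
import itertools
import random

# Hypothetical site state: current load and the data sets replicated there.
sites = {"A": {"load": 3, "data": {"d1"}},
         "B": {"load": 1, "data": set()},
         "C": {"load": 5, "data": {"d1"}}}

rng = random.Random(0)

# Computation scheduling policies: each picks an execution site for a job.
def job_random(job, local):
    return rng.choice(sorted(sites))

def job_least_loaded(job, local):
    return min(sorted(sites), key=lambda s: sites[s]["load"])

def job_data_present(job, local):
    holders = [s for s in sorted(sites) if job["data"] in sites[s]["data"]]
    return (min(holders, key=lambda s: sites[s]["load"])
            if holders else job_least_loaded(job, local))

def job_local(job, local):
    return local

# Data scheduling policies: each decides where (if anywhere) to replicate.
def data_do_nothing(dataset):
    pass

def data_random(dataset):
    sites[rng.choice(sorted(sites))]["data"].add(dataset)

def data_least_loaded(dataset):
    sites[min(sorted(sites), key=lambda s: sites[s]["load"])]["data"].add(dataset)

# Pairing any computation policy with any data policy yields one scheduler.
schedulers = list(itertools.product(
    [job_random, job_least_loaded, job_data_present, job_local],
    [data_do_nothing, data_random, data_least_loaded]))
```

Because the two policy families never consult each other, the product of the two lists directly enumerates the twelve combined schedulers.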
Actually, which resources are used to run the tasks and where the data are accessed are
two inter-related issues, because the choice of the execution site may depend on the
location of the data, and the data site selection may depend on where the computation will
take place. If the data sets involved in the computation are large, one may favor moving the
computation close to the data. On the other hand, if the computation is significant compared
to the cost of data transfer, the compute site could be selected first. Based on this idea,
Alhusaini et al. [4] combine the scheduling of computation with data replica selection to
reduce the makespan of a collection of applications that can be unified into a single DAG.
The costs of accessing data repositories and of data transfer are accounted for when
computing the total time cost of a task in the DAG. A static level-by-level algorithm is used
to search for an optimal schedule: the original DAG is partitioned along its directed edges
into levels such that all tasks within a level are independent, and Min-min and Max-min are
applied to schedule the tasks of each level. However, the paper does not mention how to
find the levels when the task graph is unbalanced, and, as discussed earlier, a level-by-level
partition cannot exploit locality to benefit dependent tasks.
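Under these simplifications (levels found by longest-path depth, inter-level communication ignored), the level-by-level scheme can be sketched as below; the DAG, the ETC table, and all numbers are invented for illustration and are not from [4]:

```python
# Hypothetical DAG (successor lists) and ETC table etc[task][machine].
succ = {"T1": ["T2", "T3"], "T2": ["T4"], "T3": ["T4"], "T4": []}
etc = {"T1": {"M1": 4, "M2": 6}, "T2": {"M1": 3, "M2": 5},
       "T3": {"M1": 8, "M2": 2}, "T4": {"M1": 5, "M2": 4}}

def levels(succ):
    """Partition the DAG into levels by longest-path depth; an edge always
    increases depth, so tasks in the same level are independent."""
    depth = {}
    def d(t):
        if t not in depth:
            preds = [p for p, ss in succ.items() if t in ss]
            depth[t] = 1 + max((d(p) for p in preds), default=-1)
        return depth[t]
    for t in succ:
        d(t)
    grouped = {}
    for t, lv in depth.items():
        grouped.setdefault(lv, []).append(t)
    return [sorted(grouped[lv]) for lv in sorted(grouped)]

def min_min(tasks, ready):
    """Classic Min-min: repeatedly commit the (task, machine) pair with the
    smallest completion time, updating that machine's ready time."""
    order, pending = [], set(tasks)
    while pending:
        task, mach, ct = min(((t, m, ready[m] + etc[t][m])
                              for t in sorted(pending) for m in sorted(ready)),
                             key=lambda x: x[2])
        ready[mach] = ct
        pending.remove(task)
        order.append((task, mach, ct))
    return order

ready = {"M1": 0, "M2": 0}
schedule = []
for level in levels(succ):
    schedule += min_min(level, ready)
```

Note that this naive level-by-level treatment lets T4 finish at time 6 even though its predecessor T2 finishes at time 7: machine ready times are tracked, but data-ready times across levels are not, which illustrates the weakness with dependent tasks noted above.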
In [39], Desprez et al. try to optimize the throughput of a Grid system for independent
jobs, each of which refers to a certain data set, under a constraint on the storage
capacity of each computation site. The algorithm is based on the following assumptions:
data replication is allowed; a job can only refer to data local to its own site, whose storage
capacity is limited; the execution time of a job is linear in the size of the data set to
which it refers; and the numbers of jobs of each kind in the job set follow a fixed proportion.
The objective is to maximize the total size of the job set. An integer solution to a linear
programming problem is used as an approximation, because the problem itself is NP-complete.
By simulation, the authors also verified their method by comparison against the MCT and
Max-min algorithms.
In [109], Venugopal et al. propose a scheduling algorithm considering two kinds of cost:
economic budget and time. The algorithm tries to optimize one of them under a limit on
the other, e.g., spending as small a budget as possible while not missing the deadline.
The incoming applications consist of sets of independent tasks, each of which requires a
computational resource and accesses a number of data sets located on different storage
sites. The algorithm assumes every data set has only one copy in the Grid, so that
resource selection applies only to computational resources, taking into account the
communication costs from data storage sites to the different computation sites as well as
the computation costs. Instead of performing a thorough search in a space whose size is
exponential in the number of data sets requested by a task, the resource selection
procedure simply performs a local search that only guarantees the current mapping is
better than the previous one. In this way, the cost of the search procedure is linear. The
drawback of this strategy is that it is not guaranteed to find a feasible schedule even if
one exists.
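The linear-cost selection for a single task, given that each data set has one fixed storage site, might look like the following sketch; the sites, costs, and neighbourhood rule are illustrative assumptions, not details of [109]:

```python
# Hypothetical per-task execution costs and dataset transfer costs.
compute_cost = {"P1": 5, "P2": 3, "P3": 4}
transfer = {("S1", "P1"): 1, ("S1", "P2"): 6, ("S1", "P3"): 2,
            ("S2", "P1"): 4, ("S2", "P2"): 1, ("S2", "P3"): 3}

def task_cost(site, datasets):
    """Time cost of one task at `site`: execution plus staging every required
    dataset from its single storage site."""
    return compute_cost[site] + sum(transfer[(d, site)] for d in datasets)

def local_search(datasets, sites=("P1", "P2", "P3")):
    """Greedy local search: move to a neighbouring site only if it improves on
    the current mapping; stops at a local (not necessarily global) optimum."""
    current = sites[0]
    improved = True
    while improved:
        improved = False
        for s in sites:
            if task_cost(s, datasets) < task_cost(current, datasets):
                current, improved = s, True
    return current
```

On this toy instance the local optimum happens to be globally best, but in general such a procedure can stop at an inferior local optimum or fail to reach a feasible mapping, which is exactly the drawback noted above.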
3.6 Non-traditional Approaches for Grid Task Scheduling
The Grid is a kind of system made up of a large number of autonomous resource
providers, and game theory, originally developed to solve economic problems, offers a
natural model for it. Each task is modelled as a player whose available strategies are
the resources on which the task could run. The payoff for a player is defined as the sum
of the benefit value for running the task and all communication arriving at that task. Game
theory examines the existence of a stable solution in which players are unable to improve
their payoff by changing their strategy. The solution found is not always optimal for the
collective payoff, but in many cases it is close to optimal.
Economic methods for scheduling problems are very interesting because of their
successes in our daily lives. The models mentioned above can only support relatively
simple Grid scheduling problems such as independent tasks. For more complex
applications, such as those consisting of dependent tasks and requiring cross-site
cooperation, more sophisticated economic models might be needed.
3.6.2 Scheduling Methods Inspired by Nature's Laws
As scheduling is usually a process of finding optimal solutions, several analogies from
natural phenomena have been introduced to form powerful heuristics, which have proven
to be highly successful. Common characteristics of Nature's heuristics are a close
resemblance to a phenomenon existing in nature, non-determinism, the implicit presence
of a parallel structure, and adaptability [1]. Abraham et al. [1] and Braun et al. [17]
present three basic Nature-inspired heuristics for Grid scheduling, namely the
Genetic Algorithm (GA), Simulated Annealing (SA) and Tabu Search (TS), as well as
heuristics derived by combining these three.
Simulated Annealing (SA)
SA is a search technique based on the physical process of annealing, which is the
thermal process of obtaining low-energy crystalline states of a solid. At the beginning, the
temperature is increased to melt the solid. If the temperature is slowly decreased, particles
of the melted solid arrange themselves locally, in a stable ground state of a solid. SA
theory states that if temperature is lowered sufficiently slowly, the solid will reach thermal
equilibrium, which is an optimal state. By analogy, the thermal equilibrium is an optimal
task-machine mapping (optimization goal), the temperature is the total completion time of
a mapping (cost function), and the change of temperature is the process of mapping change.
If the next temperature is higher, which corresponds to a worse mapping, the next state is
accepted with a certain probability. This is because accepting some worse states provides
a way to escape the local optima that often trap local search [73].
An SA algorithm is implemented in [17]. The initial system temperature is the makespan
of the initial (random) mapping. The SA procedure implemented there is as follows.
The first mapping is generated from a uniform random distribution. The mapping is
mutated in the same manner as in the GA, and the new makespan is evaluated. If the new
makespan is better, the new mapping replaces the old one. If the new makespan is worse
(larger), a uniform random number z ∈ [0, 1) is selected. Then, z is compared with y,
where
y = 1 / (1 + e^((old makespan − new makespan) / temperature)).
If z > y, the new (poorer) mapping is accepted; otherwise it is rejected, and the old mapping
is kept. Thus, as the system temperature cools, it becomes more difficult for poorer
solutions to be accepted. This is reasonable because when the temperature is lower, there
is less chance of finding a better solution by starting from a poorer one. After each
mutation, the system temperature is reduced to 90% of its current value. (This percentage
is defined as the cooling rate.) This completes one iteration of SA. The heuristic stops
when there is no change in the makespan for a certain number of iterations or the system
temperature approaches zero. Another example of SA's application in the Grid can be
found in [119].
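A minimal sketch of this SA loop, assuming a two-machine toy ETC instance and the acceptance rule above; the instance data, the idle-iteration stopping limit, and the mutation operator are illustrative assumptions:

```python
import math
import random

def anneal(makespan_of, mutate, initial, cooling_rate=0.9, max_idle=50, seed=1):
    """SA sketch in the spirit of [17]: temperature starts at the makespan of
    the initial mapping and is multiplied by the cooling rate each iteration.
    A worse mapping is accepted when a uniform z in [0,1) exceeds
    y = 1 / (1 + exp((old - new) / temperature))."""
    rng = random.Random(seed)
    current = best = initial
    temp = float(makespan_of(initial))
    idle = 0
    while idle < max_idle and temp > 1e-9:
        cand = mutate(current, rng)
        old, new = makespan_of(current), makespan_of(cand)
        if new < old:
            current, idle = cand, 0
        else:
            y = 1.0 / (1.0 + math.exp((old - new) / temp))
            if rng.random() > y:
                current = cand   # accept a poorer mapping to escape local optima
            idle += 1
        if makespan_of(current) < makespan_of(best):
            best = current
        temp *= cooling_rate
    return best

# Toy instance: etc[task][machine] execution times (hypothetical numbers).
etc = [[4, 7], [6, 3], [5, 5]]

def makespan(mapping):
    loads = [0, 0]
    for task, machine in enumerate(mapping):
        loads[machine] += etc[task][machine]
    return max(loads)

def flip_one(mapping, rng):
    i = rng.randrange(len(mapping))
    out = list(mapping)
    out[i] = 1 - out[i]              # move one task to the other machine
    return tuple(out)

best = anneal(makespan, flip_one, (0, 0, 0))
```

Since old − new is never positive in the else branch, y lies in [0.5, 1) and climbs toward 1 as the temperature falls, so poorer mappings are accepted less and less often, as described above.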
Tabu Search (TS)
TS is a solution-space search that keeps track of the regions of the solution space which
have already been visited so as not to repeat a search near these areas. In the
implementation in [17], a short hop performs a local search around the current mapping,
accepting each change that improves the makespan, until no further local improvement
can be found; the resulting mapping is then added to the tabu list. Next, a long hop
generates a new random mapping that differs from every mapping in the tabu list, in order
to move to a new region of the solution space that has not already been searched. After
each successful long hop, the short hop procedure is repeated. The stopping criterion for
the entire heuristic is when the sum of the total number of successful short hops and
successful long hops equals limithops. Then, the best mapping in the tabu list is the final
answer.
Combined Heuristics
GA can be combined with SA and TS to create combined heuristics. For example,
the Genetic Simulated Annealing (GSA) heuristic [1] is a combination of the GA and SA
techniques. In general, GSA follows procedures similar to the GA outlined above.
However, for the selection process, GSA uses the SA cooling schedule and system
temperature, and a simplified SA decision process for accepting or rejecting a new
chromosome.
These Nature's heuristics have only relatively recently been introduced into the
scheduling area, and more work needs to be done to fit them to a Grid context. There are
many interesting questions [73]. First, the meaning of the controlling parameters in such
heuristics needs to be refined. For example, in SA, the cooling rate is usually set to a
fixed value to control the number of iterations the search process will perform; the
question is how to choose this value, as well as the threshold value y for accepting a
poorer mutation. Second, there is a trade-off between the search cost and the degree of
optimality of the solutions found. For example, in a genetic algorithm, historical
knowledge can be used to guide the chromosome selection, crossover, or mutation
process so that the search converges quickly. But this adjustment seems to contradict the
philosophy of an evolutionary algorithm (randomization and diversity generate better
results), and it may not yield a better solution. Third, what is the effect of the dynamic
nature of the Grid on these algorithms? More practical experiments should be done to
find proper answers to these questions.
3.7 Scheduling under QoS Constraints
In a distributed, heterogeneous, non-dedicated environment, quality of service (QoS) is
a big concern of many applications. The meaning of QoS varies with the concerns of
different users. It could be a requirement on CPU speed, memory size, bandwidth,
software version or deadline. In general, QoS is not the ultimate objective of an
application, but a set of conditions needed to run the application successfully.
Two methods of dealing with the heterogeneity and variability have been proposed
[40]. The guaranteed approach is based on the notion of negotiating and re-negotiating
quality of service contracts with the resource providers. These QoS management
infrastructures assume that some form of service guarantees can be provided by resource
providers. An example of this approach can be found in [25]. The best effort approach
assumes that guarantees cannot be obtained from the system. In these adaptive schemes,
algorithms that enable the applications to adapt themselves to the operating environments
are built into the application. Adaptive examples are demonstrated in [46] and [80].
However, the concerns of current research on QoS, including the examples above, are
at the resource management level rather than at the task/resource scheduling level. Once
QoS is guaranteed by resources or adjusted by users, the rest of the scheduling process
falls back into the traditional mode. Aware of this limitation, He et al. give the first
study [57] that attempts to analyze QoS in Grid task scheduling. QoS information is
embedded into the Min-min scheduling algorithm to achieve a better match between
different levels of QoS request and supply. Tasks can only be mapped to resources
qualified for their QoS requirements, and tasks with higher QoS requirements are mapped
before those with lower requirements. Min-min is used to determine the mapping order
among tasks having the same QoS level. As a first effort, the work in [57] leaves many
open issues: only a one-dimensional QoS (network bandwidth) is considered, and tasks
are assumed independent. How the imposition of QoS on a few tasks affects the
scheduling of other tasks in other application models such as DAGs, and how to optimize
other objectives besides QoS, are interesting research topics.
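The QoS-guided Min-min idea can be sketched as follows, assuming a single discrete QoS level per task and per machine; the task set, QoS levels, and ETC numbers are hypothetical:

```python
# Hypothetical tasks with one QoS level each and per-machine time estimates.
tasks = {"t1": {"qos": 1, "etc": {"m1": 5, "m2": 9}},
         "t2": {"qos": 0, "etc": {"m1": 2, "m2": 3}},
         "t3": {"qos": 1, "etc": {"m1": 4, "m2": 7}}}
machines = {"m1": {"qos": 1}, "m2": {"qos": 0}}

def qos_min_min(tasks, machines):
    """Sketch of the QoS-guided Min-min of [57]: map higher-QoS tasks first,
    restrict each task to machines meeting its QoS level, and use Min-min to
    order the tasks within one QoS level."""
    ready = {m: 0 for m in machines}
    schedule = []
    for level in sorted({t["qos"] for t in tasks.values()}, reverse=True):
        pending = {n for n, t in tasks.items() if t["qos"] == level}
        while pending:
            name, mach, ct = min(
                ((n, m, ready[m] + tasks[n]["etc"][m])
                 for n in sorted(pending) for m in sorted(machines)
                 if machines[m]["qos"] >= tasks[n]["qos"]),
                key=lambda x: x[2])
            ready[mach] = ct
            pending.remove(name)
            schedule.append((name, mach, ct))
    return schedule

schedule = qos_min_min(tasks, machines)
```

In this toy run the high-QoS tasks t1 and t3 compete only for the qualified machine m1, while t2 is mapped afterwards, illustrating how a few QoS-constrained tasks can crowd the qualified resources.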
3.8 Strategies Treating Dynamic Resource Performance
As discussed, one of the most significant difficulties for task scheduling in the Grid is
the dynamic nature of Grid resources. The dynamic changing of resource performance
mainly comes from the autonomy of the sites, and the contention incurred by resource
sharing among a potentially large number of users. Current Grid scheduling systems
adopt three levels of techniques to cope with this dynamism: scheduling based on
just-in-time information from Grid information services (e.g., dynamic algorithms for
load balancing), performance prediction, and dynamic rescheduling at run time, as
Fig. 12 shows.
Fig. 12: Taxonomy of scheduling strategies treating performance dynamism.
Performance Prediction
In dynamic systems, predictions are important for decision making. Good examples of
existing predictions are weather forecasting and stock market analysis. In terms of task
scheduling, almost all static algorithms rely on performance estimates when conducting the
scheduling, and we have seen many such examples above. It is usually easy to obtain a
resource's static information, such as CPU frequency, memory size, network bandwidth,
file systems, etc. But run-time information, such as CPU load, available memory, and
available network bandwidth, is comparatively hard to obtain due to the performance
fluctuation caused by contention among users sharing the resources.
On Prediction Accuracy
When a scheduling decision is made, an accurate prediction on run-time resource
measurement parameters is critical to achieve the ultimate objective. For example,
Takefusa et al [106] present a deadline scheduling algorithm for client-server systems in
the Grid. The estimated service processing time is computed with a parameter determined
by the scheduler's confidence in the prediction accuracy. If the performance prediction is
wrong, resource utilization will drop or services will fail due to missed deadlines. Schopf
et al [93] use stochastic values to represent variable behaviors of resources which can be
obtained from NWS, and assume these values can be adequately represented by using
normal distributions. Based on these assumptions, they define a time-balancing scheduling
strategy for data-parallel applications. The amount of data distributed to each resource is
based on the standard deviation of the predicted completion time. Another example using
prediction accuracy as a parameter during scheduling can be found in [65].
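The time-balancing idea can be sketched in a few lines. Using the mean plus one standard deviation as the conservative per-unit cost is an assumption made here for illustration, not the exact weighting of [93]:

```python
def time_balance(total_work, rates):
    """Distribute `total_work` units of data-parallel work so that all
    resources are predicted to finish together. `rates` holds (mean, stddev)
    of the predicted time per work unit for each resource; a pessimistic
    per-unit cost of mean + stddev is assumed as the balancing weight."""
    cost = [m + s for m, s in rates]       # conservative time per work unit
    speed = [1.0 / c for c in cost]        # effective speed of each resource
    total = sum(speed)
    return [total_work * s / total for s in speed]

# A steady resource and an equally fast but highly variable one: the variable
# resource receives less work, exactly as the standard-deviation weighting
# in [93] intends.
shares = time_balance(90, [(2.0, 0.0), (2.0, 2.0)])
```

With these numbers, both resources are predicted to finish at the same time (60 × 2 = 30 × 4 = 120), so no single slow resource stretches the completion time of the whole data-parallel application.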
Rescheduling
When no resource prediction service is available, or the prediction cannot provide an
accurate forecast, rescheduling, which revises previous scheduling decisions based on
fresh resource status, can be used as a remedy. With the efforts of system designers and
developers, more and more Grid infrastructures now support job migration, which is one
precondition of rescheduling [69] [108]. Algorithms taking rescheduling into
consideration have been mentioned in previous discussions (see Sections 3.3 and 3.4.2);
here, some details are given by example.
GrADS [31] has introduced the concept of a performance contract: an agreement
between a user and the provider of one or more resources. A contract states that when
provided with certain resources with certain capabilities and certain problem parameters,
the application will achieve a specified sustained level of measurable performance. When a
contract violation is detected, a rescheduler will be activated. The rescheduling process
must determine whether a rescheduling is profitable, based on sensor data, estimates of the
remaining work in the application, and cost of moving to new resources. If the
rescheduling appears profitable, the rescheduler will compute a new schedule and contact
rescheduling actuators located on each processor. These actuators use certain mechanisms
to initiate the actual migration. Two rescheduling mechanisms are used in GrADS:
rescheduling by stop and restart (RSR) and rescheduling by processor swapping (RPS).
In the RSR approach, an application is suspended and migrated only when better
resources are found. When a running application is signaled to migrate, all application
processes checkpoint user-specified data and terminate. The rescheduled execution is then
launched by restarting the application on the new set of resources, where the processes
read the checkpoints and continue the execution. Although this approach is flexible, in
practice it can be very expensive: each migration event may involve large data transfers.
Moreover, restarting the application can incur expensive start-up costs, and significant
application modifications may be required for specialized restart code. Therefore, the
RPS approach is introduced. The basic idea of RPS is similar to the duplication
heuristics. The application is launched on more machines than will actually be used for
the computation. Some of these machines become part of the computation (the active
set), while the others initially do nothing (the inactive set). The user's application can
only see the active processes. During execution, the contract monitor periodically checks
the performance of the machines and swaps slower machines in the active set with faster
machines in the inactive set. This approach requires little application modification and
provides an inexpensive fix for many performance problems. However, it is less flexible
than RSR: the processor pool is limited to the original set of machines, and the data
allocation cannot be modified.
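One contract-monitor step of RPS might be sketched as follows; the machine names and the scalar speed metric are hypothetical simplifications of the GrADS monitoring data:

```python
def swap_step(active, inactive, speed):
    """One RPS monitoring step: if the slowest machine in the active set is
    slower than the fastest machine in the inactive set, exchange the two.
    `speed` is an assumed scalar performance measure per machine."""
    slow = min(active, key=lambda p: speed[p])
    fast = max(inactive, key=lambda p: speed[p])
    if speed[fast] > speed[slow]:
        active[active.index(slow)] = fast      # promote the faster spare
        inactive[inactive.index(fast)] = slow  # demote the laggard
    return active, inactive

# Hypothetical pool: "a" has slowed down, spare "c" is faster.
speed = {"a": 1.0, "b": 5.0, "c": 3.0, "d": 0.5}
active, inactive = swap_step(["a", "b"], ["c", "d"], speed)
```

Repeating this step on each monitoring interval keeps the fastest machines of the fixed pool in the active set, which is exactly why RPS is cheap but cannot outgrow the original set of machines.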
In [114], a self-adaptive scheduling algorithm is given to reschedule tasks on processors
showing abnormal performance. Abnormal performance means that the behavior of those
processors violates the performance model proposed in [105] and [54], with an
unexpectedly high local job arrival rate that delays remote Grid jobs. The algorithm uses
a prediction error threshold to trigger the rescheduling process. If the estimated
completion time of a Grid task would be shortened by migration, the task is migrated to
the processor that gives it the minimum completion time according to the current
prediction. The key problem in this algorithm is finding a proper threshold for deciding
whether a processor is abnormal. In the paper, the effect of different thresholds is tested
by simulation, but it could vary under different settings in practice. Another problem
with this algorithm is that it ignores the migration cost when computing the benefits of
migrations.
The above methods only consider the rescheduling of independent tasks in the Grid. A
novel low-cost rescheduling policy is proposed in [90] to improve the initial static
schedule of a DAG. This algorithm considers only a selective set of tasks for
rescheduling, based on measurable properties. The key idea of this selective rescheduling
policy is to evaluate, at run time before each task starts execution, the starting time of
each node against its estimated starting time in the static schedule and the maximum
allowable delay, in order to decide whether to reschedule. The maximum allowable delay
can be the slack of a task, or its minimal spare time. The slack of a task is defined as the
maximal value that can be added to the execution time of the task without affecting the
overall makespan of the schedule. The minimal spare time is defined as the maximal
value that can be added to the execution time of the task without delaying its successors'
execution. As the tasks of the DAG run, the rescheduler maintains two schedules: one is
the static schedule using estimated values, and the other keeps track of the status of the
executed tasks, so that the gap between the original schedule and the real run is known.
Once the gap exceeds the slack or the minimal spare time, rescheduling of all unexecuted
tasks is triggered, and a new slack or minimal spare time is recomputed for each task.
This rescheduling policy has been applied to some of the list algorithms discussed in
3.4.2, such as HEFT, FCP and Hybrid. Simulation results show that the policy can
achieve performance comparable to a policy that reschedules all tasks in the DAG, while
dramatically reducing the rescheduling overhead.
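The selective trigger itself is only a comparison per task; in the sketch below the estimated start times and slack values are hypothetical, and the computation of slack from the static schedule is omitted:

```python
# Hypothetical static schedule: estimated start times and the maximum
# allowable delay (here, the slack) derived from the initial schedule.
static_start = {"T1": 0, "T2": 4, "T3": 4, "T4": 9}
slack = {"T1": 0, "T2": 2, "T3": 0, "T4": 0}  # e.g. T2 may slip 2 time units

def needs_rescheduling(task, actual_start):
    """Selective test sketched from [90]: reschedule the remaining tasks only
    when a task's observed start time drifts past its statically estimated
    start by more than its maximum allowable delay."""
    return actual_start - static_start[task] > slack[task]
```

A task on the critical path (slack 0, such as T3) triggers rescheduling on any delay, while an off-critical-path task like T2 tolerates a drift up to its slack, which is what keeps the rescheduling overhead low.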
4. Summary and Open Issues
In this literature review, the task scheduling problem in Grid computing is discussed,
mainly from the aspect of scheduling algorithms. Although task scheduling in parallel and
distributed systems has been intensively studied, new challenges in Grid environments still
make it an interesting topic, and many research projects are underway. Through our
survey of current scheduling algorithms for the Grid computing scenario, we find that
heterogeneity, dynamism, and the separation of computation and data are the primary
challenges addressed by current research on this topic. We also find that the evolution of
Grid infrastructures, e.g., support for complex application models such as DAGs,
resource information services and job migration frameworks, provides an opportunity to
implement sophisticated scheduling algorithms. In addition to enhancements of classic
scheduling algorithms, new methodologies are being applied, such as adaptive
application-level scheduling, Grid economic models and Nature's heuristics.
Due to the characteristics of Grid computing and the complex nature of the scheduling
problem itself, there are still many open issues for Grid scheduling [8], relating to
scheduling architectures, protocols, models etc. However, we only focus on those
questions about algorithms. Eschewing any specific assumptions, we find the following
general issues deserving of future exploration.
Application and Enhancement of Classic Heterogeneous Scheduling Algorithms in
Grid Environments (recall 3.4.1 and 3.4.2)
We have seen that many solutions for Grid scheduling are inspired by classic
algorithms for traditional systems. But the potential of these algorithms in Grid
scenarios has not been fully exploited. For example, in 3.4.2, we discussed mainly three
kinds of heuristics used for scheduling tasks with precedence orders in heterogeneous
parallel and distributed systems: list heuristics, duplication heuristics and clustering
heuristics. These heuristics were first invented for DAG scheduling in homogeneous
systems, and then amended for heterogeneous systems. Although the ideas behind the
latter two categories also have many advantages in the Grid scenario, almost all
algorithms in the current literature are list algorithms. Since all of these heuristics
consider complex application models, where tasks can be fine-grained with data and
control dependencies, there is great potential for using them in Grid computing as the
underlying supporting infrastructures continue to develop.
Heuristics Exploiting Performance Prediction Information
Most current heuristics take only a single value from the prediction when they make the
estimate, and assume that value is static during the job execution period. This might
waste the prediction effort, which can actually provide continuous information about the
variation of the system. So heuristics that can exploit multiple-stage prediction
information should be designed. Compared to the resource models of current
heterogeneous heuristics, the resource models for such algorithms should be extended
with an extra matrix to represent the performance values as time proceeds. Fig. 13 shows
an example. It can be expected that these heuristics will be more complex than current
ones
that only consider the heterogeneity among resources rather than the variation of
performance. Another question we can ask is whether we can re-establish approximation
bounds for makespan optimization based on performance predictions. For example, if we
know that the range of performance fluctuation is bounded, can we find a corresponding
bound for the ratio of the real makespan to the optimal finish time?
Fig. 13: (a) A 2-D matrix depicts predictions on processor performance in traditional systems. (b) In
Grid computing, performance prediction has a 3-D structure.
Fig. 14: An example of a DAG with nodes having QoS requirements.
New Algorithms Considering Combined Computation and Data Scheduling (recall 3.5)
As discussed, scheduling algorithms in traditional computing paradigms barely
consider the data transfer problem when mapping computational tasks, and this neglect
can be costly in the Grid scenario. Only a handful of current research efforts consider the
simultaneous optimization of computation and data transfer scheduling, which leaves
opportunities for future studies. Fig. 15 gives a simple example of how taking data
staging into account affects the scheduling. If data staging is not considered, the critical
path of the DAG in Fig. 15 (a) is T1-T2-T4, but if the data transfer cost is considered, the
critical path changes to T1-T3-T4. This consideration also affects the selection of the
computational node for a task: if there were no data transfer cost, T2 would complete
earlier on P1, but when the cost is counted, P2 is preferred. The situation could be far
more complex if there are multiple copies of D1 and D2, and data dependencies among
the Tx are considered.
Fig. 15: (a) A DAG with input data D1 and D2. Suppose the data dependency between any two
computation nodes Tx and Ty is 0. (b) Resources available for the job in (a); P1 and P2 are two
computational nodes; D1 is on Database1 and D2 is on Database2. (c) Computational cost estimates.
(d) Data transfer cost estimates.
At the higher level, where fine-grained resource information is harder to obtain, global
scheduling can use coarse information (such as site load and the communication delays
of WAN links) to provide decentralized load balancing mechanisms. At the lower level, it
is easy for local scheduling to utilize more specific information (such as information
from a local forecaster) to make adaptive decisions.
References
[1] A. Abraham, R. Buyya and B. Nath, "Nature's Heuristics for Scheduling Jobs on
Computational Grids," in Proc. of the 8th IEEE International Conference on Advanced
Computing and Communications (ADCOM 2000), pp. 45-52, Cochin, India, December 2000.
[2] A. K. Aggarwal and R. D. Kent, "An Adaptive Generalized Scheduler for Grid
Applications," in Proc. of the 19th Annual International Symposium on High Performance
Computing Systems and Applications (HPCS'05), pp. 15-18, Guelph, Ontario, Canada, May 2005.
[3] M. Aggarwal, R. D. Kent and A. Ngom, "Genetic Algorithm Based Scheduler for
Computational Grids," in Proc. of the 19th Annual International Symposium on High
Performance Computing Systems and Applications (HPCS'05), pp. 209-215, Guelph, Ontario,
Canada, May 2005.
[4] A. H. Alhusaini, V. K. Prasanna and C. S. Raghavendra, "A Unified Resource
Scheduling Framework for Heterogeneous Computing Environments," in Proc. of the 8th
Heterogeneous Computing Workshop, pp. 156-165, San Juan, Puerto Rico, April 1999.
[5] W. Allcock, J. Bresnahan, R. Kettimuthu, M. Link, C. Dumitrescu, I. Raicu and
I. Foster, "The Globus Striped GridFTP Framework and Server," in Proc. of the 2005
ACM/IEEE Conference on Supercomputing, pp. 54-64, Seattle, Washington, USA, November 2005.
[6] M. Arora, S. K. Das and R. Biswas, "A Decentralized Scheduling and Load Balancing
Algorithm for Heterogeneous Grid Environments," in Proc. of the International
Conference on Parallel Processing Workshops (ICPPW'02), pp. 499-505, Vancouver,
British Columbia, Canada, August 2002.
[7] M. Aktaruzzaman, "Literature Review and Survey: Resource Discovery in
Computational Grids," School of Computer Science, University of Windsor, Windsor,
Ontario, Canada.
[8] A. Andrieux, D. Berry, J. Garibaldi, S. Jarvis, J. MacLaren, D. Ouelhadj and
D. Snelling, "Open Issues in Grid Scheduling," Official Technical Report of the Open
Issues in Grid Scheduling Workshop, Edinburgh, UK, October 2003.
[9] R. Bajaj and D. P. Agrawal, "Improving Scheduling of Tasks in a Heterogeneous
Environment," in IEEE Transactions on Parallel and Distributed Systems, Vol. 15,
No. 2, pp. 107-118, February 2004.
[10] M. Baker, R. Buyya and D. Laforenza, "Grids and Grid Technologies for Wide-area
Distributed Computing," in J. of Software-Practice & Experience, Vol. 32, No. 15, pp: