
Computing Resource Prediction for MapReduce

Applications Using Decision Tree


Jing Tai Piao and Jun Yan
School of Information Systems and Technology
University of Wollongong
Northfields Avenue, Wollongong, NSW, Australia
{jp928,jyan}@uow.edu.au

Abstract. The cloud computing paradigm offers users access to computing
resources in a pay-as-you-go manner. However, for both cloud computing
vendors and users, it is a challenge to predict how much resource is needed to
run an application in a cloud at a required level of quality. This research focuses
on developing a model to predict the computing resource consumption of
MapReduce applications in the cloud computing environment. Based on the
Classification and Regression Tree (CART), the proposed approach derives
knowledge of the relationship among the application features, quality of
service, and amount of computing resource from a small training set. The
experiments show that the prediction accuracy is as high as 80%. This research
can potentially benefit both cloud vendors and users through improving
resource management and reducing costs.
Keywords: Cloud computing, Decision tree, Machine learning, MapReduce,
Resource Prediction.

1 Introduction

Cloud computing has unveiled a new era of large-scale and data-intensive
computing. By virtualizing the underlying physical computation resources, a
computer cluster can be formed in a more economical and flexible way, so that
computational resources can be requested on demand. Data-parallel applications
that were previously exclusive to high-performance computing can now be executed
on such a virtualized cluster rather than on expensive physical computers. As a
parallel programming model, MapReduce has been implemented by major cloud
computing service providers, such as Amazon Elastic MapReduce [13]. MapReduce
simplifies parallel application development with its Map and Reduce functions,
freeing developers from task synchronization concerns [6, 13]. Under such
circumstances, huge numbers of CPU cores and large amounts of memory are
assigned to a vast number of virtual machines to support MapReduce applications.
Management of computing resources is therefore a challenge to both cloud users
and cloud computing providers. On the one hand, cloud users have problems
estimating the computing resource requirements for running their applications at
a required level of quality. For instance, the customers of Amazon Elastic MapReduce
often select the number and type of instances either arbitrarily or based on a rough
estimation. As a consequence, the customers may unnecessarily pay for resource that
is not needed for running the application, or the execution of the application may not
comply with the required quality level, e.g., the execution takes longer due to a
shortage of resource. On the other hand, from the service providers' perspective, it is
difficult to improve the efficiency of resource utilization while always allocating an
adequate amount of resource to applications.
Obviously, it is desirable that the computational resource requirements of an
application can be predicted before the application is actually run. This would
potentially benefit both cloud users and cloud vendors: cloud users would be able to
request computational resources from the cloud in a more economical way, while
cloud vendors would be able to schedule their resources in a more efficient manner.
The rest of the paper is structured as follows. The next section reviews the major
related research. Section 3 then gives an overview of our approach, covering the
MapReduce application model and the CART algorithm. Section 4 discusses the
proposed prediction model in detail. Section 5 presents the experimental results and
compares the prediction accuracy of our approach with the k-nearest neighbour
algorithm. Finally, Section 6 concludes this paper and outlines our future work.

2 Related Work

Machine learning approaches have been adopted by a number of researchers to
predict the computing resource consumption of applications, in terms of the number
of CPUs, memory and disk size, and so on. By collecting historical application
execution data as the training set, the prediction problem can be modelled as a
supervised machine learning problem [1, 5]. Fairly intensive work has been done on
this problem, and the major research contributions are reviewed in the following.
Linear regression assumes that the relationship between the dependent variable y and
the independent variables x can be described as y = βx + ε. Theoretically, it can be
expected that the completion time of an application decreases as the computing
resource consumed increases. This approach is applied in [1] to simplify the
sophisticated modelling problem for medical image processing applications. In
practice, however, an increase in computational resource may not relate to the
decline in completion time in a linear way [13]. Rather, a number of approaches have
been shown to predict more accurately than the linear model [3, 4, 12].
Support vector machines (SVM) have advantages in handling a large number of
attributes in non-linear scenarios, whereas the disadvantage of SVM is that it brings
extra computational overhead and relies heavily on the size of the training set [2].
In [3], the characteristics of an application form multiple dimensions of a space, so
the Euclidean distance between applications in that space can describe the similarity
among them.
The advantage of this kind of approach is that the prediction result can be highly
accurate, but the accuracy usually relies on abundant historical data, and the
computational complexity is relatively higher than that of other approaches.
Decision tree algorithms ignore the correlation between attributes, classifying the
training set according to its characteristics. At each node of the tree structure, a
split criterion is used to divide the training set into different categories. The
splitting process does not stop until the purity of each leaf of the tree is
satisfactory. The C4.5 decision tree algorithm is implemented in [10] to classify the
training set into different time intervals. However, the time intervals are static, so
the control granularity cannot be optimized. In [5, 8, 9, 11, 12], the application
attributes are selected and grouped into subsets so that the search can be done on
templates instead of the tree structure. The characteristics of an application are
described as a template, such as several predefined attributes, and thereby the
similarity between applications can be judged by comparing each attribute. Further,
research results have proven that the prediction accuracy of the template approach is
better than adopting complex regression as the characteristics split method [12].

3 Overview

This research is based on the primitive assumption that applying the same computing
resources to similar applications yields similar performance. It is also inspired by
the success of predicting application runtime from pre-execution features in previous
research [4, 7, 10, 12]. It should therefore be possible to predict an application's
computing resource demand by analysing data from previous similar applications.
The CART algorithm provides such a way to classify applications by splitting on
their characteristics one by one. It has a significant advantage in the reduced size of
the training set, compared with other machine learning algorithms such as SVM and
the k-nearest neighbour algorithm [9].
In order to apply CART to predict computing resource requirements, the historical
execution information of applications, including application features, performance,
and computing resource consumption, should be collected as the training set. The
computing resource data are chosen as the dependent variable, whereas the other
variables are chosen as independent variables. We assume that the computing
resources can be combined from several fixed templates, similar to the way Amazon
EC2 offers several types of instances to build a cluster with arbitrary computational
capacity. Thus, the computing resource is predefined as classes, whereas the
application features are presented as multiple attributes.
In this model, the performance of an application is represented by its execution time,
while the application features are defined as several attributes. CART can then be
applied to the data by setting the performance and application feature variables as
predictors.
In CART, the size of the tree significantly determines the accuracy of prediction.
However, as the tree is constructed deeper and deeper, the computational complexity
increases simultaneously. In order to balance accuracy and computational efficiency,
we implement CART by the following steps, sketched in the code below:

1. Construct the tree with a minimum size,
2. Observe the prediction accuracy p_i, and
3. If p_i < 80%, reconstruct a deeper tree, until p_i ≥ 80%.
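A minimal sketch of this loop, using scikit-learn's DecisionTreeClassifier as a
stand-in for CART (the function name, the depth schedule, and the use of 10-fold
cross-validated accuracy as p_i are illustrative assumptions, not the authors'
implementation):

    # Sketch: grow the tree deeper until the estimated prediction accuracy
    # reaches the 80% target. Assumes X (application features) and y
    # (resource class labels) have been collected as described in Section 4.
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def grow_until_accurate(X, y, target=0.80, max_depth=20):
        for depth in range(1, max_depth + 1):
            tree = DecisionTreeClassifier(criterion="gini", max_depth=depth)
            accuracy = cross_val_score(tree, X, y, cv=10).mean()
            if accuracy >= target:
                break
        return tree.fit(X, y), accuracy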

In the tree construction phase, the original learning set is split according to the
impurity function. For each node of CART, the impurity function takes its maximum
value when all classes are mixed equally and its minimum value when the node
contains only one class. Among all possible splits, the best one offers the largest
decrease in impurity.
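As a concrete illustration, the Gini index, a common CART impurity measure (the
paper does not name the exact function used), behaves exactly this way; a candidate
split is then scored by the decrease in impurity it yields:

    from collections import Counter

    def gini(labels):
        # Zero when the node contains only one class; maximal when all
        # classes are mixed equally.
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def impurity_decrease(parent, left, right):
        # The best split is the one with the largest impurity decrease.
        n = len(parent)
        return (gini(parent)
                - (len(left) / n) * gini(left)
                - (len(right) / n) * gini(right))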

4 The Prediction Approach

4.1 Prediction Model

The objective of this research is to predict the computational resource demand of a
MapReduce application. Most previous solutions require collecting and accumulating
a huge amount of historical application execution data. Our decision tree based
approach can not only provide satisfactory prediction accuracy but also dramatically
reduce the size of the training data set. In addition, the decision tree approach
reduces the computational complexity compared with other distance-based
algorithms, so the learning process can be much faster.
The core problem of our research is how to model the application features so that
the training set is compact yet effective for prediction. In our approach, we
describe each application using a 6-tuple:
{TaskCompletionTime, dataSize, dataDistributionNumber, dataSiteNumber,
MapperNumber, ReducerNumber}
- TaskCompletionTime is the time period from when the first mapper starts to when
  the last reducer completes.
- dataSize is the total size of the workload; the unit of this attribute is
  megabytes (MB) in this approach.
- dataDistributionNumber is the number of workload distributions. Where the
  workload of the application is stored separately and the performance of the
  application might be influenced by the number of data pieces, this attribute
  describes the number of pieces the workload has been divided into.
- dataSiteNumber is the number of sites used to store the workload.
- MapperNumber and ReducerNumber are straightforward values showing how many
  mappers and reducers are set up in the application.
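For illustration, a training record under this model could be encoded as follows;
the field names are hypothetical and simply mirror the 6-tuple above, with the
resource class attached as the dependent label:

    from dataclasses import dataclass

    @dataclass
    class ApplicationRecord:
        task_completion_time: float    # seconds, first mapper start to last reducer completion
        data_size: float               # total workload size in MB
        data_distribution_number: int  # pieces the workload is divided into
        data_site_number: int          # sites storing the workload
        mapper_number: int             # mappers set up in the application
        reducer_number: int            # reducers set up in the application
        resource_class: int            # dependent variable: computing resource class

    # Example: the first row of Table 1 below.
    sample = ApplicationRecord(1700, 4.4, 1120, 1, 1, 1, 2)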
Historical application execution data are collected according to the above model. By
means of the decision tree algorithm, similar applications can be classified into the
same leaf of the tree, and a split rule can then be extracted from the tree.


For a newly arriving application, it is possible to determine which leaf the
application belongs to by applying this rule to it.
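Continuing the earlier sketch, this lookup is a single prediction call on the fitted
tree (the feature values below are hypothetical):

    # Hypothetical new application, encoded in the same attribute order as
    # the training records: completion time (s), data size (MB), data
    # distribution number, data site number, mapper number, reducer number.
    new_app = [[1600, 3.0, 1000, 1, 1, 1]]
    fitted_tree, _ = grow_until_accurate(X, y)         # sketch from Section 3
    predicted_class = fitted_tree.predict(new_app)[0]  # a resource class label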
In this research, each leaf of the tree indicates a computing resource class. In
practice, Amazon Elastic MapReduce classifies its computational resources into
different classes, such as standard instances, high-memory instances, high-CPU
instances, and so on. Each type of instance has a different computational capacity.
For example, a default small instance of Amazon Elastic MapReduce runs a 32-bit
platform with 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2
Compute Unit), and 160 GB of instance storage, whereas a High-Memory Double Extra
Large Instance runs a 64-bit platform with 34.2 GB of memory, 13 EC2 Compute
Units (4 virtual cores with 3.25 EC2 Compute Units each), and 850 GB of instance
storage [14]. In our case, we form our experimental computing resources into four
classes, similar to the instance classification of Amazon Elastic MapReduce. The
computing resource categories include:

- Class 1: highest performance type, with 2 CPU cores (3.33 GHz) and 2 GB memory;
- Class 2: high performance type, with 1 CPU core (3.33 GHz) and 1 GB memory;
- Class 3: standard type, with 1 CPU core (3.33 GHz) and 512 MB memory;
- Class 4: economic type, with 1 CPU core (2.40 GHz) and 512 MB memory.
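These four classes can be thought of as fixed templates; an illustrative encoding of
the specifications listed above:

    # Computing resource templates used in the experiments (class id -> specification).
    RESOURCE_CLASSES = {
        1: {"cpu_cores": 2, "cpu_ghz": 3.33, "memory_mb": 2048},  # highest performance
        2: {"cpu_cores": 1, "cpu_ghz": 3.33, "memory_mb": 1024},  # high performance
        3: {"cpu_cores": 1, "cpu_ghz": 3.33, "memory_mb": 512},   # standard
        4: {"cpu_cores": 1, "cpu_ghz": 2.40, "memory_mb": 512},   # economic
    }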

The following table shows a part of the training samples:

Table 1. A part of the training samples

Time (seconds)  Data Size (MB)  Data Number  Site Number  Mapper  Reducer  Class
1700            4.4             1120         1            1       1        2
1530            2.6             994          1            1       1        2
1583            2.8             1036         1            1       1        2
1726            2.9             1094         1            1       1        2
1490            2.5             976          1            1       1        2
1728            3.0             1130         1            1       1        2
1735            2.9             1121         1            1       1        2
2326            69.1            1515         1            1       1        2

In this research, the application is specified as the classic word count application,
and the workload is randomly generated with sizes ranging from 0.7 MB to 69 MB,
consisting of 226 to 3028 files, respectively. The word count application was
executed under the four predefined computational environments. The training set is
collected from 200 observations, consisting of 50 samples for each class.
After accumulating the training data, the CART algorithm is adopted to construct
the tree. Fig. 1 shows a part of the tree, which contains 15 nodes and reaches an
accuracy of 53.7%.


Fig. 1. A part of the CART in this research

5 Evaluation

Because of the limited size of the training sample, this research uses n-fold
cross-validation to evaluate the prediction accuracy. The method randomly divides
the sample set X into several subsets (X_1, X_2, ..., X_n); for each subset X_n,
X \ X_n is used as the learning sample and X_n as the test sample, and the accuracy
estimate is the average of the n test results.
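A minimal sketch of this procedure, assuming the features and class labels are held
in NumPy arrays X and y:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    def nfold_accuracy(X, y, n=10):
        # Each fold in turn serves as the test sample; the remaining folds
        # form the learning sample. The estimate is the mean of the n scores.
        scores = []
        for train_idx, test_idx in KFold(n_splits=n, shuffle=True).split(X):
            model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
            scores.append(model.score(X[test_idx], y[test_idx]))
        return np.mean(scores)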
Fig. 2 shows the accuracy estimation result as the number of nodes increases. Here,
the accuracy is estimated by n-fold cross-validation with n = 10. The prediction
accuracy increases as the tree is constructed deeper. When the prediction accuracy
reaches the predefined value (80%), the construction of the tree is terminated.

Fig. 2. The prediction accuracy of CART


In the next phase, the k-nearest neighbour algorithm is applied to the same learning
set, with the parameter k ranging from 1 to 10. Again, the accuracy is measured by
n-fold cross-validation with n = 10. The prediction results can be seen in Fig. 3.

Fig. 3. The prediction accuracy trend of k-nearest neighbor algorithm
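The k-NN baseline can be reproduced along these lines (a sketch under the same
assumed X and y arrays; scikit-learn's default Euclidean distance metric is an
assumption):

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def knn_accuracy_sweep(X, y, k_max=10, folds=10):
        # 10-fold cross-validated accuracy of k-NN for k = 1..k_max,
        # matching the evaluation protocol used for CART above.
        return {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                   X, y, cv=folds).mean()
                for k in range(1, k_max + 1)}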

According to the experimental results, we found that CART achieves higher
prediction accuracy from a small training sample than the k-nearest neighbour
algorithm, whose accuracy only ranges from 34.1% to 48.2%.

6 Conclusion

We have demonstrated a CART-based method to predict the computing resource
consumption of a MapReduce application within the context of cloud computing. The
success of previous research that estimated the performance of applications before
actually executing them proves the feasibility of predicting computing resource
demand. Such prediction would be of significant value in improving the quality of
current cloud computing services, such as Amazon Elastic MapReduce. By means of
such a prediction mechanism, users could request adequate computing resources from
the cloud so that their applications can be executed in a more economical way.
Compared with previous prediction algorithms, CART takes advantage of a small
training sample set, so that the prediction result can be obtained earlier than with
other approaches. According to the experimental results, the prediction accuracy of
our approach is much higher than that of the k-nearest neighbour algorithm, another
well-known machine learning algorithm.
In the future, a client-side application will be developed, based on the proposed
CART algorithm, to improve the current service of Amazon Elastic MapReduce. The
application will allow users to specify the expected execution time before
submitting MapReduce applications. To do so, the application features in the
prediction model will be extended into a general model.


References
1. Albers, R., Suijs, E., de With, P.H.N.: Triple-C: Resource Usage Prediction for Semi-Automatic Parallelization of Groups of Dynamic Image-Processing Tasks. In: Proc. of the 23rd Int. Parallel and Distributed Processing Symp. (2009)
2. Duan, R., Nadeem, F., Wang, J.: A Hybrid Intelligent Method for Performance Modeling and Prediction of Workflow Activities in Grids. In: Proc. of the 9th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid), Shanghai, China, pp. 339–347 (May 2009)
3. Ganapathi, A., Chen, Y., Fox, A.: Statistics-Driven Workload Modeling for the Cloud. In: ICDE Workshops 2010, pp. 87–92 (2010)
4. Ganapathi, A., Kuno, H., Dayal, U., et al.: Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning. In: Proc. of the 2009 IEEE International Conference on Data Engineering, Shanghai, China, pp. 592–603 (March 2009)
5. Gibbons, R.: A Historical Application Profiler for Use by Parallel Schedulers. In: Feitelson, D.G., Rudolph, L. (eds.) IPPS-WS 1997 and JSSPP 1997. LNCS, vol. 1291, pp. 58–77. Springer, Heidelberg (1997)
6. Kaashoek, F., Morris, R., Mao, Y.: Optimizing MapReduce for Multicore Architectures. Technical report, http://dspace.mit.edu/bitstream/handle/1721.1/54692/MIT-CSAIL-TR-2010-020.pdf?sequence=1
7. Matsunaga, A., Fortes, J.: On the Use of Machine Learning to Predict the Time and Resources Consumed by Applications. In: Proc. of the 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid), Melbourne, Australia, pp. 495–504 (June 2010)
8. Mualem, A.W., Feitelson, D.G.: Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling. IEEE Transactions on Parallel and Distributed Systems 12(6) (June 2001)
9. Mitchell, T.M.: Machine Learning. McGraw-Hill Science/Engineering/Math (March 1, 1997)
10. Nadeem, F., Fahringer, T.: Using Templates to Predict Execution Time of Scientific Workflow Applications in the Grid. In: Proc. of the 9th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid), Shanghai, China, pp. 316–323 (May 2009)
11. Guim, F., Rodero, I., Corbalan, J., et al.: The Grid Backfilling: A Multi-Site Scheduling Architecture with Data Mining Prediction Techniques. In: CoreGrid Workshop on Grid Middleware (2007)
12. Smith, W.: Prediction Services for Distributed Computing. In: Proc. of the IEEE International Parallel and Distributed Processing Symposium, Long Beach, USA, pp. 1–10 (June 2007)
13. Smith, W., Foster, I., Taylor, V.: Predicting Application Run Times Using Historical Information. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 1998. LNCS, vol. 1459, pp. 122–142. Springer, Heidelberg (1998)
14. Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R., Stoica, I.: Improving MapReduce Performance in Heterogeneous Environments. In: OSDI 2008: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (2008)
15. Amazon Elastic MapReduce, http://aws.amazon.com/elasticmapreduce/
16. Apache Hadoop, http://hadoop.apache.org
