Q.Z. Sheng et al. (Eds.): APWeb 2012, LNCS 7235, pp. 570–577, 2012. © Springer-Verlag Berlin Heidelberg 2012

Introduction

Cloud computing has ushered in a new era of large-scale, data-intensive computing. By virtualizing the underlying physical hardware, a computer cluster can be formed in a more economical and flexible way, so that computational resources can be requested on demand. Data-parallel applications that were previously exclusive to high-performance computing can now be executed on virtualized clusters rather than on expensive dedicated machines. As a parallel programming model, MapReduce has been implemented by major cloud computing service providers, such as Amazon Elastic MapReduce [13]. MapReduce reduces parallel application development to writing a Map function and a Reduce function, freeing developers from handling task synchronization [6, 13]. Under such circumstances, huge numbers of CPU cores and large amounts of memory are assigned to virtual machines to support MapReduce applications. Managing these computing resources is a challenge for both cloud users and cloud providers. On the one hand, cloud users have difficulty estimating the computing resources required to run their applications at a required level of quality. For instance, customers of Amazon Elastic MapReduce often select the number and type of instances either arbitrarily or based on a rough estimation. As a consequence, customers may pay for resources that are not needed to run the application, or the execution may fail to meet the required quality level, e.g., it takes longer due to a shortage of resources. On the other hand, from the service providers' perspective, it is difficult to improve resource utilization while always allocating an adequate amount of resources to applications.
Obviously, it is desirable that the computational resource requirements of an application can be predicted before the application is actually run. This would benefit both cloud users and cloud vendors: users would be able to request computational resources from the cloud more economically, while vendors would be able to schedule their resources more efficiently.
The rest of the paper is structured as follows. The next section reviews the major related research. Section 3 then gives an overview of MapReduce applications, followed by an introduction to the CART algorithm in Section 4. After that, Section 5 discusses the proposed prediction model in detail. Section 6 presents the experimental results and compares the prediction accuracy of our approach with the K-nearest neighbour algorithm. Finally, Section 7 concludes this paper and outlines our future work.
Related Work
approaches is that the prediction result can be highly accurate. However, the accuracy usually relies on abundant historical data, and the computational complexity is relatively higher than that of other approaches.
Decision tree algorithms ignore the correlation between attributes and classify the training set according to its characteristics. At each node of the tree, a split criterion is used to divide the training set into different categories. The splitting process does not stop until the purity of each leaf of the tree is satisfactory. The C4.5 decision tree algorithm is used in [10] to classify the training set into different time intervals. However, the time intervals are static, so the control granularity cannot be optimized. In [5, 8, 9, 11, 12], application attributes are selected and grouped into subsets so that the search can be performed over templates instead of a tree structure. The characteristics of an application are described as a template, i.e., several predefined attributes, and the similarity between applications can then be judged by comparing each attribute. Furthermore, the research results have shown that the prediction accuracy of the template approach is better than that of adopting complex regression as the characteristic split method [12].
Overview
This research is based on the basic assumption that applying the same computing resources to similar applications yields similar performance. It is also inspired by the success of predicting application runtime from pre-execution features in previous research [4, 7, 10, 12]. It should therefore be possible to predict an application's computing resource demand by analysing data from previous, similar applications.
The CART algorithm provides such a way to classify applications by splitting on their characteristics one by one. Compared with other machine learning algorithms such as SVM and the K-nearest neighbour algorithm, it has the significant advantage of requiring a smaller training set [9].
To apply CART to predict computing resource requirements, the historical execution information of applications, including application features, performance, and computing resource consumption, should be collected as the training set. The computing resource data are chosen as the dependent variable, whereas the other variables are chosen as independent variables. We assume that the computing resources can be combined from several fixed templates. This is similar to the way Amazon EC2 offers several types of instances from which a cluster of arbitrary computational capacity can be built. Thus, the computing resources are predefined as classes, whereas the application features are represented by multiple attributes.

In this model, the performance of an application is represented by its execution time, while the application features are defined as several attributes. CART can then be applied to the data by setting the performance and application feature variables as predictors.
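As an illustration, this setup can be sketched with scikit-learn's DecisionTreeClassifier, which implements the CART algorithm. The feature names and data values below are hypothetical placeholders, not the paper's training set:

```python
# Sketch of the prediction setup: application features and observed
# performance as predictors, a predefined resource class as the target.
# All data values here are illustrative, not from the paper's experiments.
from sklearn.tree import DecisionTreeClassifier

# Each row: [input size (MB), number of input files, execution time (s)]
X = [
    [69.0, 3028, 410.0],
    [35.0, 1500, 350.0],
    [10.0,  500, 200.0],
    [ 0.7,  226,  90.0],
]
# Target: one of the four predefined resource classes
y = [1, 2, 3, 4]

# CART grows a binary tree, choosing at each node the split that
# yields the largest decrease in Gini impurity.
clf = DecisionTreeClassifier(criterion="gini")
clf.fit(X, y)

# Predict the resource class for a new application profile
predicted_class = clf.predict([[40.0, 1800, 360.0]])[0]
```

In practice the training set would consist of historical observations of real executions, as described in the experiments below.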
In CART, the size of the tree significantly affects the prediction accuracy. However, as the tree is constructed deeper and deeper, the computational complexity also grows. Tree construction is therefore terminated once the predefined threshold (p_i ≥ 80%) is reached.
In the tree construction phase, the original learning set is split according to the impurity function. For each node of the CART, the impurity function attains its maximum value when all classes are mixed equally and its minimum value when the node contains only one class. Among all possible splits, the best one is that which offers the largest decrease in impurity.
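As a concrete illustration of this criterion (using Gini impurity, the common choice for CART; this is a sketch, not the paper's exact formulation):

```python
# Gini impurity: maximal when classes are mixed equally, zero for a pure node.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Drop in weighted impurity achieved by splitting `parent` into two
    children; CART picks the split with the largest such decrease."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

print(gini([1, 2, 1, 2]))                               # equally mixed -> 0.5
print(gini([1, 1, 1, 1]))                               # pure node     -> 0.0
print(impurity_decrease([1, 2, 1, 2], [1, 1], [2, 2]))  # perfect split -> 0.5
```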
4.1 Prediction Model
Class 1: highest performance type, with 2 CPU cores (3.33 GHz) and 2 GB memory;
Class 2: high performance type, with 1 CPU core (3.33 GHz) and 1 GB memory;
Class 3: standard type, with 1 CPU core (3.33 GHz) and 512 MB memory;
Class 4: economic type, with 1 CPU core (2.40 GHz) and 512 MB memory.
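These four templates amount to a simple lookup table; the Python representation below is ours, transcribed from the class definitions above:

```python
# The four predefined resource classes, transcribed from the text above.
RESOURCE_CLASSES = {
    1: {"cores": 2, "clock_ghz": 3.33, "memory_mb": 2048},  # highest performance
    2: {"cores": 1, "clock_ghz": 3.33, "memory_mb": 1024},  # high performance
    3: {"cores": 1, "clock_ghz": 3.33, "memory_mb": 512},   # standard
    4: {"cores": 1, "clock_ghz": 2.40, "memory_mb": 512},   # economic
}

# A predicted class number maps directly to a concrete resource allocation.
print(RESOURCE_CLASSES[3])
```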
[Table residue: a training-data attribute table listing, for each observation group, its data number (1120, 994, 1036, 1094, 976, 1130, 1121, 1515) and site number (1 or 2); the original table layout could not be recovered.]
In this research, the application is the classic word count application, and the workload is randomly generated, with sizes ranging from 0.7 MB to 69 MB and consisting of 226 to 3028 files, respectively. The word count application was executed under the four predefined computational environments. The training set was collected from 200 observations, consisting of 50 samples for each class.
After accumulating the training data, the CART algorithm is used to construct the tree. Fig. 1 shows a part of the tree, which contains 15 nodes; its accuracy reaches 53.7%.
Evaluation
Owing to the limited size of the training sample, this research uses N-fold cross-validation, which randomly divides the sample set into several subsets (X_1, X_2, ..., X_n), to evaluate the prediction accuracy. For each subset X_n, X \ X_n is used as the learning sample and X_n as the test sample; the accuracy estimate is the average of the n test results.
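The procedure can be sketched as follows (a minimal illustration; the trivial majority-vote model stands in for the actual CART classifier):

```python
import random
from collections import Counter

class MajorityClassifier:
    """Stand-in model: always predicts the most common training label."""
    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
    def predict(self, X):
        return [self.label] * len(X)

def n_fold_accuracy(model, X, y, n=10, seed=0):
    """Randomly partition the samples into n subsets X_1..X_n; for each
    subset, train on the rest and test on it, then average the n results."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n] for i in range(n)]
    accuracies = []
    for test_idx in folds:
        train_idx = [i for i in idx if i not in test_idx]
        model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        predictions = model.predict([X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(predictions, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / n

# Toy data: 20 samples, 12 of class 1 and 8 of class 2. The majority
# model always predicts class 1, so the averaged accuracy is 12/20.
X = [[i] for i in range(20)]
y = [1] * 12 + [2] * 8
print(n_fold_accuracy(MajorityClassifier(), X, y, n=10))  # -> 0.6
```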
Fig. 2 shows the accuracy estimate as the number of nodes increases. Here, the accuracy is estimated by n-fold cross-validation with n = 10. The prediction accuracy increases as the tree is constructed deeper. Once the prediction accuracy reaches the predefined value (80%), the construction of the tree is terminated.
According to the experimental results, we found that CART can achieve higher prediction accuracy from a small training sample than the k-nearest neighbour algorithm, whose accuracy ranges only from 34.1% to 48.2%.
Conclusion
References
1. Albers, R., Suijs, E., de With, P.H.N.: Triple-C: Resource Usage Prediction for Semi-Automatic Parallelization of Groups of Dynamic Image-Processing Tasks. In: Proc. of the 23rd Int. Parallel & Distributed Processing Symp. (2009)
2. Duan, R., Nadeem, F., Wang, J.: A Hybrid Intelligent Method for Performance Modeling and Prediction of Workflow Activities in Grids. In: Proc. of the 9th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid), Shanghai, China, pp. 339–347 (May 2009)
3. Ganapathi, A., Chen, Y., Fox, A.: Statistics-Driven Workload Modeling for the Cloud. In: ICDE Workshops 2010, pp. 87–92 (2010)
4. Ganapathi, A., Kuno, H., Dayal, U., et al.: Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning. In: Proc. of the 2009 IEEE International Conference on Data Engineering, Shanghai, China, pp. 592–603 (March 2009)
5. Gibbons, R.: A Historical Application Profiler for Use by Parallel Schedulers. In: Feitelson, D.G., Rudolph, L. (eds.) IPPS-WS 1997 and JSSPP 1997. LNCS, vol. 1291, pp. 58–77. Springer, Heidelberg (1997)
6. Kaashoek, F., Morris, R., Mao, Y.: Optimizing MapReduce for Multicore Architectures. Technical report, http://dspace.mit.edu/bitstream/handle/1721.1/54692/MIT-CSAIL-TR-2010-020.pdf?sequence=1
7. Matsunaga, A., Fortes, J.: On the Use of Machine Learning to Predict the Time and Resources Consumed by Applications. In: Proc. of the 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid), Melbourne, Australia, pp. 495–504 (June 2010)
8. Mu'alem, A.W., Feitelson, D.G.: Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling. IEEE Transactions on Parallel and Distributed Systems 12(6) (June 2001)
9. Mitchell, T.M.: Machine Learning. McGraw-Hill Science/Engineering/Math (March 1, 1997)
10. Nadeem, F., Fahringer, T.: Using Templates to Predict Execution Time of Scientific Workflow Applications in the Grid. In: Proc. of the 9th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid), Shanghai, China, pp. 316–323 (May 2009)
11. Guim, F., Rodero, I., Corbalan, J., et al.: The Grid Backfilling: a Multi-Site Scheduling Architecture with Data Mining Prediction Techniques. In: CoreGrid Workshop on Grid Middleware (2007)
12. Smith, W.: Prediction Services for Distributed Computing. In: Proc. of the IEEE International Parallel and Distributed Processing Symposium, Long Beach, USA, pp. 1–10 (June 2007)
13. Smith, W., Foster, I., Taylor, V.: Predicting Application Run Times Using Historical Information. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 1998. LNCS, vol. 1459, pp. 122–142. Springer, Heidelberg (1998)
14. Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R., Stoica, I.: Improving MapReduce Performance in Heterogeneous Environments. In: OSDI 2008: Proc. of the 8th USENIX Conference on Operating Systems Design and Implementation (2008)
15. http://aws.amazon.com/elasticmapreduce/
16. http://hadoop.apache.org