
Business and Information 2013 (Bali, July 7-9)

TO ENHANCE THE SERVICE PERFORMANCE OF CLOUD COMPUTING BY USING K-MEANS BASED ON RFM DATA ANALYSIS
Kuo-Qin Yan, Shun-Sheng Wang*, Shu-Ching Wang*, Sheng-Zhong Chen Chaoyang University of Technology, 168, Jifeng E. Rd., Wufeng District, Taichung, 41349, Taiwan, R.O.C. {kqyan; sswang*; scwang*; s10014609}@cyut.edu.tw *: Corresponding author

ABSTRACT
With the evolution of information technology, more and more users are using cloud-computing services. When users use cloud-computing services, their files are stored in Cloud Storage. The availability of services is therefore one of the important factors that must be considered when providing cloud-computing services. To achieve this goal, the characteristics of files are analyzed in this study. Through the clustering of files, resources of the same type are concentrated, so that resources can be matched to files and services with similar attributes. In the proposed clustering mechanism, files are stored at the file level or the block level according to their size. Then, the attributes of each file are analyzed by the factor of file popularity. Finally, a modified K-means clustering algorithm is used to cluster the files, and similar files are stored in the same cluster. When users access similar files in the same cluster, they get a faster response to file operations, and the performance of cloud computing is improved.

Keywords: Cloud Computing, Cloud Storage, Distributed File System, Popularity, Clustering

1. INTRODUCTION
Due to the vigorous development of the Internet, together with the rapid growth of computer hardware performance and network bandwidth, Internet users are increasingly dependent on Internet services. Therefore, more and more resources are being invested in cloud computing, and cloud-computing applications are flourishing (Aymerich et al., 2008). The computing power of cloud computing comes from distributed computing, and its storage capacity is provided by a distributed file system; most cloud-computing environments use a distributed file system to store data. From a logical point of view, a distributed file system can be regarded as a hierarchical file system (Grossman, 2009).
When the system is based on file replication, a distributed file system can achieve both file reliability and access performance (Chang et al., 2008). When large amounts of data are processed, increasing the number of replicas increases the reliability of a file and can also shorten its access time (Ranganathan & Foster, 2001). In other words, a distributed file system can improve reliability through replicas, but too many copies of a file waste storage space. In addition, if storage locations are not planned when files are stored, all files may end up in the same directory. The various resources in the cloud-computing environment then cannot be fully used, which affects how future users access files.

In the architecture of a distributed file system, block level storage is mostly used (The Hadoop Distributed File System). However, with block level storage, a file smaller than a block wastes a lot of extra space, and its access time is the same as that of a whole block. Therefore, in this study, a file whose size is less than a block is classified as a small file and stored using file level storage.

High availability of files must be ensured when users access the files stored in Cloud Storage; hence, file maintenance is an important issue that must be explored in a cloud-computing environment. This study therefore examines the files stored in the cloud-computing environment and clusters them through the analysis of file attributes. With files of a similar profile placed in the same cluster, users can retrieve similar files more efficiently in the cloud-computing environment.

The rest of this paper is organized as follows. The literature review is discussed in Section 2. The proposed k-means based on RFM data analysis (RFM k-means) is presented in Section 3. Finally, Section 4 concludes this paper.

2. LITERATURE REVIEW
In this section, cloud computing, the storage of distributed file systems, RFM data analysis and k-means clustering are discussed.

2.1 Cloud Computing
Cloud computing is a new concept in distributed systems (Weiss, 2007). It is currently used mainly in business applications in which computers cooperate to perform a specific service together. In addition, Internet applications are continuously enhanced with multimedia, and networked devices are developing vigorously (Aymerich et al., 2008; Grossman, 2009).
As network bandwidth and quality outstrip computer performance, various communication and computing technologies previously regarded as belonging to different domains can now be integrated, such as telecommunication, multimedia, information technology, and construction simulation. Thus, applications associated with network integration have gradually attracted considerable attention. Similarly, cloud computing facilitated through distributed applications over networks has also gained increased recognition. In a Cloud computing environment, users have access to faster operational capability on the Internet (Peppers & Rogers, 1997; Wang et al., 2008), and the computer systems must have high stability to keep pace with this level of activity. In a distributed computing system, nodes allocated to different places or in separate units are connected so that they may collectively be used to greater advantage (Vouk, 2008). In such a system, each node must pass messages to other nodes to cooperatively complete user requests. Thus, Cloud computing can use low-power nodes to achieve high usability (Rimal et al., 2009). In addition, Cloud computing has greatly encouraged distributed system design and application to support user-oriented service applications (More Google Product, 2011). Furthermore, many applications of Cloud computing can increase user convenience, such as YouTube (More Google Product, 2011).


Cloud computing uses virtualization technology to fully utilize idle resources (Luo, 2010; Tzeng, 2010; Wang and Ng, 2010), increasing operation speed and lowering energy consumption. Because there are a large number of services and users in the Cloud computing environment, how to separate virtual spaces clearly and effectively in the virtual environment is a major issue of Cloud computing. Moreover, when users access Cloud computing services over the Internet, a large amount of data is stored in the Cloud computing environment. To provide users with a highly reliable and stable environment, the replication of files in the Cloud computing environment must be considered (Chang et al., 2008).

2.2 The Storage of Distributed File System
Files stored in a distributed file system can be divided into two categories: block level and file level (Cai et al., 2007; Gwertzman & Seltzer, 1995). The technology for physically storing files has evolved from DAS (Direct Attached Storage) to SAN (Storage Area Network), NAS (Network Attached Storage) and then iSCSI (Internet SCSI), and each of these physical storage methods uses either file level or block level storage. Therefore, in this section, file level and block level storage are described.

(1) File Level: In a Personal Computer (PC), file level storage is used; the user stores files in a single directory. For physical storage, DAS technology is used for file access. With file level storage, the entire file must be read before the system can respond to the user. Because network transmission speed is limited, the file access time is determined by the network speed when DAS technology is used.
Past research shows that when the file size is less than the block size, file level storage is faster than block level storage (Gwertzman & Seltzer, 1995).

(2) Block Level: In distributed file system architectures, block level is the most common storage architecture. At the block level, files are stored by block; hence, different blocks of a file can be accessed at the same time to respond to the user. Because a file can be transmitted through multiple blocks, the transfer is not limited to the speed of a single network connection.

2.3 RFM Data Analysis
Research on customer relationship management has produced many ways to help businesses understand their customer information and use it to understand consumer behavior and develop marketing strategies that meet customers' needs (Liu & Zhu, 2009). Kahan advocates the RFM (Recency, Frequency, Monetary) data analysis technique because it provides customer transaction information for businesses and is a behavior analysis technology that is more useful than cognitive analysis (Kahan, 1998). Among the methods for understanding customer-related information, RFM data analysis is the most widely used. In 1994, Hughes defined the RFM data analysis technique to analyze and measure consumer behavior (Hughes, 1994). Customers are segmented according to their past transactions as a basis for measuring customer loyalty and contribution.

The RFM data analysis technique is consistent in assessing the importance of a customer. It uses relative grading: each of R, F and M is divided into five equal portions, scored 1 to 5, so that each portion holds 20% of the entire database and a higher score represents a larger value. Therefore, there are three numbers in the record of each customer, and the compositions run from 555, 554, 553 down to 111, for 125 compositions in total. The record 555 represents the customer who purchased the product most recently, purchased it most frequently, and spent the most money on it. The record 111 represents the customer who never purchased the product or purchased it a long time ago, purchased it once or never, and spent little or no money on it (Hughes, 1994).

Traditionally, RFM data analysis is used in customer relationship management. RFM uses the history of customers' transactions to measure customer loyalty and contribution, which provides an in-depth reference for operations. As more and more users save files in the cloud-computing environment and use cloud-computing services, the availability of cloud computing becomes an important factor that must be considered when providing cloud services. Therefore, to maintain the usability of files in the cloud-computing environment, file replicas are used (Kahan, 1998; Rebecca et al., 2012). In addition, providing a fast service response mechanism is also an important issue that must be explored in the cloud storage environment. The access behavior information of files stored in the cloud environment has characteristics similar to the transaction information of consumer behavior.
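The relative grading described above can be illustrated with a minimal Python sketch. It assumes each customer is summarized by three raw values (days since the last purchase, number of purchases, total amount spent); the function names `quintile_scores` and `rfm_codes` and the rank-based tie handling are illustrative choices, not part of Hughes's definition.

```python
from typing import Dict, List, Tuple


def quintile_scores(values: List[float]) -> List[int]:
    """Score each value 1..5 by rank so that each score covers about 20% of the records."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices in ascending value order
    scores = [0] * n
    for rank, idx in enumerate(order):
        scores[idx] = rank * 5 // n + 1  # ranks 0..n-1 map onto scores 1..5
    return scores


def rfm_codes(customers: Dict[str, Tuple[float, float, float]]) -> Dict[str, str]:
    """Map customer id -> three-digit RFM code such as '555' or '111'.

    Each tuple is (days_since_last_purchase, purchase_count, amount_spent).
    Fewer days since the last purchase means a higher R score, so the
    recency values are negated before ranking (an assumption of this sketch).
    """
    ids = list(customers)
    r = quintile_scores([-customers[c][0] for c in ids])
    f = quintile_scores([customers[c][1] for c in ids])
    m = quintile_scores([customers[c][2] for c in ids])
    return {c: f"{r[i]}{f[i]}{m[i]}" for i, c in enumerate(ids)}


if __name__ == "__main__":
    sample = {
        "a": (1, 50, 900.0),
        "b": (10, 40, 700.0),
        "c": (30, 20, 500.0),
        "d": (90, 5, 100.0),
        "e": (300, 1, 10.0),
    }
    print(rfm_codes(sample))
```

Ranking rather than fixed thresholds keeps the five portions equal in size, which matches the relative grading in the text; equal raw values are split by input order in this sketch.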
Therefore, the traditional RFM data analysis technique is modified in this study and applied to the analysis of the value of stored files.

2.4 k-means Clustering
In data mining, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. The term k-means was first used by James MacQueen in 1967 (MacQueen, 1967). The most common algorithm uses an iterative refinement technique. Given an initial set of k means $m_1^{(1)}, \ldots, m_k^{(1)}$, the algorithm proceeds by alternating between two steps (MacKay, 2003):

Assignment step: Assign each observation to the cluster whose mean is closest to it.

$$S_i^{(t)} = \left\{ x_p : \left\| x_p - m_i^{(t)} \right\| \le \left\| x_p - m_j^{(t)} \right\| \ \forall\, 1 \le j \le k \right\}$$


where each $x_p$ is assigned to exactly one $S^{(t)}$, even if it could be assigned to two or more of them.

Update step: Calculate the new means to be the centroids of the observations in the new clusters.

$$m_i^{(t+1)} = \frac{1}{\left| S_i^{(t)} \right|} \sum_{x_j \in S_i^{(t)}} x_j$$
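The two alternating steps can be sketched in plain Python as follows. This is a minimal illustration of the standard iterative refinement, not the modified algorithm proposed later in the paper; the function names, the `max_iter` cap, and the choice to keep the old mean when a cluster becomes empty are assumptions of the sketch.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(
    points: Sequence[Point],
    initial_means: Sequence[Point],
    max_iter: int = 100,
) -> Tuple[List[Point], List[int]]:
    """Alternate assignment and update steps until the assignments stop changing."""
    means: List[Point] = [tuple(m) for m in initial_means]
    assignment: Optional[List[int]] = None
    for _ in range(max_iter):
        # Assignment step: each point goes to exactly one cluster, the nearest mean
        # (ties are broken in favor of the lowest cluster index).
        new_assignment = [
            min(range(len(means)), key=lambda i: squared_distance(p, means[i]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: assignments no longer change
        assignment = new_assignment
        # Update step: each mean becomes the centroid of its assigned points.
        for i in range(len(means)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # assumption: an empty cluster keeps its previous mean
                means[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, (assignment if assignment is not None else [])


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(k_means(pts, [(0.0, 0.0), (10.0, 10.0)]))
```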

The algorithm has converged when the assignments no longer change. As it is a heuristic algorithm, there is no guarantee that it will converge to the global optimum, and the result may depend on the initial clusters. As the algorithm is usually very fast, it is common to run it multiple times with different starting conditions. However, in the worst case, k-means can be very slow to converge: in particular, it has been shown that there exist certain point sets, even in 2 dimensions, on which k-means takes exponential time, that is $2^{\Omega(n)}$, to converge (Vattani, 2011). These point sets do not seem to arise in practice: this is corroborated by the fact that the smoothed running time of k-means is polynomial (Arthur et al., 2009).

3. K-MEANS BASED ON RFM DATA ANALYSIS (RFM K-MEANS)
To improve how cloud computing stores files, k-means based on RFM data analysis (RFM k-means) is proposed in this study to enhance the service performance of cloud computing. In RFM k-means, the size of a file first determines whether it is saved at the file level or the block level. Next, the popularity of the file is analyzed by the proposed popularity RFM factor. Finally, all files are clustered by the modified k-means clustering algorithm, so that files with similar characteristics or properties are stored in the same cluster. In other words, RFM k-means can enhance the efficiency of file access in cloud computing.

3.1 The Storage of RFM k-means
Block level storage has been used in past distributed file systems (The Hadoop Distributed File System), but when a file is smaller than the block size, the extra space is wasted, and the access speed is still that of a whole block. Therefore, RFM k-means stores each file according to its size: large files use block level storage and small files use file level storage. For the transmission of file updates, both multi-point and single-point propagation are used in RFM k-means, in order to exploit the parallelism of block level storage and the direct access of file level storage.
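The size-based storage decision can be sketched as below. The block size constant is an assumption chosen only for illustration, since the paper does not state one, and the function names are hypothetical.

```python
# Assumed block size for illustration only; the paper does not specify a value.
BLOCK_SIZE_BYTES = 64 * 1024 * 1024


def storage_level(file_size: int, block_size: int = BLOCK_SIZE_BYTES) -> str:
    """Return 'file' for files smaller than one block, otherwise 'block'."""
    return "file" if file_size < block_size else "block"


def blocks_needed(file_size: int, block_size: int = BLOCK_SIZE_BYTES) -> int:
    """Number of blocks a block-level file occupies (ceiling division)."""
    return -(-file_size // block_size)


if __name__ == "__main__":
    for size in (4 * 1024, BLOCK_SIZE_BYTES, 3 * BLOCK_SIZE_BYTES + 1):
        print(size, storage_level(size), blocks_needed(size))
```

A small file stored at the file level avoids occupying a mostly empty block, while a file spanning several blocks can be read from those blocks in parallel.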
3.2 The Popularity RFM Factor Analysis
In the proposed RFM k-means, the popularity RFM factor is used to analyze the attributes of the data. The traditional RFM factor used in customer relationship management is modified to suit the cloud-computing environment. In the modified RFM data analysis, the R factor is Recency, meaning the last access time of the file; the F factor is Frequency, meaning the access frequency of the file; and the M factor is aMounts, meaning the amount of time required to access the file. RFM k-means uses the popularity RFM factor to analyze each file, and the analysis data are stored on the Metadata Server. Once a file has been analyzed by the popularity RFM factor, its popularity is obtained. For each file, the values of R, F and M are numbers between 0 and 1. The popularity of a file also affects its update speed. After the RFM analysis, the value of each file is known, and all files are then clustered by this value.

3.3 The Clustering of RFM k-means
The data types provided in a cloud-computing environment differ. Therefore, assembling resources of the same type and providing the corresponding services not only gives better responses to service requests, but also makes better use of cloud resources. With clustering, files are divided into clusters by tendency or pattern; hence, files of a similar nature are clustered together. Files on cloud storage need to be divided into a number of different clusters, which is well suited to the clustering analysis used in data mining. In this study, the k-means clustering algorithm is used to segment files by attribute according to their popularity. RFM k-means first separates out the main types of file clusters, and then stores files according to their attributes to enhance cloud-computing service performance.

The architecture of RFM k-means clustering is shown in Figure 1. First, the historical information of files stored or retrieved is transformed and saved to the database. Then, the usage records of cloud users are imported into the clustering algorithm. Finally, according to the characteristics of the file attributes, all files are segmented into different clusters.
[Figure 1 diagram; recoverable labels: Clustering Algorithm, Access Database]

Figure 1. The architecture of RFM k-means clustering

When a file is stored in the cloud-computing environment, it is first pre-processed by RFM k-means. Then, the file is stored at the block level or file level according to its size. The Metadata Server records the file access information and performs the popularity RFM factor analysis. Finally, using the results of the popularity RFM factor analysis, the files are assigned to different clusters according to their characteristics. The RFM k-means data clustering is modified from the k-means method. There are four steps, as shown in Figure 3.

[Step 1] Collect data
[Step 2] The popularity RFM factor analysis
[Step 3] Clustering
[Step 4] Record the clustering results

In RFM k-means, the historical data of each file's use during the previous period is collected first in [Step 1]. In [Step 2], the historical data of the file is normalized into the popularity RFM factor, and RFM k-means performs the popularity analysis. In [Step 3], based on the results of the popularity RFM factor analysis, files are assigned to the matching RFM clusters. [Step 4] records the clustering results. In addition, the distance between clusters is checked; if the distance is too large, the distance of the popularity RFM factor is re-adjusted. In the next period, files are then re-assigned to other clusters.
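[Step 1] and [Step 2] can be sketched as follows, under stated assumptions: the paper says only that R, F and M are normalized to values between 0 and 1, so min-max scaling is an assumed choice, and the `AccessRecord` fields and function names are hypothetical. In this sketch a more recent last access gives a larger R, and M is scaled without assuming whether a long access time should raise or lower popularity.

```python
from typing import Dict, List, NamedTuple, Tuple


class AccessRecord(NamedTuple):
    """Hypothetical per-file history collected in [Step 1]."""
    last_access: float  # timestamp of the most recent access, in seconds
    access_count: int   # number of accesses during the previous period
    access_time: float  # amount of time required for file access, in seconds


def min_max(values: List[float]) -> List[float]:
    """Scale values linearly into [0, 1]; identical values all map to 0.0."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def popularity_factors(history: Dict[str, AccessRecord]) -> Dict[str, Tuple[float, float, float]]:
    """Normalize each file's history into an (R, F, M) triple in [0, 1] ([Step 2])."""
    names = list(history)
    r = min_max([history[n].last_access for n in names])
    f = min_max([float(history[n].access_count) for n in names])
    m = min_max([history[n].access_time for n in names])
    return {n: (r[i], f[i], m[i]) for i, n in enumerate(names)}


if __name__ == "__main__":
    history = {
        "a.log": AccessRecord(100.0, 1, 2.0),
        "b.mp4": AccessRecord(200.0, 11, 4.0),
        "c.doc": AccessRecord(150.0, 6, 3.0),
    }
    print(popularity_factors(history))
```

The resulting triples are points in the unit cube, which is the input a clustering step such as [Step 3] would consume.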


For example, suppose RFM k-means clustering is set to 8 equal divisions (2*2*2) in the cloud-computing environment. When a file is first saved in the cloud storage, the default popularity RFM factor is used to cluster it. Then, in [Step 1], the attributes (RFM) of each file are collected. In [Step 2], the data collected in [Step 1] are normalized for the popularity RFM factor analysis. In [Step 3], each file is assigned to its corresponding cluster. [Step 4] records the clustering results and reconfigures the distance between clusters. If some cluster is overloaded, the file clustering is reconfigured and used in the next round.
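One way to read the 2*2*2 example is that each of the three normalized factors is split into two equal halves, giving 8 cells of the unit cube. The sketch below follows that reading, which is an interpretation rather than something the paper states; `grid_cell` and `grid_centers` are hypothetical names, and using the cell centers as starting means for the clustering step is likewise an assumption.

```python
from itertools import product
from typing import List, Tuple


def grid_cell(r: float, f: float, m: float, divisions: int = 2) -> int:
    """Index (0 .. divisions**3 - 1) of the equal-width cell holding (r, f, m) in [0, 1]^3."""
    def bucket(v: float) -> int:
        # A value of exactly 1.0 belongs to the last bucket.
        return min(int(v * divisions), divisions - 1)
    return (bucket(r) * divisions + bucket(f)) * divisions + bucket(m)


def grid_centers(divisions: int = 2) -> List[Tuple[float, ...]]:
    """Centers of all cells, ordered so that grid_centers()[grid_cell(...)] is the matching cell."""
    step = 1.0 / divisions
    coords = [(i + 0.5) * step for i in range(divisions)]
    return list(product(coords, repeat=3))


if __name__ == "__main__":
    centers = grid_centers()
    cell = grid_cell(0.9, 0.2, 0.6)
    print(cell, centers[cell])
```

The cell centers could serve as default starting means before any access history exists, with later rounds moving them as the recorded clustering results are reconfigured.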

[Figure 3 flowchart; recoverable labels: Data input; Master Node; Collect data; The popularity RFM factor analysis; Clustering; Metadata Server; Record the clustering results]

Figure 3. The flowchart of RFM k-means data clustering

4. CONCLUSIONS AND FUTURE WORK
According to the characteristics of cloud computing, cloud service providers can support application services on the Internet. However, there are many applications and data centers in the cloud-computing environment; hence, the issue of service performance must be considered. To improve how cloud computing stores files, k-means based on RFM data analysis (RFM k-means) is proposed in this study to enhance the service performance of cloud computing. In RFM k-means, the size of a file first determines whether it is saved at the file level or the block level. Next, the popularity of the file is analyzed by the proposed popularity RFM factor. Finally, all files are clustered by the modified k-means clustering algorithm, so that files with similar characteristics or properties are stored in the same cluster. In other words, RFM k-means can enhance the efficiency of file access in cloud computing.

In this study, files are clustered based only on the results of the popularity RFM factor analysis; the amount of load that each cluster can tolerate is not considered. Therefore, if the system is in an idle state, a faster response time cannot be achieved. In future studies, the system state will be considered.

ACKNOWLEDGMENT
This work was supported in part by the Taiwan National Science Council under Grants NSC101-2221-E-324-032 and 101-2221-E-324-0343.

REFERENCES
Arthur, D., Manthey, B. and Roeglin, H. 2009. k-means has polynomial smoothed complexity. Paper Presented at the 50th Symposium on Foundations of Computer Science (FOCS).
Aymerich, F.M., Fenu, G. and Surcis, S. 2008. An approach to a cloud computing network. Paper Presented at the 1st International Conference on the Applications of Digital Information and Web Technologies, 113-118.
Cai, B., Xie, C. and Zhu, G. 2007. EDRFS: an effective distributed replication file system for small-file and data-intensive application. Paper Presented at the 2007 2nd International Conference on Communication Systems Software and Middleware, 1-7.
Chang, R.S., Chang, H.P. and Wang, Y.T. 2008. A dynamic weighted data replication strategy in data grids. Paper Presented at the IEEE/ACS International Conference on Computer Systems and Applications, 414-421.
Grossman, R.L. 2009. The case for cloud computing. IT Professional, 11 (2), 23-27.
Gwertzman, J. and Seltzer, M. 1995. The case for geographical push-caching. Paper Presented at the 5th Annual Workshop on Hot Operating Systems, 51-55.
Hughes, A.M. 1994. Strategic Database Marketing. Chicago, Probus Publishing.
Kahan, R. 1998. Using database marketing techniques to enhance your one-to-one marketing initiatives. Journal of Consumer Marketing, 15, 491-493.
Liu, C.N. and Zhu, X.W. 2009. A study on CRM technology implementation and application practice. Paper Presented at the IEEE Internet Conference on Computational Intelligence and Natural Computing (CINC), Nanchang, China, 367-370.
Luo, Y. 2010. Network I/O virtualization for cloud computing. IT Professional, 12 (5), 36-41.
MacKay, D. 2003. An example inference task: clustering. Information Theory, Inference and Learning Algorithms (Chapter 20). Cambridge University Press, 284-292.
MacQueen, J.B. 1967. Some methods for classification and analysis of multivariate observations. Paper Presented at the 5th Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, 281-297.
Peppers, D. and Rogers, M. 1997. Enterprise one to one: tools for competing in the interactive age. Doubleday, New York.
Ranganathan, K. and Foster, I. 2001. Identifying dynamic replication strategies for a high-performance data grid. Paper Presented at the International Workshop on Grid Computing, 75-86.
Rebecca, K., Justin, P. and Graeme, P. 2012. Ethical considerations and guidelines in web analytics and digital marketing: a retail case study. Paper Presented at the 6th Australian Institute of Computer Ethics conference, Melbourne, Victoria, 1 5-12.
Rimal, B.P., Choi, E. and Lumb, I. 2009. A taxonomy and survey of cloud computing. Paper Presented at the NCM2009 5th International Joint Conference on INC, IMS and IDC, 44-51.
Tzeng, W.G. 2010. Data confidentiality and robustness in decentralized cloud storage systems. Dissertation of National Chiao Tung University.
Vattani, A. 2011. k-means requires exponentially many iterations even in the plane. Discrete and Computational Geometry, 45 (4), 596-616.
Vouk, M.A. 2008. Cloud computing - issues, research and implementations. Information Technology Interfaces, 31-40.
Wang, G. and Ng, T.S.E. 2010. The impact of virtualization on network performance of Amazon EC2 data center. Paper Presented at the 29th IEEE Conference on Computer Communications (IEEE INFOCOM), 1-9.
Weiss, A. 2007. Computing in the clouds. netWorker, 11 (4), 16-25.
The Hadoop Distributed File System, http://hadoop.apache.org/hdfs/.
