
2017 International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT)

An Evaluation of MapReduce Framework in Cluster Analysis
Tanvir Habib Sardar, Ahmed Rimaz Faizabadi, Zahid Ansari
Dept. of CSE, P.A. College of Engineering
tanvir_is@pace.edu.in, rimaz_is@pace.edu.in, zahid_cs@pace.edu.in

Abstract— Data is growing exponentially due to the World Wide Web and scientific enhancements, and it requires proper strategies and techniques to deal with it. This, in turn, increases the computational requirements. Efficient processing paradigms and smart implementation architectures are the key to meeting the scalability and performance requirements of large scale data analysis. Clustering is one of the popular data mining techniques used to analyse datasets. The MapReduce programming paradigm, on top of the Hadoop distributed architecture, is widely used today in order to obtain efficiency for large dataset clustering. In this work, we have experimented with and validated the efficiency of the K-means clustering algorithm in the MapReduce paradigm over the Hadoop architecture while processing datasets of different sizes in combination with different Hadoop cluster sizes.

Keywords— Clustering, Data Mining, MapReduce, Hadoop, Distributed Computing.

I. INTRODUCTION

Data is growing exponentially due to the digital enhancements of the modern world, and it requires proper strategies and techniques to deal with it [1]. Data mining is known as the process of analysing data to extract interesting patterns and knowledge. Data mining is used to analyse different types of data [2].

Clustering is the process by which a set of data objects is partitioned into subsets such that the data elements of the same cluster are similar to one another and different from the elements of other clusters [3]. Clustering is a classification of similar objects into several different groups; it is usually applied in the analysis of statistical data and is utilized in various fields, for example machine learning, data mining, pattern recognition, image analysis and bioinformatics [4].

The method of the K-means algorithm is as follows [5] (a single-machine sketch of these steps is given after the list):

1) Determine the number of clusters k from the dataset. This is done with theoretical and conceptual considerations that may be proposed to determine how many clusters are to be formed.

2) Generate k centroids (the centre points of the clusters), beginning at random. The initial centroids are chosen at random from the objects provided; the i-th cluster centroid is then calculated using the following formula:

$v = \frac{1}{n}\sum_{i=1}^{n} x_i$

where v is the cluster centroid, x_i is an object, and n is the number of objects that are members of the cluster.

3) Calculate the difference of the data objects in terms of the distance of each object to each centroid. To calculate the distance between an object and a centroid, K-means uses the Euclidean distance measure; for 2-D points, $d(x, v) = \sqrt{(x_1 - v_1)^2 + (x_2 - v_2)^2}$.

4) Each object is allocated to the nearest centroid. The allocation of objects into each cluster during an iteration is generally done explicitly, where every object is declared a member of the cluster by measuring the proximity of the object to the centre point of the cluster.

5) Iterate, and then specify the new centroid positions using the equation of step 2.

6) Repeat from step 3 if the new centroid positions are not the same as the old ones.
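The six steps above can be condensed into a short single-machine routine. The listing below is only an illustrative sketch of the textbook algorithm for 2-D points; the class name, variable names and the exact stopping tolerance are choices made for this example and are not taken from the paper's implementation.

import java.util.List;
import java.util.Random;

/** Illustrative single-machine K-means for 2-D points, following steps 1-6 above. */
public class SimpleKMeans {

    /** Euclidean distance between a point and a centroid (step 3). */
    static double distance(double[] p, double[] c) {
        double dx = p[0] - c[0], dy = p[1] - c[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    /** Clusters the points and returns the final k centroids (steps 2, 4, 5, 6). */
    static double[][] cluster(List<double[]> points, int k) {
        Random rnd = new Random();
        double[][] centroids = new double[k][];
        for (int j = 0; j < k; j++)                       // step 2: random initial centroids
            centroids[j] = points.get(rnd.nextInt(points.size())).clone();

        boolean moved = true;
        while (moved) {                                   // step 5: iterate
            double[][] sum = new double[k][2];
            int[] count = new int[k];
            for (double[] p : points) {                   // step 4: assign to the nearest centroid
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (distance(p, centroids[j]) < distance(p, centroids[best])) best = j;
                sum[best][0] += p[0];
                sum[best][1] += p[1];
                count[best]++;
            }
            moved = false;
            for (int j = 0; j < k; j++) {
                if (count[j] == 0) continue;              // leave an empty cluster unchanged
                // step 2's formula: v = (1/n) * sum of member points
                double[] v = { sum[j][0] / count[j], sum[j][1] / count[j] };
                if (distance(v, centroids[j]) > 1e-9) moved = true;   // step 6: stop when stable
                centroids[j] = v;
            }
        }
        return centroids;
    }
}

Every iteration of this loop scans the whole dataset, which is exactly what becomes expensive as the data grows and what motivates the distributed formulation discussed next.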

With the rapid growth of data volume and the development of information technology, traditional clustering algorithms cannot meet the demand [6]. Distributed computing is a method aimed at solving computational problems mainly by sharing the computation over a network of interconnected devices. Each individual system connected to the network is called a node, and the collection of many nodes that forms the network is called a cluster. Apache Hadoop is an open source architecture for processing large datasets, developed in Java. It is a framework that supports the distributed processing of large datasets across a network of computers using simple programming modules. The MapReduce programming paradigm is used by the Hadoop architecture to process large datasets across a number of connected computing nodes [7].

The MapReduce programming model works as follows [8]: the system takes a set of input key/value pairs and produces a set of output key/value pairs. MapReduce executes the entire task as two functions, Map and Reduce.

• The Map function takes an input pair and outputs a set of intermediate key/value pairs. The MapReduce framework collects all intermediate values associated with the same intermediate key i and passes them to the Reduce function.

• The Reduce function receives an intermediate key i and the set of values for that key. It combines these values to form a possibly smaller set of values. Generally, just zero or one output value is produced per Reduce invocation.
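As a concrete illustration of this key/value contract, the classic word-count Mapper and Reducer are sketched below using the Hadoop Java API. This example is included only to show the shape of the two functions and is unrelated to the clustering workload studied in this paper.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Map: (offset, line of text) -> (word, 1) intermediate pairs. */
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);                  // emit an intermediate key/value pair
        }
    }
}

/** Reduce: (word, [1, 1, ...]) -> (word, count); one output value per key. */
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();   // all intermediate values for the same key
        context.write(key, new IntWritable(sum));
    }
}

The grouping of intermediate pairs by key between the two functions is performed by the framework, not by the programmer.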
A basic MapReduce data processing scheme is depicted in figure 1.

Fig. 1: A Basic MapReduce Data Processing Scheme (the input dataset is divided into splits 1..n, each split is processed by a map task, and the reduce tasks produce the output)

Hadoop Distributed File System (HDFS) is a distributed file system managed by Hadoop. It provides access to large amounts of data on a Hadoop cluster [9]. HDFS is intended to run on commodity hardware. It has many resemblances with the distributed file systems in use today, but HDFS has a few benefits over traditional distributed file systems, such as [10]:

• HDFS is extremely fault-tolerant.
• It can be deployed on low-cost hardware.
• It gives high-throughput access to data.
• It can cope with large datasets.

HDFS runs on a master/slave architecture. It consists of a certain number of datanodes (slaves) and a namenode (master). First, a dataset for processing is uploaded to the namenode; the namenode then divides the dataset into blocks of data and sends them to the individual datanodes for processing, and finally the results obtained from the different slaves are accumulated.
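The upload step described above can be performed from Java as well as from the command line. The snippet below is a minimal sketch using the Hadoop FileSystem API; the file paths are placeholders chosen for this illustration, and a configured Hadoop client (core-site.xml and hdfs-site.xml on the classpath) is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up the cluster configuration
        FileSystem fs = FileSystem.get(conf);
        // Copy the local 2-D point file into HDFS; the namenode splits it into
        // blocks and distributes the blocks (with replicas) across the datanodes.
        fs.copyFromLocalFile(new Path("/local/points.txt"),
                             new Path("/user/hadoop/input/points.txt"));
        fs.close();
    }
}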
In this work, we proposed a modification of the traditional K-means algorithm into the MapReduce paradigm in order to execute it over the Hadoop platform. We have then evaluated the results obtained from the MapReduced K-means with different cluster and dataset sizes.

II. METHODOLOGY

In our proposed methodology the traditional K-means algorithm is modified with the MapReduce paradigm so that the MapReduced modification of K-means can be executed over a Hadoop cluster. The objective behind this work is to find out the efficiency gain (in terms of execution time) of the MapReduced K-means over Hadoop clusters of different sizes. Our modified K-means is required to be executed over a cluster of nodes: commodity machines sharing a private LAN managed by a switch, one of which acts as the master node and supervises data and flow control over all other nodes in the Hadoop cluster.

The experiments are conducted on datasets consisting of 2-D data points. This dataset is not pre-processed from any other dataset, and it does not require pre-processing either. Hence, the dataset can be placed directly on the master node, and HDFS takes care of splitting it into pieces, sending them to the different nodes and accumulating the results. The master node in a Hadoop cluster first splits the dataset into different data chunks and then sends each chunk to a slave node. The tracking of the data chunks is the sole responsibility of HDFS; a MapReduce programmer is unaware of it. The MapReduced K-means starts a mapper operation on each slave node as soon as a chunk of data is available there.
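A minimal job driver illustrates this division of labour: the programmer only names the Mapper and Reducer classes and the HDFS input and output paths, while chunk placement, scheduling and data movement are handled by the framework. The listing below is a sketch under these assumptions; ClusteringDriver is a hypothetical class name, the kmeans.centroid.path property is invented for this illustration, and KMeansMapper and KMeansReducer refer to the hypothetical classes sketched after the algorithm steps below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Minimal driver: configure the job, point it at the HDFS paths and submit it. */
public class ClusteringDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("kmeans.centroid.path", args[2]);        // assumed property, read by the Mapper sketch below
        Job job = Job.getInstance(conf, "mapreduce-kmeans");
        job.setJarByClass(ClusteringDriver.class);
        job.setMapperClass(KMeansMapper.class);           // hypothetical classes, sketched below
        job.setReducerClass(KMeansReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // dataset already stored in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // new centroids are written here
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}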

The initial step in designing the MapReduce routines for the K-means algorithm is to define and analyse the input and output of the implementation. The input is given as a (key, value) pair, where the "key" is the cluster centroid and the "value" is the data points in the dataset. The input centroids (k) to the K-means are provided in the code itself in our design. The input is, as usual, k 2-D points selected from the dataset. In each iteration, the centroid values produced by the reduce operation are changed and recalculated to new values. This can be achieved by following the algorithm below to design the MapReduce routines for K-means clustering.

Each data chunk created by the master node's HDFS (to be sent to the slaves' HDFS) is a part of the data points provided in the dataset. Each Map operation of K-means calculates the centroid based on the points it received in its data chunk. So each slave has now created its own K-means centroids for the particular part of the data points it has received. Once the Mapper is invoked, the given vector is assigned to the cluster it is most closely related to. After the assignment, the centroid of that particular cluster is recalculated.

The recalculation is done by the Reduce routine, which also rearranges the clusters to avoid the creation of clusters with extreme sizes, i.e. a cluster having too few data vectors or a cluster having too many data vectors. Finally, once the centroid of a given cluster is updated, the new set of vectors and clusters is re-written to disk and is ready for the next iteration.

Proposed modified K-means: the input dataset is split into different data chunks at the master node and then sent to the different slaves. The Map function at each node adjusts the cluster centres based on the input data points in the split it has received. The reducer gathers the data value information of each cluster and re-computes the k cluster centres.

Algorithm (a simplified sketch of the Mapper and Reducer follows the steps):

Step 1: Initially, centroids are randomly selected based on the data. In our implementation we used 3 randomly chosen centroid values.
Step 2: The input file contains the initial centroids and the data.
Step 3: In the Mapper class, the "configure" function is used to open the file, read the centroids and store them in a data structure (an ArrayList is used for this).
Step 4: The Mapper reads the data file and emits the nearest centroid together with the point to the reducer.
Step 5: The Reducer collects all this data, calculates the new corresponding centroids and emits them.
Step 6: In the job configuration, we read both files and check whether there is a change between the old and the new centroids; if the change is less than 0.1, convergence is reached, else step 2 is repeated with the new centroids.
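The listing below is a simplified sketch of how Steps 3 to 5 could be realised with the Hadoop Java API. It is not the authors' original source code: the class names, the kmeans.centroid.path property, the comma-separated file format and the use of the newer Mapper/Reducer API (setup instead of the older configure method mentioned in Step 3) are all assumptions made for this illustration.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Steps 3-4: each mapper loads the current centroids and emits (nearest centroid, point). */
class KMeansMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final List<double[]> centroids = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Step 3: read the current centroid file from HDFS into an ArrayList.
        Configuration conf = context.getConfiguration();
        Path centroidPath = new Path(conf.get("kmeans.centroid.path"));   // assumed property name
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(FileSystem.get(conf).open(centroidPath)))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] t = line.trim().split(",");
                centroids.add(new double[] { Double.parseDouble(t[0]), Double.parseDouble(t[1]) });
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Step 4: emit the nearest centroid (as key) together with the point (as value).
        String[] t = value.toString().trim().split(",");
        double x = Double.parseDouble(t[0]), y = Double.parseDouble(t[1]);
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < centroids.size(); j++) {
            double dx = x - centroids.get(j)[0], dy = y - centroids.get(j)[1];
            double d = Math.sqrt(dx * dx + dy * dy);       // Euclidean distance
            if (d < bestDist) { bestDist = d; best = j; }
        }
        context.write(new Text(centroids.get(best)[0] + "," + centroids.get(best)[1]),
                      new Text(x + "," + y));
    }
}

/** Step 5: the reducer averages all points assigned to a centroid and emits the new centroid. */
class KMeansReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        double sumX = 0, sumY = 0;
        long n = 0;
        for (Text v : values) {
            String[] t = v.toString().split(",");
            sumX += Double.parseDouble(t[0]);
            sumY += Double.parseDouble(t[1]);
            n++;
        }
        // New centroid = mean of the member points.
        context.write(new Text(sumX / n + "," + sumY / n), new Text(String.valueOf(n)));
    }
}

Step 6 then lives in the driver: after each job it reads the reducer output from HDFS, compares the old and new centroids, and resubmits the job with the new centroid file until every centroid moves by less than 0.1.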
Figure 2 is a depiction of the proposed algorithm.

Fig. 2: The Execution of Proposed Algorithm (the input dataset is divided into splits 1..n, each split is processed by a map task 1..n, the reduce tasks recompute the centroids, and the output is the clustered data)

III. RESULT AND DISCUSSION

The experiments are conducted on datasets consisting of 2-D data points. This dataset is not pre-processed from any other dataset, and it does not require pre-processing either.

The experimentation on datasets of different sizes is conducted with combinations of different cluster sizes to analyse the efficiency of the distributed processing provided by Hadoop. For example, a dataset with 100,000 2-D points is experimented with on 3-node, 5-node, 8-node and 10-node clusters. Similarly, experiments with datasets consisting of 150,000, 200,000 and 300,000 2-D points are conducted, and the execution time is recorded in seconds. A brief description of the different execution times with respect to the different datasets is provided in the following table. Our experiments to evaluate the Hadoop cluster are based on the observation of cluster scale-up: keeping the data size constant and increasing the cluster size.

Table 1. Execution Time in Seconds

Data Size   100,000 points   150,000 points   200,000 points   300,000 points
3 node      270              380              490              630
5 node      210              290              370              480
8 node      115              200              260              380
10 node     65               110              130              190

To measure the performance gain of our modified K-means, we have conducted a few experiments which can be classified into 4 major observations.

In the first observation, we fixed the dataset to 100,000 2-D points and executed the K-means algorithm modified using the MapReduce paradigm against different cluster sizes. We executed the proposed algorithm starting with a 3-node cluster, then 5-node and 8-node clusters, up to a 10-node cluster, and experienced a reduction in execution time to 270, 240, 200 and 150 seconds respectively. We have observed from the result that as the cluster size increases the execution time of the program decreases.

In the second observation, we fixed the dataset to 150,000 2-D points and executed the K-means algorithm modified using the MapReduce paradigm against different cluster sizes. Experiments on 3-node, 5-node and 8-node up to 10-node clusters show a reduction in execution time to 6 minutes and 20 seconds, 5 minutes and 30 seconds, 5 minutes and 55 seconds, and 4 minutes respectively. We have observed from the result that as the cluster size increases the execution time of the program decreases.

In the third observation, we fixed the dataset to 200,000 2-D points and executed the K-means algorithm modified using the MapReduce paradigm against different cluster sizes. Experiments on 3-node, 5-node and 8-node up to 10-node clusters show a reduction in execution time to 8 minutes and 10 seconds, 7 minutes and 05 seconds, 6 minutes and 15 seconds, and 5 minutes and 10 seconds respectively.

In the fourth observation, we fixed the dataset to 300,000 2-D points and executed the K-means algorithm modified using the MapReduce paradigm against different cluster sizes. Experiments on 3-node, 5-node and 8-node up to 10-node clusters show a reduction in execution time to 10 minutes and 30 seconds, 9 minutes and 10 seconds, 8 minutes and 10 seconds, and 6 minutes and 45 seconds respectively.

We have scrutinized the results obtained from all four experiments in order to analyse Hadoop performance and draw conclusions. We have taken 50,000 data points as the unit of performance evaluation in order to critically analyse the performance obtained from the different Hadoop clusters. To reduce each measurement to this 50,000-point unit, the execution time of the dataset containing 100,000 points is divided by 2 (for example, 270/2 = 135 for 3-node cluster computing), the execution time of the dataset containing 150,000 points is divided by 3, and so on. Therefore, for each 50,000 data points, the time taken by the Hadoop clusters to execute the parallel K-means algorithm is taken into account, and the ratio of their respective performance is calculated.

A ratio is a quantitative relation between two amounts showing the number of times one value contains or is contained within the other. The ratio is calculated by keeping the 10-node execution time equal to 1 for each dataset. Thus, it is very useful for determining the difference between the theoretical (ideal) and practical (experimental) Hadoop performance. The details of our analysis are provided in table 2. It can clearly be seen that the reduction of execution time with increasing cluster size is not very smooth, i.e. it cannot be depicted as a straight line in a line graph. This is due to the fact that the nodes of a cluster carry many overheads, such as the processing overhead of system processes and basic tools (like background antivirus tools) executing with different priorities, the networking (data transfer) overhead between nodes, and data overhead (the data the namenode stores in the journal/fsimage, the replication factor of Hadoop, HDFS checksums), etc.
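The normalisation and ratio computation just described is simple arithmetic. The short sketch below reproduces table 2 from the table 1 values; it is provided only as a worked check of the calculation, with the arrays copied directly from table 1.

/** Reproduces the per-50,000-point times and ratios of table 2 from table 1. */
public class ScaleUpRatios {
    public static void main(String[] args) {
        int[] points = { 100_000, 150_000, 200_000, 300_000 };
        String[] nodes = { "3-node", "5-node", "8-node", "10-node" };
        // Execution times in seconds from table 1, one row per cluster size.
        double[][] seconds = {
            { 270, 380, 490, 630 },   // 3 node
            { 210, 290, 370, 480 },   // 5 node
            { 115, 200, 260, 380 },   // 8 node
            {  65, 110, 130, 190 }    // 10 node
        };
        for (int d = 0; d < points.length; d++) {
            double tenNodeUnit = seconds[3][d] / (points[d] / 50_000.0);
            for (int c = 0; c < nodes.length; c++) {
                // Time per 50,000 points, e.g. 270 / 2 = 135 for the 3-node cluster.
                double perUnit = seconds[c][d] / (points[d] / 50_000.0);
                // Ratio relative to the 10-node cluster, whose ratio is fixed at 1.
                double ratio = perUnit / tenNodeUnit;
                System.out.printf("%d pts, %s: %.2f s per 50,000 pts, ratio %.2f%n",
                                  points[d], nodes[c], perUnit, ratio);
            }
        }
    }
}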

Table 2: For Each Dataset, Execution Time per 50,000 Points (in sec) and the Respective Ratio

Ratio Calculation   3-Node   5-Node   8-Node   10-Node
100,000 pts         135      105      57.5     32.5
Ratio               4.15     3.23     1.76     1
150,000 pts         126.66   96.66    66.66    36.66
Ratio               3.45     2.71     1.81     1
200,000 pts         122.5    92.5     65       32.5
Ratio               3.76     2.84     2        1
300,000 pts         105      80       63.66    31.66
Ratio               3.3      2.52     2        1

In table 2, the ratio specifies the performance gain obtained by the larger Hadoop clusters. For example, it shows that for the dataset consisting of 100,000 data points, the 10-node cluster is 4.15 times faster than the 3-node cluster.

The overall combined scenario of the execution of the proposed K-means algorithm with respect to the different Hadoop cluster sizes and data points is provided in the bar chart of fig. 3.

Fig. 3: The Tradeoff between cluster size and datapoints in MapReduced KMeans (execution time in seconds for the 100,000, 150,000, 200,000 and 300,000 point datasets, grouped by 3-node, 5-node, 8-node and 10-node clusters)

IV. CONCLUSIONS

With the huge volume of data generated by the World Wide Web and scientific experiments, Hadoop clusters are being formed and widely used in order to cope with processing huge volumes of data. The MapReduce programming model is designed to let programmers develop code for the Hadoop framework. Four datasets of different sizes are used and experimented with on four Hadoop clusters of different sizes. From our experiments, it has been observed that as the number of nodes increases the execution time decreases. We also found some interesting cases, which lead to the conclusion that the speed-up of clustering obtained by Hadoop processing is not uniform; the various performance changes recorded during the experiments are shown in the performance graph chart.

REFERENCES

[1] Lovele Sharma and Vivek Srivastava, "Performance Enhancement of Information Retrieval via Artificial Intelligence", International Journal of Scientific Research in Science, Engineering and Technology, Volume 3, Issue 1, January-February 2017.
[2] Bansal, Sharma and Goel, "Improved K-mean Clustering Algorithm for Prediction Analysis using Classification Technique in Data Mining", International Journal of Computer Applications (0975-8887), Volume 157, No. 6, January 2017.
[3] Satish Chaurasiya and Ratish Agrawal, "K-Means Clustering With Initial Centroids Based On Difference Operator", International Journal of Innovative Research in Computer and Communication Engineering, Vol. 5, Issue 1, January 2017.
[4] A. Chauhan, G. Mishra, and G. Kumar, "Survey on Data Mining Techniques in Intrusion Detection", vol. 2, no. 7, pp. 2-5, 2011.
[5] Zulfadhilah, Prayudi and Riadi, "Cyber Profiling using Log Analysis and K-Means Clustering", International Journal of Advanced Computer Science and Applications, Vol. 7, No. 7, 2016.
[6] Ezhilan, Vignesh Kumar, Arvind Kumar and Afzal Khan, "Implementation of Optimised K-Means Clustering on Hadoop Platform", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 6, Issue 2, February 2016.
[7] Chaturbhuj and Chaudhary, "Improved K-means clustering on Hadoop", International Journal on Recent and Innovation Trends in Computing and Communication, Volume 4, Issue 4, pages 601-604, April 2016.
[8] Shweta Mishra and Vivek Badhe, "Improved Map Reduce K Mean Clustering Algorithm for Hadoop Architecture", International Journal of Engineering and Computer Science, Volume 5, Issue 7, July 2016, Page No. 17144-17147.
[9] Raghu Garg and Himanshu Aggarwal, "Big Data Analytics Recommendation Solutions for Crop Disease using Hive and Hadoop Platform", Indian Journal of Science and Technology, 9(9), 2016.
[10] Cheng et al., "Research On Hdfs-based Web Server Cluster", 2011 International Conference on E-Business and E-Government (ICEE), 2011.
decreases. We also found some of the interesting
cases, which concludes to the fact that hadoop
