
Parallelization Of K-Means Clustering Algorithm

Amithash Prasad
Department of Electrical and Computer Engineering, University of Colorado, UCB 430, Boulder, CO 80309, USA
amithash.prasad@colorado.edu

Abstract: Clustering algorithms are used in various applications such as image analysis, image classification, performance analysis and financial modeling. The K-Means algorithm provides an efficient way of clustering multi-dimensional data into K groups by minimizing the intra-cluster distance and maximizing the inter-cluster distance[1]. One of the main restrictions of the k-means algorithm is that the number of clusters must be decided a priori; this restriction is removed by a method called node merging. The parallelization of the algorithm on a distributed-memory multi-processor using the Message Passing Interface is also discussed.

I. INTRODUCTION

Many means of clustering exist to this day. One of the most efficient of them is the K-Means algorithm[1], which enables one to cluster a set of multi-dimensional data into K clusters such that the intra-cluster distance is minimized and the inter-cluster distance is maximized. The algorithm is efficient but also computation intensive, with the time of computation directly dependent on the product of the number of data elements and the number of required clusters. The algorithm has three main disadvantages other than computation time. One, the clusters are assumed to be spherical in shape. Two, the algorithm requires the number of clusters to be known beforehand[2]. Three, the algorithm is inherently serialized, which makes it difficult to parallelize. The first two are alleviated by the K*-Means[2] algorithm, but this adds to the complexity of the algorithm and remains serialized, so its parallelization would be largely inefficient. In subsequent sections I will introduce modifications to the original k-means algorithm which try to eliminate the last two restrictions but continue to assume that the clusters are spherical in shape. Section II presents the background. Section III discusses the k-means algorithm and the modifications which allow efficient parallelization. Section IV discusses a method introduced in this paper called electrostatic partitioning, and Section V discusses the node merging algorithm introduced in this paper, which allows the application to recognize redundant clusters. Section VI summarizes the parallelization achieved. Section VII proposes theoretical limits of parallelization, Section VIII introduces the test bed for the experiment, and finally Section IX presents the results on the speedup of the parallelization.

II. BACKGROUND

Clustering can be considered the most important unsupervised learning problem; so, as every other problem of this kind, it deals with finding a structure in a collection of unlabeled data. In other words, clustering is the process of grouping similar data. This can be used in various applications from marketing to earthquake studies[3]. Clustering algorithms can be classified into four broad categories: exclusive, overlapping, hierarchical and probabilistic. In exclusive clustering, data is classified such that no single element belongs to two different groups. Overlapping clustering, on the other hand, is the exact opposite, where an element can belong to different clusters. Hierarchical clustering classifies data into groups in a hierarchical way, with some clusters being part of another (this method is useful in data mining problems). Probabilistic clustering is beyond the scope of this text. Among these categories the k-means clustering algorithm belongs to the exclusive clustering variant[3]. Most of the clustering algorithms in the literature use some form of distance measure, of which the Euclidean is the most common. Care must be taken in choosing the distance measure, as this will determine to a large extent the correctness of the clustering methodology. Some data sets use a circular or non-Cartesian data space, for which a more appropriate distance measure has to be used. For the subsequent part of this paper, I assume that the sample space is Cartesian in nature and hence adhere to the Euclidean distance measure. For the remaining part of the text, N is the total number of points in the sample space, M is the number of dimensions of each data point, and K is the number of clusters required.

III. THE K-MEANS CLUSTERING ALGORITHM

The K-Means clustering algorithm is an effective exclusive procedure to cluster N M-dimensional data points into K clusters. The algorithm starts with K points selected at random, chosen as the centroids for the data set. Once this is accomplished, each point in the sample space is associated to the centroid whose distance to the current point is minimum, after which the winning centroid re-adjusts its position to be the mean of its member points. This process is repeated iteratively till convergence. If $c_j;\ j = 0, 1, \dots, K-1$ are the centroids, where $c_{jm}$ is the $m$th dimension of the centroid element $c_j$, and $x_i;\ i = 0, 1, \dots, N-1$ are points in the sample space, where $x_{im}$ is the $m$th dimension of the point $x_i$, then the Euclidean distance $d_{ij}$ between these two points is

given by:

$d_{ij} = \sqrt{\sum_{m=0}^{M-1} (c_{jm} - x_{im})^2}$    (1)

Element $x_i$ is associated to a centroid $c_j$ if:

$d_{ij} = \min(d_{ik}),\quad k = 0, 1, \dots, K-1$    (2)
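To make the assignment step concrete, the following C sketch implements equations (1) and (2) over flat, row-major arrays. The function names and data layout are illustrative assumptions of mine, not code from the paper.

#include <float.h>
#include <stddef.h>

/* Equation (1): squared Euclidean distance between a point x and a centroid c,
   both stored as M consecutive doubles. The square root is omitted here because
   it is monotonic and therefore irrelevant for the comparison in equation (2). */
double dist_sq(const double *x, const double *c, int M)
{
    double d = 0.0;
    for (int m = 0; m < M; m++) {
        double diff = c[m] - x[m];
        d += diff * diff;
    }
    return d;
}

/* Equation (2): index j of the centroid closest to point x. */
int nearest_centroid(const double *x, const double *centroids, int K, int M)
{
    int best = 0;
    double best_d = DBL_MAX;
    for (int j = 0; j < K; j++) {
        double d = dist_sq(x, centroids + (size_t)j * M, M);
        if (d < best_d) { best_d = d; best = j; }
    }
    return best;
}

Because the square root is monotonic, comparing squared distances is equivalent to comparing the distances of equation (1).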

Following which, the centroids are readjusted to acknowledge this change by

$c_{jm} = \frac{\sum_{k=0}^{N_j - 1} x_{km}}{N_j};\quad m = 0, 1, \dots, M-1$    (3)

with

$\sum_{j=0}^{K-1} N_j = N$    (4)

where $x_{km}$ is the $m$th dimension of the $k$th element associated to the centroid $c_j$, and $N_j$ is the number of elements associated to centroid $c_j$. This process is continued for all points in the sample space, till no point changes its association (convergence) or a maximum number of iterations is reached. This procedure has a small drawback: as the centroids are updated on every element association, the data set cannot be partitioned among processors for effective parallelization. In order to overcome this disadvantage, the algorithm is modified to update the centroids as per equation 3 once every iteration instead of on every association. This modification makes the convergence and correctness of the algorithm completely dependent on the selection of the initial centroid points, which is dealt with in section IV. The parallelization of the modified k-means algorithm involves 3 all-reduction operations[4]: one to calculate the global mean of each centroid point, a second to compute the total number of data points in each cluster, and a third to compute a global OR which determines whether a change in association has occurred in the current iteration, which is also the exit condition of the iterative k-means algorithm. This is much better compared to the parallelization of the k-means algorithm in its original form, where 2N + 1 all-reduction operations are performed every iteration. The 3 all-reduction operations are further compacted into a single all-reduction operation by using the user-defined operation creation function offered by the Message Passing Interface[4]. Note that the N data points are evenly divided among the p processors, so each processor holds a subset of the sample space: each of the first p − 1 processors has N/p data points, while the last processor has the remaining N − (p − 1)(N/p) points.
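The following is a minimal MPI sketch of one iteration of the modified (per-iteration update) k-means described above. The paper compacts the three all-reductions into one via a user-defined MPI operation; this sketch achieves the same single-all-reduce structure by the simpler route of packing the per-cluster coordinate sums, the cluster counts and a change flag into one buffer reduced with MPI_SUM, so it should be read as an illustration under my own naming and layout assumptions rather than as the author's implementation.

#include <mpi.h>
#include <stdlib.h>

/* From the earlier sketch: index of the centroid closest to point x. */
int nearest_centroid(const double *x, const double *centroids, int K, int M);

/* Illustrative sketch of one iteration of the modified (per-iteration update) k-means.
   local_x  : the n_local points owned by this rank (M doubles each, row major)
   assign   : current association of each local point (may start as -1)
   centroids: K*M doubles, identical on every rank on entry and on exit
   Returns 1 if any point on any rank changed its association, else 0. */
int kmeans_iteration(const double *local_x, int n_local, int M, int K,
                     int *assign, double *centroids, MPI_Comm comm)
{
    int buf_len = K * M + K + 1;   /* [K*M coordinate sums][K counts][1 change flag] */
    double *local  = calloc(buf_len, sizeof(double));
    double *global = malloc(buf_len * sizeof(double));

    for (int i = 0; i < n_local; i++) {
        int j = nearest_centroid(local_x + (size_t)i * M, centroids, K, M);
        if (j != assign[i]) { assign[i] = j; local[K * M + K] = 1.0; }
        for (int m = 0; m < M; m++)
            local[j * M + m] += local_x[(size_t)i * M + m];
        local[K * M + j] += 1.0;
    }

    /* A single all-reduction replaces the three separate ones described in the text;
       summing the change flags plays the role of the global OR. */
    MPI_Allreduce(local, global, buf_len, MPI_DOUBLE, MPI_SUM, comm);

    /* Equation (3): every rank recomputes the same new centroids.
       Empty clusters keep their old centroid to avoid a division by zero. */
    for (int j = 0; j < K; j++) {
        double Nj = global[K * M + j];
        if (Nj > 0.0)
            for (int m = 0; m < M; m++)
                centroids[j * M + m] = global[j * M + m] / Nj;
    }

    int changed = global[K * M + K] > 0.0;
    free(local);
    free(global);
    return changed;
}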

IV. ELECTROSTATIC DATA PARTITIONING

For the modified k-means algorithm described in section III to work, the initial centroids must be chosen such that they are as far away from each other as possible; in other words, the data set must be partitioned into K regions and a representative element must be chosen from each of these regions. A number of methods are described in the literature, one of which maximizes the sum of squared distances[5]. I use a slight modification of this method, which I call electrostatic data partitioning, which minimizes the sum of squared inverses of the distances to the previously selected centroids in order to find the current centroid. The algorithm starts by selecting the first centroid $c_0$ to be one of the data points in the sample space at random. Selection of the subsequent centroids $c_j;\ j = 1, 2, \dots, K-1$ is as follows:

$c_j = x_i$ such that $\sum_{k=0}^{j-1} \frac{1}{d_{ik}^2} < \sum_{k=0}^{j-1} \frac{1}{d_{lk}^2};\quad l = 0, 1, \dots, N-1,\ l \neq i$    (5)

The above equation, informally put, goes through each and every point $x_i$ in the sample space, finds the sum of the inverse squared distances to each of the previously selected centroids, and chooses the point for which this quantity is minimum. The parallelization of this procedure is achieved by making the processor whose rank is 0 select the first centroid and broadcast[4] it to the remaining processors. Each processor then elects a data point in its subset satisfying the above condition, and an all-reduction operation is performed in which the global minimum value is computed along with the winning processor containing the winning data point, which then broadcasts the selected centroid. The procedure is continued till all the centroids are selected, after which every processor holds the centroids which represent the partitioned sample space. The name electrostatic data partitioning is used as the procedure is similar to placing K equally charged particles such that the force on each particle from its neighboring particles is minimized: as the magnitude of the electrostatic force is proportional to the inverse square of the distance between two points, the magnitude of the total force exerted on a particle is proportional to the sum of the forces from all previously selected particles. This method was found to be superior to methods such as linear selection, where the initial centroids are selected to lie on the line formed by two points, one having the minimum of all the dimensions and the other having the maximum of all dimensions. The second method tried was to maximize the sum of distances, which failed to partition the data appropriately for some test data sets.
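A rough sketch of the parallel selection of one subsequent centroid is shown below, using MPI's built-in (value, rank) minimum-location reduction to find the globally winning point. The paper does not give its implementation, so the function name, data layout and the use of MPI_MINLOC here are my assumptions.

#include <mpi.h>
#include <float.h>
#include <stddef.h>

/* From the earlier sketch: squared distance between a point and a centroid. */
double dist_sq(const double *x, const double *c, int M);

/* Illustrative sketch of the parallel selection of centroid c_j, j >= 1 (equation (5)).
   Each rank scans its local points for the one with the smallest sum of inverse squared
   distances to the j previously chosen centroids; MPI_MINLOC picks the global winner,
   which then broadcasts the winning point. Assumes every rank owns at least one point
   and that there are no duplicate points (so the distances are non-zero). */
void select_centroid(const double *local_x, int n_local, int M,
                     const double *centroids, int j,
                     double *out_centroid, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    struct { double val; int rank; } local_best = { DBL_MAX, rank }, global_best;
    int best_i = -1;

    for (int i = 0; i < n_local; i++) {
        double s = 0.0;
        for (int k = 0; k < j; k++)
            s += 1.0 / dist_sq(local_x + (size_t)i * M, centroids + (size_t)k * M, M);
        if (s < local_best.val) { local_best.val = s; best_i = i; }
    }

    /* Global minimum value together with the rank that owns it. */
    MPI_Allreduce(&local_best, &global_best, 1, MPI_DOUBLE_INT, MPI_MINLOC, comm);

    if (rank == global_best.rank)
        for (int m = 0; m < M; m++)
            out_centroid[m] = local_x[(size_t)best_i * M + m];
    MPI_Bcast(out_centroid, M, MPI_DOUBLE, global_best.rank, comm);
}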

V. NODE MERGING

Let $K'$ be the number of logical clusters in the input data set, and assume $K > K'$. Then, after clustering the data set into K clusters, there will exist at least two clusters $c_i$ and $c_j$ such that $d_{ij} < r_i + r_j$, where $r_i$ and $r_j$ are the distances from $c_i$ and $c_j$ to their respective farthest associated points. If $K = K'$ there will be no such pair of centroids, and if $K < K'$ the result is unknown. For the rest of the text it is assumed that $K \geq K'$. The algorithm which makes use of this property is described below:

$r_i = d_{ik}$ such that $d_{ik} > d_{il};\quad l = 0, 1, 2, \dots, N_i - 1$    (6)

where $d_{ik}$ is the distance of the element $x_k$ from its centroid $c_i$. Two centroids are chosen to be merged if the following condition is satisfied:

$d_{ij} < r_i + r_j$    (7)
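A small sketch of how the radius computation of equation (6) and the merge test of equation (7) could be realized follows: each rank finds local per-cluster maximum distances, a max-reduction (described in the text below) collects the true radii on rank 0, and rank 0 reports mergeable pairs. Names and layout are illustrative assumptions.

#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* From the earlier sketch: squared distance between a point and a centroid. */
double dist_sq(const double *x, const double *c, int M);

/* Illustrative sketch of equations (6) and (7): each rank computes, for every cluster,
   the largest distance to one of its locally owned member points; a max-reduction
   yields the true radii on rank 0, which reports the cluster pairs that could merge. */
void report_mergeable(const double *local_x, const int *assign, int n_local,
                      const double *centroids, int K, int M, MPI_Comm comm)
{
    double *local_r = calloc(K, sizeof(double));
    double *r       = malloc(K * sizeof(double));

    for (int i = 0; i < n_local; i++) {
        int j = assign[i];
        double d = sqrt(dist_sq(local_x + (size_t)i * M, centroids + (size_t)j * M, M));
        if (d > local_r[j]) local_r[j] = d;
    }
    MPI_Reduce(local_r, r, K, MPI_DOUBLE, MPI_MAX, 0, comm);

    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0) {
        for (int i = 0; i < K; i++)
            for (int j = i + 1; j < K; j++) {
                double dij = sqrt(dist_sq(centroids + (size_t)i * M,
                                          centroids + (size_t)j * M, M));
                if (dij < r[i] + r[j])          /* equation (7) */
                    printf("clusters %d and %d can be merged\n", i, j);
            }
    }
    free(local_r);
    free(r);
}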
Fig. 1. Data flow diagram of the k-means clustering application

where $d_{ij}$ is the distance between centroids $c_i$ and $c_j$, while $r_i$ and $r_j$ are the radii of the clusters around $c_i$ and $c_j$ respectively. As there are times when the user will not want the nodes to be merged, the application just informs the user about the nodes which can be merged; from there it is up to the user to decide on the direction to take with the given information. The radii calculation is done in parallel: each node finds the radius as per equation 6 from its local data set, and then a reduction operation is performed to choose the maximum radius among all the processors for every centroid. Processor 0, chosen to be the root node, holds the correct values of the radii after the reduction operation. The node merging algorithm described has an execution time dependent on K << N; its parallelization would yield diminishing returns, and hence it is carried out serially on a single processor, the root node.

VI. PARALLELIZATION OVERVIEW

The data flow diagram for the entire application is shown in Figure 1. First, the application performs a parallel I/O operation to fetch the data set from a binary file, so that each processor has at least N/p data points, with the exception of the last node, which reads in the remaining data. Then the data is partitioned into K regions and a representative point is chosen from each of these regions to form the initial centroid set. The iterative k-means algorithm clusters the data till it converges to yield static clusters, where the data points no longer migrate from one cluster to another, or till a maximum number of iterations is reached. Finally, the radius of each cluster is computed for the node-merging algorithm. Each of these operations is performed in parallel by passing messages or performing global reduction operations, as discussed in the previous sections. Finally, node 0 performs the node merging algorithm and writes the results to another file.
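The parallel input step might look roughly like the following MPI-IO sketch. The binary layout (M consecutive doubles per point, no header) and the function name are assumptions made for illustration; the paper only states that a parallel read is performed and that the last rank takes the remainder.

#include <mpi.h>
#include <stdlib.h>

/* Illustrative sketch of the parallel read: the first p-1 ranks read n/p points each and
   the last rank reads the remainder, mirroring the distribution described in the text.
   Assumes the file holds n points of M consecutive doubles with no header. */
double *read_points(const char *path, long n, int M, long *n_local_out, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    long chunk   = n / p;
    long n_local = (rank == p - 1) ? n - chunk * (p - 1) : chunk;
    MPI_Offset offset = (MPI_Offset)rank * chunk * M * sizeof(double);

    double *buf = malloc((size_t)n_local * M * sizeof(double));

    MPI_File fh;
    MPI_File_open(comm, (char *)path, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_read_at(fh, offset, buf, (int)(n_local * M), MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    *n_local_out = n_local;
    return buf;
}

The write-back of the clustered result described next could use MPI_File_write_at with the same offset arithmetic.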

The data points are rendered by performing a parallel I/O operation in which the data is written to a binary file containing the data set along with the cluster association of each point.

VII. THEORETICAL PARALLELIZATION LIMITS

Mathematical models were constructed for each of the algorithms to compute the theoretically possible speedup. $T_{ak}^p$, $T_{ac}^p$ and $T_{ar}^p$ are the computation times with a processor count of $p$ for the k-means, initial centroid calculation and radius computation respectively, while $T_{ck}^p$, $T_{cc}^p$ and $T_{cr}^p$ are their communication times. $S_k^p$, $S_c^p$ and $S_r^p$ are the speedups of each of these algorithms. By observation, the following were derived, assuming that butterfly or tree algorithms are used for the broadcast, reduction and all-reduction operations.
$T_{ak}^p = 4k(m + 2) + \frac{4n(km + m + 2)}{p}$

$T_{ck}^p = \frac{2\ln(p)}{\ln(2)}\left(t_s + 4t_c(3 + k + 2km)\right)$

$T_k^p = T_{ak}^p + T_{ck}^p$    (8)

with

$a_k = 4k(m + 2),\quad b_k = 4n(km + m + 2),\quad c_k = \frac{2t_s + 8t_c(3 + k + 2km)}{\ln(2)}$

$S_k^p = \frac{T_{ak}^1}{T_{ak}^p + T_{ck}^p}$    (9)

$S_k^p = \frac{p(a_k + b_k)}{b_k + a_k p + c_k\, p\ln(p)}$    (10)

For the electrostatic data partitioning, $p_c^{max}$ was derived by finding the first derivative of $S_c^p$ with respect to $p$ and finding the value at which this quantity goes to zero. This is the theoretical limit of parallelization for a given data set, for which $n$, $m$ and $k$ are constant.
$T_{ac}^p = \frac{4nk(m + 1)(k - 1)}{p}$

$T_{cc}^p = \frac{k\ln(p)}{\ln(2)}\left(3t_s + 8t_c(4 + m)\right)$

$T_c^p = T_{ac}^p + T_{cc}^p$    (11)

with

$a_c = k\left(3t_s + 8t_c(4 + m)\right),\quad b_c = 4nk\ln(2)(m + 1)(k - 1)$

$S_c^p = \frac{T_{ac}^1}{T_{ac}^p + T_{cc}^p}$    (12)

$S_c^p = \frac{b_c\, p}{b_c + a_c\, p\ln(p)}$    (13)

$p_c^{max} = \frac{b_c}{a_c} = \frac{4n\ln(2)(m + 1)(k - 1)}{3t_s + 8t_c(4 + m)}$
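Assuming the simplified speedup form reconstructed in equation (13) is faithful to the original, the derivative condition mentioned above works out as:

$\frac{dS_c^p}{dp} = \frac{b_c\left(b_c + a_c\, p\ln(p)\right) - b_c\, p\left(a_c\ln(p) + a_c\right)}{\left(b_c + a_c\, p\ln(p)\right)^2} = \frac{b_c\left(b_c - a_c\, p\right)}{\left(b_c + a_c\, p\ln(p)\right)^2}$

which vanishes at $p = b_c/a_c$, the value of $p_c^{max}$ given above.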

Similarly, the same was derived for the radius computation as well. Here the theoretical limit on the maximum number of processors is $p_r^{max}$:
$T_{ar}^p = \frac{4n(m + 1)}{p}$

$T_{cr}^p = \frac{\ln(p)}{\ln(2)}\left(t_s + 16kt_c\right)$

$T_r^p = T_{ar}^p + T_{cr}^p$    (14)

with

$a_r = t_s + 16kt_c,\quad b_r = 4n\ln(2)(m + 1)$

$S_r^p = \frac{T_{ar}^1}{T_{ar}^p + T_{cr}^p}$    (15)

$S_r^p = \frac{b_r\, p}{b_r + a_r\, p\ln(p)}$    (16)

$p_r^{max} = \frac{b_r}{a_r} = \frac{4n\ln(2)(m + 1)}{t_s + 16kt_c}$

Fig. 2. Test image 1

The theoretical limit on the speedup can be given by $p^{max} = \min(p_k^{max}, p_c^{max}, p_r^{max})$, but this is not entirely accurate, as the k-means model is just for a single iteration; it can nevertheless be the dominating factor in determining the maximum number of processors for a given data set, beyond which increasing p provides diminishing returns.

VIII. TEST BED

In order to demonstrate the correctness of the algorithm, a few test data sets were chosen, which are described in this section. To allow visual validation, portable gray map images were converted into data sets by taking the x, y coordinates of the darker pixels as data points, making it possible to visualize the input data set as an image. Figure 2 and Figure 3 are two such images used to validate the correctness of the algorithm. The k-means clustering application was given values of K = 5, 6, 7, 10 for image 1 and K = 8, 9, 10, 20 for image 2; the clustering was found to work without glitches and the node merging was also observed to be effective.
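To illustrate how a portable gray map could be turned into such a data set, here is a small sketch that reads a plain-text (P2) PGM and writes the (x, y) coordinates of pixels darker than a threshold as binary doubles. The format variant, threshold convention and output layout are my assumptions; comment lines in the PGM header are not handled.

#include <stdio.h>

/* Illustrative sketch: read a plain-text (P2) PGM and emit the (x, y) coordinates of
   every pixel darker than `threshold` as a pair of binary doubles. */
int pgm_to_points(const char *pgm_path, int threshold, const char *out_path)
{
    FILE *in  = fopen(pgm_path, "r");
    FILE *out = fopen(out_path, "wb");
    if (!in || !out) return -1;

    int w, h, maxval;
    if (fscanf(in, "P2 %d %d %d", &w, &h, &maxval) != 3) return -1;

    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int v;
            if (fscanf(in, "%d", &v) != 1) return -1;
            if (v < threshold) {                 /* darker pixel becomes a data point */
                double pt[2] = { (double)x, (double)y };
                fwrite(pt, sizeof(double), 2, out);
            }
        }

    fclose(in);
    fclose(out);
    return 0;
}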

Fig. 3. Test image 2

The data sizes of these data sets ranged from 300,000 data points to 22,000,000 data points. Due to the large size of these test data sets, they were not run on a single processor; instead the tests started from a processor count of 4.

IX. RESULTS

The data sets described in section VIII were run on an IBM BG/L machine with processor counts ranging from 4 to 512, and the plots of the speedup of the 3 algorithms are provided in the figures below. As can be noticed from Figure 4, where the number of data points is much smaller than in Figure 5, the speedup is also affected, showing the scalability of these algorithms.

Fig. 4. Speedup achieved for test image 1

Fig. 5. Speedup achieved for test image 2

That is, even though the speedup in Figure 4 flattens at a processor count of around 256, increasing the data size by a factor of 100 gives the anticipated near-linear speedup. The numbers in the plots can be a bit misleading, as the speedup is not computed as $T^1/T^p$ but rather as $T^4/T^p$ because of memory issues; one can thus expect a scaling factor to be applied to the plots to express them as standard speedup. The experiments were conducted as sets of 5 runs, and their mean was used in computing the value of each point in Figures 4 and 5.

As expected, the electrostatic data partitioning, which involves an all-reduction and a broadcast for every iteration, exhibits poorer scalability compared to the other stages. From the mathematical models described in section VII, it is apparent that the speedup is directly dependent on the number of data points, but increasing the number of clusters K or the dimensionality M of the data will increase the communication cost and hence hamper the possible speedup.

X. CONCLUSION

The k-means algorithm was parallelized and its speed and scalability were observed. The experiment was performed on an IBM BG/L high performance computer and the resulting speedups were computed. The k-means algorithm depends largely on the initial centroid selection, and this is alleviated using electrostatic data partitioning. A concept called node merging has been introduced to provide the capability of arriving at $K'$, the actual number of clusters in the data set, from a higher value $K \geq K'$ which is initially provided by the user.

XI. FUTURE WORK

The k-means algorithm assumes that the clusters are spherical in shape. This limitation will be removed in the future by using a pair of points to represent an elliptical cluster, which is a generalization of the centroid concept. Vectorization, if supported by the architecture, would be a great performance gain for these algorithms and will also be incorporated. The algorithm with the node merging capability can be used to achieve hierarchical clustering by first clustering the data set into K clusters and then performing the same on each of the resulting clusters, till only clusters with a single element remain. We can finally achieve a tree description of the data set and allow the user to view the set at any resolution. I will also try to work on a node splitting algorithm which will allow the application to recognize a cluster which can logically be split into multiple clusters.

ACKNOWLEDGMENTS

Computer time and support were provided by NSF MRI Grant #CNS-0421498, NSF MRI Grant #CNS-0420873, NSF MRI Grant #CNS-0420985, NSF sponsorship of the National Center for Atmospheric Research, the University of Colorado, and a grant from the IBM Shared University Research (SUR) program.

REFERENCES
[1] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:281-297, 1967.
[2] Y.-M. Cheung, "k*-means: A new generalized k-means clustering algorithm," 2003.
[3] M. Matteucci, "A tutorial on clustering algorithms." [Online]. Available: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/index.html
[4] "Message Passing Interface," MPICH. [Online]. Available: http://www-unix.mcs.anl.gov/mpi/mpich/
[5] M. N. Joshi, "Parallel k-means algorithm on distributed memory multiprocessors," April 2003.
