
A K-means Algorithm with a Novel Non-Metric Distance

Mu-Chun Su and Chien-Hsing Chou
Department of Electrical Engineering, Tamkang University, Taiwan, R.O.C.
Email: muchun@ee.tku.edu.tw

Abstract
In this paper, we propose a new clustering algorithm. The proposed algorithm adopts a new non-metric measure based on the idea of symmetry. The detected clusters may have different geometrical structures. Three data sets are tested to illustrate the effectiveness of our proposed algorithm.

Keywords: K-means algorithm, Data Clustering, Pattern Recognition

1. Introduction

Cluster analysis is one of the basic tools for exploring the underlying structure of a given data set and is applied in a wide variety of engineering and scientific disciplines such as medicine, psychology, biology, sociology, pattern recognition, and image processing. The primary objective of cluster analysis is to partition a given data set of multidimensional vectors (patterns) into so-called homogeneous clusters such that patterns within a cluster are more similar to each other than patterns belonging to different clusters. Cluster seeking is very experiment-oriented in the sense that clustering algorithms that can deal with all situations are not yet available. Extensive and good overviews of clustering algorithms can be found in the literature [1]-[3]. Perhaps the best known and most widely used member of the family is the K-means algorithm, or the Isodata algorithm [4]. Lately, neural networks, for example competitive-learning networks [5], self-organizing feature maps [6], and adaptive resonance theory (ART) networks [7]-[8], have also often been used to cluster data. Each approach has its own merits and disadvantages.

While it is easy to consider the idea of a data cluster on a rather informal basis, it is very difficult to give a formal and universal definition of a cluster. In order to mathematically identify clusters in a data set, it is usually necessary to first define a measure of similarity or proximity which will establish a rule for assigning patterns to the domain of a particular cluster center. As is to be expected, the measure of similarity is problem dependent. The most popular similarity measure is the Euclidean distance: the smaller the distance, the greater the similarity. By using the Euclidean distance as a measure of similarity, hyperspherical-shaped clusters of equal size are usually detected. This measure is useless or even undesirable when clusters tend to develop along principal axes. To take care of hyperellipsoidal-shaped clusters, the Mahalanobis distance from x to m,

D(x, m) = (x - m)^T \Sigma^{-1} (x - m),

is one of the popular choices. The matrix \Sigma is the covariance matrix of the pattern population, m is the mean vector, and x represents an input pattern. One of the major difficulties associated with using the Mahalanobis distance as a similarity measure is that we have to recompute the inverse of the sample covariance matrix every time a pattern changes its cluster domain, which is computationally expensive.

Based on the above discussion, we propose a non-metric measure based on the concept of point symmetry. We intend to trade off flexibility in clustering data against computational complexity. By employing this measure, we are able to detect crossed, ring-shaped, or compact clusters.

The paper is organized as follows. In Section 2, we briefly present the idea of point symmetry and the proposed symmetrical distance. The clustering algorithm employing the symmetrical distance is discussed in Section 3. Several examples are used to demonstrate the effectiveness of the new measure. Section 4 presents the simulation results. Finally, Section 5 concludes the paper.

2. The symmetrical distance


Unless a meaningful measure of distance or proximity between pairs of objects has been established, no meaningful cluster analysis is possible. The most common proximity index is the Minkowski metric, which measures dissimilarity [1]. Given N patterns x_i = (x_{i1}, \ldots, x_{in})^T, i = 1, 2, \ldots, N, the Minkowski metric for measuring the dissimilarity between the jth and kth patterns is defined by

d(j, k) = \left( \sum_{i=1}^{n} | x_{ji} - x_{ki} |^r \right)^{1/r}    (1)

where r \geq 1. The Euclidean distance (r = 2) is one of the most common Minkowski distance metrics. By using the Euclidean distance, the conventional K-means algorithm tends to detect hyperspherical-shaped clusters.
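As a concrete illustration of Eq. (1), the short sketch below evaluates the Minkowski dissimilarity for two patterns. It is only an illustrative sketch, assuming NumPy is available; the function name minkowski_distance is ours, not from the paper.

import numpy as np

def minkowski_distance(x_j, x_k, r=2.0):
    # Eq. (1): d(j, k) = (sum_i |x_ji - x_ki|^r)^(1/r), with r >= 1.
    diff = np.abs(np.asarray(x_j, dtype=float) - np.asarray(x_k, dtype=float))
    return float(np.sum(diff ** r) ** (1.0 / r))

# r = 2 gives the Euclidean distance used by the conventional K-means algorithm,
# r = 1 gives the city-block distance.
print(minkowski_distance([2.0, 0.0], [0.0, 1.0], r=2.0))  # about 2.236
print(minkowski_distance([2.0, 0.0], [0.0, 1.0], r=1.0))  # 3.0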

Since clusters can be of arbitrary shapes and sizes, the Minkowski metrics do not seem a good choice for situations where no a priori information about the geometric characteristics of the data set to be clustered exists. Therefore, we have to find another, more flexible measure.

Looking around us, we get the immediate impression that almost every interesting object exhibits a qualitative and generalized form of symmetry. Symmetry is a powerful concept and its workings can be seen in many aspects of the world. For example, a sphere has the highest possible symmetry; no twist or turn is detectable. The common starfish has five planes of symmetry and a five-fold rotation axis. Such examples show how the laws of nature give symmetry to their products. Since symmetry is so common in the abstract and in nature, it is reasonable to assume that some kind of symmetry exists in the structure of clusters. Based on this idea, we will assign patterns to a cluster center if they present a symmetrical structure with respect to that center. The problem is how to find a measure of symmetry. A symmetry measure has been proposed by Reisfeld et al., who used the symmetry transform as a context-free attention operator [9]. In our opinion, their measure is more useful in image processing than in cluster analysis. In the K-means algorithm, the cluster centroids carry the most important information; therefore "point symmetry" with respect to a centroid is suitable for the K-means algorithm. Based on the above discussion, we propose a non-metric distance, the symmetrical distance, defined as follows.
Given N patterns x_i, i = 1, \ldots, N, and a reference vector c (e.g., a cluster center), the symmetrical distance between a pattern x_j and the reference vector c is defined as

d_s(x_j, c) = \min_{i \neq j} \frac{\| (x_j - c) + (x_i - c) \|}{\| x_j - c \| + \| x_i - c \|}    (2)

where the denominator is used to normalize the symmetrical distance so as to make it insensitive to the Euclidean distances \| x_j - c \| and \| x_i - c \|. If the right-hand side of Eq. (2) is minimized at x_i = x_{j^*}, then the pattern x_{j^*} is denoted as the symmetrical pattern relative to x_j with respect to c. Note that Eq. (2) attains its minimum value of zero when the mirror pattern 2c - x_j exists in the data set.

The idea of point symmetry is very simple and intuitive. It is instructive to observe the geometrical interpretation of the definition of the symmetrical distance; Fig. 1 gives the concept. For this case, we have four patterns x_1 = (2, 0)^T, x_2 = (-2, 0)^T, x_3 = (0, 1)^T, and x_4 = (1, -2)^T, and one reference vector c = (0, 0)^T. According to Eq. (2) we can easily compute

d_s(x_1, c) = \frac{\| (x_1 - c) + (x_2 - c) \|}{\| x_1 - c \| + \| x_2 - c \|} = \frac{0}{2 + 2} = 0,

d_s(x_2, c) = \frac{\| (x_2 - c) + (x_1 - c) \|}{\| x_2 - c \| + \| x_1 - c \|} = \frac{0}{2 + 2} = 0,

d_s(x_3, c) = \frac{\| (x_3 - c) + (x_4 - c) \|}{\| x_3 - c \| + \| x_4 - c \|} = \frac{\sqrt{2}}{1 + \sqrt{5}} \approx 0.437, and

d_s(x_4, c) = \frac{\| (x_4 - c) + (x_3 - c) \|}{\| x_4 - c \| + \| x_3 - c \|} = \frac{\sqrt{2}}{\sqrt{5} + 1} \approx 0.437.

Therefore, the patterns x_1 and x_2 are the most symmetrical pair relative to the reference vector c.

Fig. 1 An example of the symmetrical distance.
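The symmetrical distance of Eq. (2) is straightforward to compute directly from its definition. The following sketch is our own illustration, assuming NumPy; the function name symmetrical_distance is not from the paper. It reproduces the values obtained above for the four patterns of Fig. 1.

import numpy as np

def symmetrical_distance(j, patterns, c):
    # Eq. (2): d_s(x_j, c) = min_{i != j} ||(x_j - c) + (x_i - c)|| /
    #                                     (||x_j - c|| + ||x_i - c||)
    x_j = patterns[j] - c
    best = np.inf
    for i in range(len(patterns)):
        if i == j:
            continue
        x_i = patterns[i] - c
        d = np.linalg.norm(x_j + x_i) / (np.linalg.norm(x_j) + np.linalg.norm(x_i))
        best = min(best, d)
    return best

patterns = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [1.0, -2.0]])
c = np.array([0.0, 0.0])
for j in range(len(patterns)):
    print("d_s(x_%d, c) = %.3f" % (j + 1, symmetrical_distance(j, patterns, c)))
# prints 0.000 for x_1 and x_2, and 0.437 for x_3 and x_4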

3. The proposed clustering algorithm

We assign patterns to the clusters that are closest to them in the symmetrical sense. The algorithm is summarized as follows:

Step 1. Initialization: Randomly choose K data points from the data set to initialize the K cluster centroids.

Step 2. Coarse-tuning: Use the ordinary K-means algorithm with the Euclidean distance to update the cluster centroids. The cluster centroids converge to the coarse locations of the cluster centers. We then proceed to the fine-tuning procedure.

Step 3. Fine-tuning: For each pattern, find the cluster centroid nearest to it in the symmetrical sense. That is, find the cluster centroid k^* nearest to the input pattern x using the minimum-value criterion

k^* = \arg\min_{k = 1, \ldots, K} d_s(x(t), c_k).    (3)

If the symmetrical distance d_s(x, c_{k^*}) is smaller than a prespecified threshold, assign the data point x to the k^*th cluster. Otherwise, the data point is assigned to the cluster centroid nearest to it in the Euclidean sense.

Step 4. Updating: Compute the new centroids of the resulting clusters. The updating rule is given below:

c_k(t+1) = \frac{1}{N_k} \sum_{x_i \in S_k(t)} x_i    (4)

where S_k(t) is the set whose elements are the patterns assigned to the kth cluster in Step 3 at the tth iteration and N_k is the number of elements in S_k(t).

Step 5. Continuation: If no pattern changes cluster, stop. Otherwise, go to Step 3.
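To make Steps 1-5 concrete, the following sketch puts the whole procedure together. It is our own reading of the algorithm, not code from the paper: it assumes NumPy, reuses the symmetrical_distance function sketched after Fig. 1, and the names sym_kmeans, theta, coarse_iters, and max_iters are illustrative choices.

import numpy as np

def sym_kmeans(X, K, theta=0.15, coarse_iters=20, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    # Step 1. Initialization: K randomly chosen data points become the centroids.
    centroids = X[rng.choice(N, size=K, replace=False)].copy()

    # Step 2. Coarse-tuning: ordinary K-means with the Euclidean distance.
    for _ in range(coarse_iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)

    # Steps 3-5. Fine-tuning with the symmetrical distance of Eq. (2).
    for _ in range(max_iters):
        new_labels = np.empty(N, dtype=int)
        for j in range(N):
            ds = [symmetrical_distance(j, X, c_k) for c_k in centroids]  # Eq. (3)
            k_star = int(np.argmin(ds))
            if ds[k_star] < theta:      # symmetric enough: assign to cluster k*
                new_labels[j] = k_star
            else:                       # otherwise assign in the Euclidean sense
                new_labels[j] = int(np.linalg.norm(centroids - X[j], axis=1).argmin())
        if np.array_equal(new_labels, labels):   # Step 5: no pattern changed cluster
            break
        labels = new_labels
        for k in range(K):                       # Step 4: update centroids, Eq. (4)
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids

Because d_s is evaluated against every other pattern for every centroid, the fine-tuning step is quadratic in the number of patterns; this is the computational price of the added flexibility discussed in Section 5.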

4. Experimental Results
Three data sets are used to test the proposed clustering algorithm. The threshold was set to 0.15 and kept the same irrespective of the data set used.

Example 1: We generated a mixture of spherical and ellipsoidal clusters, as shown in Fig. 2(a). There is no clear border between the clusters. The total number of data points is 577. Following our clustering algorithm, we first clustered the data set using the ordinary K-means algorithm with the Euclidean distance. Fig. 2(b) shows the clustering result; notice that there are several misclassified data points. We then used the symmetrical distance as the dissimilarity measure and entered the fine-tuning procedure. The clustering result is given in Fig. 2(c). Obviously, the clustering performance was greatly improved.

Example 2: This data set contains 400 data points distributed over two linear clusters, as shown in Fig. 3(a). Fig. 3(b) shows the clustering result achieved by the ordinary K-means algorithm with the Euclidean distance. Fig. 3(c) illustrates that our algorithm works well for clusters with linear structure.

Example 3: This data set contains a combination of ring-shaped and rectangular compact clusters, as shown in Fig. 4(a). The total number of data points is 350. We use this data set to illustrate that the symmetrical distance can also be used to detect rectangular compact clusters. The clustering result achieved by the ordinary K-means algorithm with the Euclidean distance is shown in Fig. 4(b); the final clustering result is illustrated in Fig. 4(c).

Fig. 2(a) The data set contains a mixture of compact spherical and ellipsoidal clusters.

Fig. 2(b) The clustering result achieved by the K-means algorithm with the Euclidean distance.

Fig. 2(c) The final clustering result achieved by our clustering algorithm.

5. Conclusions
We have proposed a new K-means algorithm using the symmetrical distance. The proposed algorithm can be used to group a given data set into a set of clusters of different geometrical structures. The price paid for this flexibility in detecting clusters is an increase in computational complexity. Since the number of clusters is usually not known a priori for real data sets, we suggest using the method proposed in [10] to estimate the number of clusters and then applying our proposed algorithm to cluster the data.

Fig. 3(a) The data set contains two linear clusters.

Fig. 3(b) The clustering result achieved by the K-means algorithm with the Euclidean distance.

Fig. 3(c) The final clustering result achieved by our clustering algorithm.

Fig. 4(a) The data set contains a combination of ring-shaped and rectangular compact clusters.

Fig. 4(b) The clustering result achieved by the K-means algorithm with the Euclidean distance.

Fig. 4(c) The final clustering result achieved by our clustering algorithm.

References
[1] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, New Jersey, 1988.
[2] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, New York: Wiley, 1973.
[3] J. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, New York: Plenum, 1981.
[4] G. H. Ball and D. J. Hall, Some Fundamental Concepts and Synthesis Procedures for Pattern Recognition Preprocessors, in Proc. Int. Conf. on Microwaves, Circuit Theory, and Information Theory, Tokyo, Japan, pp. 281-297, Sep. 1964.
[5] T. Kohonen, The Neural Phonetic Typewriter, IEEE Computer, vol. 21, no. 3, pp. 11-22, 1988.
[6] T. Kohonen, Self-Organization and Associative Memory, 3rd ed., New York, Berlin: Springer-Verlag, 1989.
[7] G. A. Carpenter and S. Grossberg, A Massively Parallel Architecture for a Self-Organizing Neural Pattern Recognition Machine, Computer Vision, Graphics, and Image Processing, vol. 37, pp. 54-115, 1987.
[8] G. A. Carpenter and S. Grossberg, ART2: Self-Organization of Stable Category Recognition Codes for Analog Input Patterns, Applied Optics, vol. 26, no. 23, pp. 4919-4930, Dec. 1987.
[9] D. Reisfeld, H. Wolfson, and Y. Yeshurun, Context-Free Attentional Operators: The Generalized Symmetry Transform, International Journal of Computer Vision, vol. 14, pp. 119-130, 1995.
[10] M. C. Su, N. DeClaris, and T. K. Liu, Application of Neural Networks in Cluster Analysis, in Proc. IEEE Int. Conf. on Systems, Man, and Cybernetics, pp. 1-6, Orlando, USA, 1997.

