Mu-Chun Su and Chien-Hsing Chou
Department of Electrical Engineering, Tamkang University, Taiwan, R.O.C.
Email: muchun@ee.tku.edu.tw
Abstract
In this paper, we propose a new clustering algorithm. The proposed algorithm adopts a new non-metric measure based on the idea of symmetry, so the detected clusters may be a set of clusters of different geometrical structures. Three data sets are tested to illustrate the effectiveness of our proposed algorithm.

Keywords: K-means algorithm, Data Clustering, Pattern Recognition
hyperspherical-shaped clusters of equal size are usually detected. This measure is useless or even undesirable when clusters tend to develop along principal axes. To take care of hyperellipsoidal-shaped clusters, the Mahalanobis distance from x to m can be used.
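The excerpt breaks off before giving the formula; the standard Mahalanobis distance is d(x, m) = ((x − m)^T Σ^{-1} (x − m))^{1/2}, where Σ is the covariance matrix of the cluster. A minimal two-dimensional sketch (the function name and the hard-coded 2×2 inversion are ours, not from the paper):

```python
import math

def mahalanobis_2d(x, m, cov):
    """Mahalanobis distance between 2-D point x and mean m,
    given a 2x2 covariance matrix cov = [[a, b], [b, d]]."""
    a, b = cov[0]
    _, d = cov[1]
    det = a * d - b * b  # determinant of the covariance matrix
    # inverse of a symmetric 2x2 matrix
    inv = [[d / det, -b / det], [-b / det, a / det]]
    dx, dy = x[0] - m[0], x[1] - m[1]
    # quadratic form (x - m)^T cov^{-1} (x - m)
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(q)

# With the identity covariance the measure reduces to the Euclidean distance:
print(mahalanobis_2d((3.0, 4.0), (0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]]))  # → 5.0
```

With an elongated covariance such as [[4, 0], [0, 1]], points along the principal axis are counted as closer, which is exactly why this measure suits hyperellipsoidal clusters.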
1. Introduction
Cluster analysis is one of the basic tools for exploring the underlying structure of a given data set and is applied in a wide variety of engineering and scientific disciplines such as medicine, psychology, biology, sociology, pattern recognition, and image processing. The primary objective of cluster analysis is to partition a given data set of multidimensional vectors (patterns) into so-called homogeneous clusters such that patterns within a cluster are more similar to each other than patterns belonging to different clusters. Cluster seeking is very experiment-oriented in the sense that clustering algorithms that can deal with all situations are not yet available. Extensive overviews of clustering algorithms can be found in the literature [1]-[3]. Perhaps the best known and most widely used member of the family is the K-means algorithm, or the Isodata algorithm [4]. Recently, neural networks, for example competitive-learning networks [5], self-organizing feature maps [6], and adaptive resonance theory (ART) networks [7]-[8], have also often been used to cluster data. Each approach has its own merits and disadvantages.

While it is easy to consider the idea of a data cluster on a rather informal basis, it is very difficult to give a formal and universal definition of a cluster. In order to mathematically identify clusters in a data set, it is usually necessary to first define a measure of similarity or proximity which will establish a rule for assigning patterns to the domain of a particular cluster center. As is to be expected, the measure of similarity is problem dependent. The most popular similarity measure is the Euclidean distance: the smaller the distance, the greater the similarity. More generally, the distance between two n-dimensional patterns x_j and x_k can be measured by the Minkowski metric
d(x_j, x_k) = \left( \sum_{i=1}^{n} | x_{ji} - x_{ki} |^{r} \right)^{1/r}   (1)
where r ≥ 1. The Euclidean distance (r = 2) is one of the most common Minkowski distance metrics. By using the Euclidean distance, the conventional K-means algorithm tends to detect hyperspherical-shaped clusters. Since clusters can be of arbitrary shapes and sizes, the Minkowski metrics are not a good choice for situations where no a priori information about the geometric characteristics of the data set to be clustered exists. Therefore, we have to find another, more flexible measure.

Looking around us, we get the immediate impression that almost every interesting object exhibits some qualitative and generalized form of symmetry. Symmetry is a powerful concept and its workings can be seen in many aspects of the world. For example, a sphere has the highest possible symmetry: no twist or turn is detectable. The common starfish has five approximate planes of symmetry and a five-fold rotation axis. These examples show how the laws of nature give symmetry to their products. Since symmetry is so common in the abstract and in nature, it is reasonable to assume some kind of symmetry exists in the structures of clusters. Based on this idea, we will assign patterns to a cluster center if they present a symmetrical structure with respect to that center. The problem is how to find a measure of symmetry. A symmetry measure has been proposed by Reisfeld et al., who used the symmetry transform as a context-free attention operator [9]. In our opinion, their measure is useful in image processing rather than in cluster analysis. In the K-means algorithm, the cluster centroids represent the most important information; therefore "point symmetry" is suitable to be applied in the K-means algorithm. Based on the above discussion, we propose a non-metric symmetrical distance, defined as follows. Given N patterns
x_i, i = 1, …, N, and a reference vector c (e.g., a cluster centroid), the symmetrical distance between a pattern x_j and c is defined as

d_s(x_j, c) = \min_{i = 1, \ldots, N,\; i \neq j} \frac{\| (x_j - c) + (x_i - c) \|}{\| x_j - c \| + \| x_i - c \|}   (2)

where the denominator term is used to normalize the symmetrical distance so as to make it insensitive to the Euclidean distances \| x_j - c \| and \| x_i - c \|. If the right-hand side of Eq. (2) is minimized when x_i = x_{j^*}, then the pattern x_{j^*} is denoted as the symmetrical pattern of x_j with respect to c; in particular, d_s(x_j, c) = 0 when the mirror pattern 2c - x_j exists in the data set. The idea of point symmetry is very simple and intuitive, and it is instructive to observe the geometrical interpretation of the definition. Fig. 1 gives the concept. For this case, we have four patterns x_1 = (2, 0)^T, x_2 = (-2, 0)^T, x_3 = (0, 1)^T, and x_4 = (-1, -2)^T, and one reference vector c = (0, 0)^T. We compute

d_s(x_1, c) = \frac{\| (x_1 - c) + (x_2 - c) \|}{\| x_1 - c \| + \| x_2 - c \|} = \frac{0}{2 + 2} = 0,

d_s(x_2, c) = \frac{\| (x_2 - c) + (x_1 - c) \|}{\| x_2 - c \| + \| x_1 - c \|} = \frac{0}{2 + 2} = 0,

d_s(x_3, c) = \frac{\| (x_3 - c) + (x_4 - c) \|}{\| x_3 - c \| + \| x_4 - c \|} = \frac{\sqrt{2}}{1 + \sqrt{5}} \approx 0.437, and

d_s(x_4, c) = \frac{\| (x_4 - c) + (x_3 - c) \|}{\| x_4 - c \| + \| x_3 - c \|} = \frac{\sqrt{2}}{\sqrt{5} + 1} \approx 0.437.
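The symmetrical distance can be sketched in a few lines of Python. The helper below (function and variable names are ours) reproduces the values of the worked example; the coordinates of x_4 are taken as (−1, −2)^T, the value implied by the printed ratio √2 / (1 + √5):

```python
import math

def sym_distance(j, patterns, c):
    """Point-symmetry distance d_s(x_j, c) of Eq. (2): search all other
    patterns for the best mirror image of x_j about the reference vector c."""
    xj = patterns[j]
    best = float("inf")
    for i, xi in enumerate(patterns):
        if i == j:
            continue
        # numerator: || (x_j - c) + (x_i - c) ||
        num = math.hypot((xj[0] - c[0]) + (xi[0] - c[0]),
                         (xj[1] - c[1]) + (xi[1] - c[1]))
        # denominator: || x_j - c || + || x_i - c ||
        den = (math.hypot(xj[0] - c[0], xj[1] - c[1]) +
               math.hypot(xi[0] - c[0], xi[1] - c[1]))
        best = min(best, num / den)
    return best

# The four patterns of the worked example, with c at the origin:
X = [(2.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (-1.0, -2.0)]
c = (0.0, 0.0)
print(round(sym_distance(0, X, c), 3))  # → 0.0   (x_2 is the exact mirror of x_1)
print(round(sym_distance(2, X, c), 3))  # → 0.437 (x_4 only approximately mirrors x_3)
```

Note that the measure is zero exactly when a perfect mirror partner exists, and stays small for approximate symmetry, which is what the fine-tuning step exploits.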
If the symmetrical distance d_s(x, c_{k^*}) is smaller than a prespecified parameter, then assign the data point x to the k*th cluster. Otherwise, the data point is assigned to the cluster whose centroid is nearest to it in the Euclidean sense.

Step 4. Updating: Compute the new centroids of the resulting clusters. The updating rule is given below:

c_k(t + 1) = \frac{1}{N_k} \sum_{x_i \in S_k(t)} x_i   (4)

where S_k(t) is the set whose elements are the patterns assigned to the kth cluster in Step 3 at the tth iteration, and N_k is the number of elements in S_k(t).

Step 5. Continuation: If no pattern changes cluster, then stop. Otherwise, go to Step 3.
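The fine-tuning loop of Steps 3-5 can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the excerpt does not show how the candidate cluster k* is selected, so we assume it minimizes the symmetrical distance over all centroids, and the threshold defaults to the 0.15 used in the experiments.

```python
import math

def euclid(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def sym_dist(x, c, patterns):
    """Point-symmetry distance of Eq. (2) between point x and centroid c."""
    best = float("inf")
    for y in patterns:
        if y is x:
            continue
        num = math.hypot((x[0] - c[0]) + (y[0] - c[0]),
                         (x[1] - c[1]) + (y[1] - c[1]))
        den = euclid(x, c) + euclid(y, c)
        if den > 0.0:
            best = min(best, num / den)
    return best

def symmetry_kmeans(patterns, centroids, theta=0.15, max_iter=50):
    """Steps 3-5: assign each pattern by point symmetry when d_s < theta,
    otherwise by Euclidean distance; recompute centroids as in Eq. (4);
    stop when no pattern changes cluster."""
    assign = [None] * len(patterns)
    for _ in range(max_iter):
        changed = False
        for idx, x in enumerate(patterns):
            # Step 3: candidate cluster k* (assumed: minimizer of d_s)
            k_star = min(range(len(centroids)),
                         key=lambda k: sym_dist(x, centroids[k], patterns))
            if sym_dist(x, centroids[k_star], patterns) < theta:
                k = k_star
            else:
                # fall back to the nearest centroid in the Euclidean sense
                k = min(range(len(centroids)),
                        key=lambda k: euclid(x, centroids[k]))
            if assign[idx] != k:
                assign[idx], changed = k, True
        # Step 4: Eq. (4), mean of the patterns assigned to each cluster
        for k in range(len(centroids)):
            members = [p for p, a in zip(patterns, assign) if a == k]
            if members:
                centroids[k] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
        if not changed:  # Step 5: no pattern changed cluster
            break
    return assign, centroids

# Two point-symmetric blobs (hypothetical data, not the paper's data sets):
A = [(-1.0, 0.0), (1.0, 0.0), (0.0, -1.0), (0.0, 1.0)]    # around (0, 0)
B = [(9.0, 0.0), (11.0, 0.0), (10.0, -1.0), (10.0, 1.0)]  # around (10, 0)
labels, centers = symmetry_kmeans(A + B, [(0.2, 0.1), (9.8, -0.1)])
print(labels)  # → [0, 0, 0, 0, 1, 1, 1, 1]
```

In practice the paper first runs the ordinary Euclidean K-means (Steps 1-2, not shown in this excerpt) and only then enters this symmetry-based fine-tuning.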
4. Experimental Results
Three data sets are used to test the proposed clustering algorithm. The parameter was set to 0.15 and kept the same irrespective of the data set used.

Example 1: We generated a mixture of spherical and ellipsoidal clusters, as shown in Fig. 2(a). There is no clear border between the clusters. The total number of data points is 577. Following our clustering algorithm, we first clustered the data set using the ordinary K-means algorithm with the Euclidean distance. Fig. 2(b) shows the clustering result; notice that there are several misclassified data points. We then used the symmetrical distance as the dissimilarity measure and entered the fine-tuning procedure. The clustering result is given in Fig. 2(c). Obviously, the clustering performance was greatly improved.

Example 2: This data set contains 400 data points distributed over two linear clusters, as shown in Fig. 3(a). Fig. 3(b) shows the clustering result achieved by the ordinary K-means algorithm with the Euclidean distance. Fig. 3(c) illustrates that our algorithm works well for clusters with linear structure.

Example 3: This data set contains a combination of ring-shaped and rectangular compact clusters, as shown in Fig. 4(a). The total number of data points is 350. We use this data set to illustrate that the symmetrical distance can also be used to detect rectangular compact clusters. The clustering result achieved by the ordinary K-means algorithm with the Euclidean distance is shown in Fig. 4(b); the final clustering result is illustrated in Fig. 4(c).

Fig. 2(a) The data set consists of a mixture of compact spherical and ellipsoidal clusters.
Fig. 2(b) The clustering result achieved by the K-means algorithm with the Euclidean distance.
Fig. 2(c) The final clustering result achieved by our clustering algorithm.
5. Conclusions
We have proposed a new K-means algorithm using the symmetrical distance. The proposed algorithm can be used to group a given data set into clusters of different geometrical structures. The price paid for this flexibility in detecting clusters is increased computational complexity. Since the number of clusters is usually not known a priori for real data sets, we suggest using the method proposed in [10] to estimate it and then applying our algorithm.
Fig. 3(b) The clustering result achieved by the K-means algorithm with the Euclidean distance.
Fig. 4(c) The final clustering result achieved by our clustering algorithm.
References
[1] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall, New Jersey, 1988.
[2] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, New York: Wiley, 1973.
[3] J. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, New York: Plenum, 1981.
[4] G. H. Ball and D. J. Hall, Some Fundamental Concepts and Synthesis Procedures for Pattern Recognition Preprocessors, in Proc. of Int. Conf. on Microwaves, Circuit Theory, and Information Theory, Tokyo, Japan, pp. 281-297, Sep. 1964.
[5] T. Kohonen, The "Neural" Phonetic Typewriter, IEEE Computer, vol. 21, no. 3, pp. 11-22, 1988.
[6] T. Kohonen, Self-Organization and Associative Memory, 3rd ed., New York, Berlin: Springer-Verlag, 1989.
[7] G. A. Carpenter and S. Grossberg, A Massively Parallel Architecture for a Self-Organizing Neural Pattern Recognition Machine, Computer Vision, Graphics, and Image Processing, vol. 37, pp. 54-115, 1987.
[8] G. A. Carpenter and S. Grossberg, ART2: Self-Organization of Stable Category Recognition Codes for Analog Input Patterns, Applied Optics, vol. 26, no. 23, pp. 4919-4930, Dec. 1987.
[9] D. Reisfeld, H. Wolfson, and Y. Yeshurun, Context-Free Attentional Operators: The Generalized Symmetry Transform, International Journal of Computer Vision, vol. 14, pp. 119-130, 1995.
[10] M. C. Su, N. DeClaris, and T. K. Liu, Application of Neural Networks in Cluster Analysis, in Proc. IEEE Int. Conf. on Systems, Man, and Cybernetics, pp. 1-6, Orlando, USA, 1997.
Fig. 3(c) The final clustering result achieved by our clustering algorithm.
Fig. 4(a) The data set contains a combination of ring-shaped and rectangular compact clusters.
Fig. 4(b) The clustering result achieved by the K-means algorithm with the Euclidean distance.