Summary: A problem common to all clustering techniques is the difficulty of deciding the
number of clusters present in the data. The aim of this paper is to compare three methods
based on the hypervolume criterion with four other well-known methods. This evaluation
of procedures for determining the number of clusters is conducted on artificial data sets.
To provide a variety of solutions, the data sets are analysed by six clustering methods.
We conclude by pointing out the performance of each method and by giving some
guidance for choosing among them.
So D' is a dilation of the convex hull about its centroid. In practice it is difficult to
compute the value of the constant of dilation c given by Ripley and Rasson (Ripley
and Rasson (1977)), but this constant can be estimated by ĉ = √(n/(n − V)), where V is
the number of vertices of the convex hull H(D) (Moore (1984)).
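As a minimal sketch of this estimate, assuming the reconstructed form ĉ = √(n/(n − V)) with V the number of vertices of the convex hull of the n points, the constant could be computed as follows (the hull routine is a standard monotone-chain implementation, not taken from the paper):

```python
import math

def convex_hull(points):
    """Andrew's monotone-chain convex hull; returns the hull vertices."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(seq):
        out = []
        for p in seq:
            # pop while the turn is clockwise or collinear, so collinear
            # boundary points are not counted as hull vertices
            while len(out) >= 2 and cross(out[-2], out[-1], p) <= 0:
                out.pop()
            out.append(p)
        return out[:-1]

    return half(pts) + half(pts[::-1])

def dilation_constant(points):
    """Estimate c = sqrt(n / (n - V)), with V the number of hull vertices."""
    n, v = len(points), len(convex_hull(points))
    return math.sqrt(n / (n - v))
```

For a 5 × 5 grid of points, for instance, the hull has V = 4 vertices and ĉ = √(25/21) ≈ 1.09, the factor by which the hull would be dilated about its centroid.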
The realization of a Poisson process within the union of k subsets D_1, D_2, ..., D_k can
be considered as the realization of k Poisson processes of the same intensity within
the k subsets D_1, D_2, ..., D_k (Neveu (1974)).
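This decomposition can be illustrated by a small simulation sketch: a homogeneous Poisson process on a union of disjoint regions is built as independent same-intensity processes, one per region (rectangles are used here purely for simplicity; the helpers are illustrative, not from the paper):

```python
import math
import random

def poisson_sample(lam, rng):
    """Knuth's method for one Poisson(lam) draw, lam > 0."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def poisson_on_rectangles(rects, intensity, seed=0):
    """Homogeneous Poisson process on a union of disjoint rectangles,
    realised as independent processes of the same intensity on each one."""
    rng = random.Random(seed)
    points = []
    for (x0, y0, x1, y1) in rects:
        area = (x1 - x0) * (y1 - y0)
        # number of points in this rectangle is Poisson(intensity * area)
        for _ in range(poisson_sample(intensity * area, rng)):
            points.append((rng.uniform(x0, x1), rng.uniform(y0, y1)))
    return points
```

Each rectangle receives a Poisson number of points proportional to its area, and the points are uniform within it, which is exactly the restriction property the statement describes.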
Let us denote by P* = {C*_1, C*_2, ..., C*_k} the optimal partition of E into k clusters
and by D̂_i the estimate of the convex compact set D_i containing C*_i. So D̂_i is the
dilated version of the convex hull of C*_i.
We propose the following decision rule for estimating k, checking for t = 2, 3, ...:
• if D̂_1^2 ∩ D̂_2^2 ≠ ∅, then we conclude that there is no clustering structure in the
data;
• if, for all {i, j} ⊂ {1, 2, ..., t}, i ≠ j: D̂_i^t ∩ D̂_j^t = ∅, and
if, for any integer s with 2 ≤ s < t and for all {i, j} ⊂ {1, 2, ..., s}, i ≠ j:
D̂_i^s ∩ D̂_j^s = ∅, then we conclude that the natural partition contains at least t
clusters and we examine the partition into (t + 1) clusters;
• if there exists {i, j} ⊂ {1, 2, ..., t}, i ≠ j: D̂_i^t ∩ D̂_j^t ≠ ∅, and
if, for any integer s with 2 ≤ s < t and for all {i, j} ⊂ {1, 2, ..., s}, i ≠ j:
D̂_i^s ∩ D̂_j^s = ∅, then we conclude that the data set contains exactly t − 1 natural
clusters; so we have k = t − 1.
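The sequential rule above can be sketched as a loop; the helpers `dilated_sets(data, t)` (returning the estimates D̂_1^t, ..., D̂_t^t for the t-cluster partition) and `disjoint(a, b)` (testing that two dilated sets do not intersect) are hypothetical placeholders for the actual estimation and intersection machinery:

```python
from itertools import combinations

def estimate_k(data, dilated_sets, disjoint, t_max=10):
    """Sketch of the sequential decision rule: stop at the first t whose
    dilated sets overlap; the previous t is the estimated number k."""
    def all_disjoint(t):
        return all(disjoint(a, b)
                   for a, b in combinations(dilated_sets(data, t), 2))

    if not all_disjoint(2):
        return 1  # the two dilated sets meet: no clustering structure
    t = 2
    while t < t_max and all_disjoint(t + 1):
        t += 1    # still pairwise disjoint: at least t clusters, go on
    return t      # first overlap occurs for t + 1 clusters, so k = t
```

For instance, if the dilated sets are pairwise disjoint for the partitions into 2 and 3 clusters but overlap for 4, the loop returns k = 3.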
Q(k) = W(P, k) / W(P, k − 1) > C.
4. Results
In order to compare the methods for determining the appropriate number of clusters,
we have chosen six well-known clustering procedures (nearest neighbour, furthest
neighbour, centroid, Ward, k-means and hypervolume) and four artificial data sets
in R^m (well separated clusters, one single homogeneous group, elongated clusters
and the well-known Ruspini data). We have then applied the seven methods for
determining the best number of clusters to the results so obtained.
[Figure 1: first data set, three well-separated clusters]
The first column of the table lists the names of the clustering procedures.
Here we have a very good clustering structure: for three clusters all six clustering
procedures reveal the true classification. This is expressed by a "+" in the second
column of Table 1. The seven last columns show the results given by the seven
methods for the determination of the optimal number of clusters.
For example, the application of the upper tail rule (M5) to the results given by the
furthest neighbour clustering procedure leads to the conclusion that there are two
clusters in the data of Figure 1.
Here the graphical elbow method (M1) gives the correct result for most of our
clustering procedures; this is implied by the fact that the three clusters are disjoint
and well separated. The two methods based on the hypervolume criterion (M2 and M3)
as well as the moving average control rule (M6) and Wolfe's test (M4) also give the
expected result. Let us also mention that Marriott's test (M7) performs well here;
the curve k² det(W) has its minimum value for k* = 3.
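Marriott's criterion can be sketched as follows, assuming its usual form k² · det(W), where W is the pooled within-group scatter matrix of the partition (an illustrative 2-D reconstruction, not code from the paper):

```python
def marriott_score(points, labels, k):
    """Marriott-style score k^2 * det(W) for a partition of 2-D points,
    where W is the pooled within-group scatter matrix."""
    W = [[0.0, 0.0], [0.0, 0.0]]
    for c in set(labels):
        cluster = [p for p, lab in zip(points, labels) if lab == c]
        mx = sum(x for x, _ in cluster) / len(cluster)
        my = sum(y for _, y in cluster) / len(cluster)
        for x, y in cluster:
            dx, dy = x - mx, y - my
            # accumulate the outer product (dx, dy)(dx, dy)^T into W
            W[0][0] += dx * dx
            W[0][1] += dx * dy
            W[1][0] += dy * dx
            W[1][1] += dy * dy
    det_w = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    return k * k * det_w
```

On two well-separated groups, the score for the correct 2-cluster labelling is far smaller than for the single-cluster labelling, which is what makes the minimum of the curve a useful indicator.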
The sign "-" in one of the seven last columns of a table means that the basic
assumptions of the method are not fulfilled. For example, methods M2 and M3 are
based on the hypervolume criterion, so they are only applicable to the results given
by the hypervolume clustering procedure. M5 and M6 are valid only for hierarchical
clustering methods. The results of Table 1 show that we must be very careful; even
with the simplest examples, problems can arise: not all the methods retrieve the
expected number of clusters.
4.2 Second set of data: one single homogeneous group
[Figure 2: second data set, one single homogeneous group]
|| One homogeneous group || M1 | M2 | M3 | M4 | M5 | M6 | M7 ||
nearest neighbour 1 - - 1 2 2 X
furthest neighbour * - - 2 2 4 X
centroid * - - 2 3 3 X
Ward 4 - - 3 2 4 X
K-means 4 - - 2 - - X
hypervolume 1 1 1 - - - X
Tab. 2: One single homogeneous group.
Results of Table 2 show that the three methods based on the hypervolume clustering
criterion are very efficient when the problem is to test whether there is any grouping
structure in a data set; in this case the hypervolume criterion decreases monotonically
with k. The other methods are not very efficient. Furthermore, Marriott's test (M7)
is not applicable here since it is only valid when the number of clusters is greater
than or equal to two. This is expressed by an "X" in the eighth column of Table 2.
A "*" in the table means that the results given by a method are not clear enough
to conclude.
4.3 Third set of data: elongated clusters
[Figure 3: third data set, elongated clusters]
|| Elongated clusters || || M1 | M2 | M3 | M4 | M5 | M6 | M7 ||
nearest neighbour + 3 - - 3 2 3 10
furthest neighbour - * - - 3 2 2 10
centroid - * - - 3 2 2 10
Ward - 3 - - 3 2 3 10
K-means - 3 - - 3 - - 10
hypervolume + 3 * 3 - - - 10
Tab. 3: Elongated clusters.
Only the nearest neighbour and the hypervolume clustering methods reproduce the
"natural" classification when we fix the number of clusters to three. Let us remark
that Wolfe's test (M4) yields the correct number of clusters when applied to the
results of the furthest neighbour, centroid, Ward and k-means procedures ... but the
classifications obtained by these clustering methods are not the "natural" ones.
4.4 Fourth set of data: the Ruspini data
[Figure 4: the Ruspini data]
This data set comprises 75 points in the plane and is often used to test clustering
procedures. Usually one recognizes four "natural" clusters.
Here all six clustering procedures yield the "natural" classification of the data
when we fix the number of clusters to four.
Methods M1, M2, M3, M4 and M6 perform well. Method M5 fails to give the
correct result.
5. Conclusions
We have investigated the problem of deciding on the number of clusters present in
a data set, and in particular compared three selection methods based on the
hypervolume criterion with four other well-known methods.
It appears that there is a serious difficulty when we want to choose a good clustering
procedure: some methods give the appropriate number of clusters, but based on a
"bad" classification. So we have to take into account the a priori information we
have on the clustering methods and the underlying hypotheses of each of them: the
chaining effect of the nearest neighbour method; the furthest neighbour, centroid,
Ward and k-means methods favour spherical clusters; the hypervolume method is
based on a hypothesis of convexity of the clusters, ...
However, all three methods based on the hypervolume criterion perform well.
For the graphical elbow method (M1) the decision on whether such plots contain the
necessary "discontinuity in slope" is likely to be very subjective. Nevertheless M1
yields the most appropriate and clear results when we plot the hypervolume criterion
against the number of groups.
Method M2 usually works very well, but it may fail to obtain good results in the
presence of elongated clusters containing a small number of points. It would be
interesting and useful to find a more appropriate value for the constant of dilation c
when we use it to detect the optimal number of clusters.
The likelihood ratio test (M3) seems to be one of the most interesting methods for
obtaining the appropriate number of natural clusters. Unfortunately the distribution
of the test statistic is not known, but in practice it gives clear and relevant results.
Looking at all the results we have obtained, we recommend using several cluster
analysis techniques simultaneously, together with different methods for determining
the optimal number of clusters, and analysing all the results in order to gain more
information about the clusters: size, shape, convexity, tightness, separation, ...
That information can then be used to choose the "best" classification and to
interpret it carefully.
References:
ANDERBERG, M.R. (1973): Cluster Analysis for Applications. Academic Press, New York.
BOCK, H.H. (1985): On some significance tests in cluster analysis. Journal of Classification, 2, 77-108.
DIDAY, E. et Collaborateurs (1979): Optimisation en Classification Automatique. INRIA, Paris.
EVERITT, B. (1980): Cluster Analysis. Halsted Press, London.
GORDON, A.D. (1981): Classification. Chapman and Hall, London.
HARDY, A., and RASSON, J.P. (1982): Une nouvelle approche des problèmes de classification automatique. Statistique et Analyse des Données, 7, 41-56.
HARDY, A. (1983): Une nouvelle approche des problèmes de classification automatique. Un modèle - Un nouveau critère - Des algorithmes - Des applications. Ph.D. Thesis, F.U.N.D.P., Namur, Belgium.
HARDY, A. (1993): Criteria for determining the number of groups in a data set based on the hypervolume criterion. Technical report, F.U.N.D.P., Namur, Belgium.
MOORE, M. (1984): On the estimation of a convex set. The Annals of Statistics, 12, 1090-1099.
NEVEU, J. (1974): Processus ponctuels. Technical report, Laboratoire de Calcul des Probabilités, Université Paris VI.
RIPLEY, B.D., and RASSON, J.P. (1977): Finding the edge of a Poisson forest. Journal of Applied Probability, 14, 483-491.
WISHART, D. (1978): CLUSTAN User Manual, 3rd ed., Program Library Unit, University of Edinburgh.