

An examination of procedures for determining the number of clusters in a data set
André HARDY
Unité de Statistique, Département de Mathématique,
Facultés Universitaires N.-D. de la Paix, 8 Rempart de la Vierge, B-5000 Namur, Belgium
& Facultés Universitaires Saint-Louis, 43 Boulevard du Jardin Botanique, B-1000 Bruxelles, Belgium

Summary: A problem common to all clustering techniques is the difficulty of deciding the
number of clusters present in the data. The aim of this paper is to compare three methods
based on the hypervolume criterion with four other well-known methods. This evaluation
of procedures for determining the number of clusters is conducted on artificial data sets.
To provide a variety of solutions the data sets are analysed by six clustering methods.
We finally conclude by pointing out the performance of each method and by giving some
guidance for making choices between them.

1. The clustering problem


The basic data for cluster analysis is a set of n entities E = {x1, x2, ..., xn}, on
which the values of m measurements have been recorded. We suppose here that we
have m quantitative variables. We want to find a partition of the set of objects into
k clusters; k is supposed fixed. Let Pk denote the set of all partitions of E into k
clusters.
For this problem to be mathematically well-defined, we associate, to each P in Pk, the
value of a clustering criterion W(P, k) which measures the quality of each partition
into k clusters.
The problem is then to find the partition P* that maximizes or minimizes the criterion
W(P, k) over all partitions into k clusters.
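For a very small data set, this formal problem can be solved by exhaustive search over Pk. The sketch below is purely illustrative, not part of the paper: it enumerates all partitions into k clusters and uses the within-cluster sum of squares as a stand-in criterion W (the hypervolume criterion of section 2 plays this role in the paper); `partitions_into_k` and `wcss` are hypothetical helpers.

```python
def partitions_into_k(items, k):
    """Enumerate all partitions of `items` into exactly k non-empty clusters."""
    if k == 1:
        yield [list(items)]
        return
    if len(items) == k:
        yield [[x] for x in items]
        return
    if len(items) < k:
        return
    first, rest = items[0], items[1:]
    # The first element forms its own cluster ...
    for p in partitions_into_k(rest, k - 1):
        yield [[first]] + p
    # ... or joins one of the clusters of a partition of the remaining items.
    for p in partitions_into_k(rest, k):
        for i in range(len(p)):
            yield p[:i] + [[first] + p[i]] + p[i + 1:]

def wcss(partition):
    """Stand-in criterion W(P, k): within-cluster sum of squared deviations."""
    total = 0.0
    for c in partition:
        mean = sum(c) / len(c)
        total += sum((x - mean) ** 2 for x in c)
    return total

points = [0.0, 0.1, 5.0, 5.2, 9.0]
best = min(partitions_into_k(points, 2), key=wcss)  # [[0.0, 0.1], [5.0, 5.2, 9.0]]
```

Exhaustive enumeration grows as the Stirling numbers of the second kind, which is why practical clustering procedures only approximate the optimal partition.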

2. The hypervolume method and the hypervolume criterion


We assume a clustering model in which the observed points are a realization of a Poisson
process in a set D of R^m, where D is the union of k disjoint convex domains
D1, D2, ..., Dk; Ci ⊂ {x1, x2, ..., xn} is the subset of observations belonging to
Di (1 ≤ i ≤ k). The problem is to estimate the unknown domains Di in which the
points are distributed.
The maximum likelihood estimate of the k subsets D1, D2, ..., Dk is constituted
by the k subgroups Ci of points such that the sum of the Lebesgue measures of their
disjoint convex hulls H(Ci) is minimum (1 ≤ i ≤ k) (Hardy and Rasson (1982);
Hardy (1983)).
So the hypervolume criterion can be written as follows:

W : Pk → R+ : P = {C1, C2, ..., Ck} ↦ W(P, k) = Σ_{i=1}^{k} m(H(Ci)),

where H(Ci) is the convex hull of the points belonging to Ci and m(H(Ci)) is the
m-dimensional Lebesgue measure of that convex hull.
E. Diday et al. (eds.), New Approaches in Classification and Data Analysis
© Springer-Verlag Berlin Heidelberg 1994
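As an illustration only (not the authors' implementation), the criterion can be evaluated for two-dimensional data with SciPy, whose `ConvexHull.volume` attribute gives the area of a planar hull:

```python
import numpy as np
from scipy.spatial import ConvexHull

def hypervolume_criterion(clusters):
    """W(P, k): sum of the Lebesgue measures (areas, in 2-D) of the
    convex hulls H(Ci) of the clusters Ci."""
    return sum(ConvexHull(np.asarray(c, dtype=float)).volume for c in clusters)

# Two unit squares far apart: each hull has area 1, so W = 2.
c1 = [(0, 0), (1, 0), (1, 1), (0, 1)]
c2 = [(10, 0), (11, 0), (11, 1), (10, 1)]
w = hypervolume_criterion([c1, c2])  # 2.0
```

Note that each cluster must contain enough points in general position for its convex hull to be well defined (at least m + 1 points in R^m).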

3. Methods to determine the number of clusters


The first three methods are based on the hypervolume criterion (Hardy (1993)).
3.1 A classical geometric method (M1)
This well-known method consists in plotting the value of a clustering criterion W
against k, the number of clusters. With every increase in k there will be a decrease
in W. A discontinuity in slope should correspond to the true number of "natural"
clusters. As stated by Gordon (1981) this procedure can be unreliable; some clustering
criteria can show large changes when analysing unstructured data. We show that this
method, associated with the hypervolume criterion, gives interesting results.
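One simple way to automate the "discontinuity in slope" is to pick the k maximizing the second difference of the W-versus-k curve; this heuristic is our own sketch, not a rule from the paper:

```python
def elbow_k(W):
    """Given a dict k -> W(P*, k), return the interior k at the sharpest
    elbow, i.e. the k maximizing the second difference
    W(k-1) - 2*W(k) + W(k+1)."""
    ks = sorted(W)
    return max(ks[1:-1], key=lambda k: W[k - 1] - 2 * W[k] + W[k + 1])

# Hypothetical criterion values that flatten out after k = 3 clusters:
curve = {1: 100.0, 2: 60.0, 3: 10.0, 4: 9.0, 5: 8.0}
k_hat = elbow_k(curve)  # 3
```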
3.2 Method based on the estimation of a eonvex set (M2)
This method is based on the following problem: "Given a realization of a homogeneous
planar Poisson process of unknown intensity within a compact convex set D, find D".
The estimate of Dis given by D' = g(H(D)) + c· s(H(D)) where
• H(D) is the convex hull of the points belonging to D;
• g(H(D)) is the centroid of H(D);
• s(H(D)) = H(D) - g(H(D)).

So D' is a dilation of the convex hull about its centroid. It is difficult in
practice to compute the value of the dilation constant c given by Ripley and Rasson
(1977), but this constant can be estimated by c = √(n / (n − V)), where V is
the number of vertices of the convex hull H(D) (Moore (1984)).
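A minimal sketch of this dilation, assuming the estimate c = √(n / (n − V)) and using the mean of the hull vertices as a simple proxy for the hull centroid (an assumption of this sketch, not of the paper):

```python
import numpy as np
from scipy.spatial import ConvexHull

def dilated_hull(points):
    """Dilate the convex hull of `points` about a centroid estimate by
    c = sqrt(n / (n - V)), with V the number of hull vertices.
    The vertex mean is used as a simple proxy for the hull centroid."""
    pts = np.asarray(points, dtype=float)
    hull = ConvexHull(pts)
    verts = pts[hull.vertices]                 # the V hull vertices
    n, v = len(pts), len(hull.vertices)
    c = np.sqrt(n / (n - v))                   # estimated dilation constant
    g = verts.mean(axis=0)                     # centroid proxy g(H(D))
    return g + c * (verts - g), c              # dilated vertices, and c

# Unit square corners (the hull vertices) plus four interior points:
sample = [(0, 0), (1, 0), (1, 1), (0, 1),
          (0.5, 0.4), (0.5, 0.6), (0.4, 0.5), (0.6, 0.5)]
dilated, c = dilated_hull(sample)              # c = sqrt(8 / 4) ≈ 1.414
```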
The realization of a Poisson process within the union of k subsets D1, D2, ..., Dk can
be considered as the realization of k Poisson processes of the same intensity within
the k subsets D1, D2, ..., Dk (Neveu (1974)).
Let us denote by P* = {C1*, C2*, ..., Ct*} the optimal partition of E into t clusters
and by D̂_i^t the estimate of the convex compact set D_i^t containing Ci*; so D̂_i^t is
the dilated version of the convex hull H(Ci*).
We propose the following decision rule for estimating k, checking for t = 2, 3, ...:
• if D̂_1^2 ∩ D̂_2^2 ≠ ∅, then we conclude that there is no clustering structure in the
data;
• if, for all {i, j} ⊂ {1, 2, ..., t}, i ≠ j: D̂_i^t ∩ D̂_j^t = ∅, and
if, for any integer s with 2 ≤ s < t and for all {i, j} ⊂ {1, 2, ..., s}, i ≠ j:
D̂_i^s ∩ D̂_j^s = ∅, then we conclude that the natural partition contains at least t
clusters and we examine the partition into (t + 1) clusters;
• if there exists {i, j} ⊂ {1, 2, ..., t}, i ≠ j: D̂_i^t ∩ D̂_j^t ≠ ∅, and
if, for any integer s with 2 ≤ s < t and for all {i, j} ⊂ {1, 2, ..., s}, i ≠ j:
D̂_i^s ∩ D̂_j^s = ∅, then we conclude that the data set contains exactly t − 1 natural
clusters; so we have k = t − 1.
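The sequential loop of this rule can be sketched as follows; `disjoint(i, j, t)` is a hypothetical predicate standing in for the geometric test D̂_i^t ∩ D̂_j^t = ∅ on the dilated estimates:

```python
from itertools import combinations

def estimate_k(disjoint, t_max):
    """Sequential decision rule. `disjoint(i, j, t)` is a hypothetical
    predicate: True iff the dilated estimates of clusters i and j, in the
    optimal partition into t clusters, do not intersect."""
    if not disjoint(0, 1, 2):
        return 1                 # overlap at t = 2: no clustering structure
    for t in range(3, t_max + 1):
        if not all(disjoint(i, j, t) for i, j in combinations(range(t), 2)):
            return t - 1         # first overlap appears at level t: k = t - 1
    return t_max                 # all levels disjoint up to t_max

# Toy predicate: estimates stay disjoint up to t = 3 and overlap at t = 4,
# so the rule concludes there are 3 natural clusters.
k_hat = estimate_k(lambda i, j, t: t <= 3, t_max=6)  # 3
```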

3.3 A likelihood ratio test (M3)

The existence of an explicit model associated with the hypervolume method allows
us to formulate a likelihood ratio test for the number of groups.


Let x1, x2, ..., xn be a random sample from a Poisson process on k disjoint convex
sets D1, D2, ..., Dk in an m-dimensional Euclidean space, as explained in section 2.
For a given integer k ≥ 2, we test whether a subdivision into k clusters is significantly
better than a subdivision into k − 1 clusters, i.e. the hypothesis H0: t = k against the
alternative H1: t = k − 1.
Let us denote by
• C = {C1, C2, ..., Ck} the optimal partition of {x1, x2, ..., xn} into k clusters,
with respect to the hypervolume criterion;
• D = {D1, D2, ..., D_{k−1}} the corresponding optimal partition into k − 1 clusters.
The likelihood function can be written as follows (Hardy and Rasson (1982)):

f_D(x1, ..., xn) = Π_{j=1}^{n} I_D(xj) / (m(D))^n,   with D = ∪_i Di and m(D) = Σ_i m(Di),

where I_D is the indicator function of the set D.


So the likelihood ratio takes the form:

Q(x) = sup_D f_D(x; t = k − 1) / sup_C f_C(x; t = k) = ( W(P, k) / W(P, k − 1) )^n.

We have Q(x) ∈ [0, 1]. Since Q(x) is a monotone function of the ratio
S = W(P, k) / W(P, k − 1), we will reject H0 iff:

S = W(P, k) / W(P, k − 1) > C.

Unfortunately we do not know the exact distribution of the statistic S under H0 (as is
often the case with statistics derived from other clustering methods). Nevertheless,
in practice, we can use the following rule: reject H0 if S takes large values, i.e. if S is
close to 1. In practice we apply the test in a sequential way: if k0 is the smallest
value of k ≥ 2 for which we reject H0, we consider k0 − 1 as the appropriate
number of natural clusters.
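The sequential use of the test can be sketched as below; the criterion values and the critical value C are hypothetical, since the exact null distribution of S is unknown:

```python
def sequential_lr_test(W, C=0.9):
    """Apply the test sequentially. W maps k -> W(P*, k); the statistic
    S(k) = W(k) / W(k-1) is close to 1 when k clusters are not significantly
    better than k - 1. Return k0 - 1 for the smallest k >= 2 with S(k) > C
    (C is an assumed critical value)."""
    for k in sorted(W)[1:]:
        if W[k] / W[k - 1] > C:
            return k - 1
    return max(W)                # no rejection up to the largest k tried

# Hypothetical criterion values: a big drop up to k = 3, then a plateau.
values = {1: 10.0, 2: 6.0, 3: 2.0, 4: 1.9, 5: 1.85}
k_hat = sequential_lr_test(values)  # 3
```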
For simulation and comparison purposes, we have selected four other methods for
estimating k; these methods are well known in the scientific literature and were chosen
because they are available in the cluster analysis software CLUSTAN (Wishart,
1978): Wolfe's test (M4), the upper tail rule (M5), the moving average control rule (M6)
and Marriott's test (M7).

4. Results
In order to compare the methods for determining the appropriate
number of clusters, we have chosen six well-known clustering procedures (nearest
neighbour, furthest neighbour, centroid, Ward, k-means and hypervolume) and four
artificial data sets in R^m (well separated clusters, one single homogeneous group,
elongated clusters and the well-known Ruspini data). We have then applied the seven
methods for determining the best number of clusters to the results so obtained.

4.1 First set of data: well separated clusters

. " J ~ .t..
+ +
.p- ~ .." ......
... +

... t'" ...


+ ... +

t \... .. ..
..
~
. ~. t ....

. ! ....+..... ..
... +
+.. + +.f. +
tt.t. ..
.... .. .. ..+

Fig. 1: Three weIl separated clusters.


Here we have simulated a Poisson process with three well separated convex sets in a
two-dimensional space (Figure 1). We have obtained the following results for k with
our seven estimation procedures:

Well separated clusters |   | M1 | M2 | M3 | M4 | M5 | M6 | M7
nearest neighbour       | + |  3 |  - |  - |  3 |  2 |  3 |  3
furthest neighbour      | + |  5 |  - |  - |  3 |  2 |  3 |  3
centroid                | + |  3 |  - |  - |  3 |  3 |  3 |  3
Ward                    | + |  3 |  - |  - |  3 |  3 |  3 |  3
k-means                 | + |  3 |  - |  - |  3 |  - |  - |  3
hypervolume             | + |  3 |  3 |  3 |  - |  - |  - |  3

Tab. 1: Well separated clusters.

The first column of the table lists the names of the clustering procedures.
Here we have a very good clustering structure: for three clusters all six clustering
procedures reveal the true classification. This is expressed by a "+" in the second
column of Table 1. The seven last columns show the results given by the seven
methods for the determination of the optimal number of clusters.
For example, the application of the upper tail rule (M5) to the results given by the
furthest neighbour clustering procedure leads to the conclusion that there are two
clusters in the data of Figure 1.
Here the graphical elbow method (M1) gives the correct result for most of our
clustering procedures; this is implied by the fact that the three clusters are disjoint and
well separated. The two methods based on the hypervolume criterion (M2 and M3),
as well as the moving average control rule (M6) and Wolfe's test (M4), also give the
expected result. Let us also mention that Marriott's test (M7) performs well here;
the curve k² det(W) has its minimum value for k* = 3.
The sign "−" in one of the seven last columns of a table means that the basic
assumptions of the method are not fulfilled. For example, methods M2 and M3 are
based on the hypervolume criterion, so they are only applicable to the results given
by the hypervolume clustering procedure. M5 and M6 are valid only for hierarchical
clustering methods. The results of Table 1 show that we must be very careful; even
with the simplest examples, problems can arise: not all the methods retrieve the
expected number of clusters.
4.2 Second set of data: one single homogeneous group

Fig. 2: One single homogeneous group.


Here we have simulated a Poisson process in a single rectangular set in the plane; so
the 150 points are independently and uniformly distributed in this set (Figure 2).

Data without structure  | M1 | M2 | M3 | M4 | M5 | M6 | M7
nearest neighbour       |  1 |  - |  - |  1 |  2 |  2 |  X
furthest neighbour      |  * |  - |  - |  2 |  2 |  4 |  X
centroid                |  * |  - |  - |  2 |  3 |  3 |  X
Ward                    |  4 |  - |  - |  3 |  2 |  4 |  X
k-means                 |  4 |  - |  - |  2 |  - |  - |  X
hypervolume             |  1 |  1 |  1 |  - |  - |  - |  X

Tab. 2: One homogeneous group.

The results of Table 2 show that the three methods based on the hypervolume clustering
criterion are very efficient when the problem is to test if there is any grouping structure
in a data set; in this case the hypervolume criterion decreases monotonically with k.
The other methods are not very efficient. Furthermore, Marriott's test (M7) is not
applicable here since it is only valid when the number of clusters is greater than or equal
to two. This is expressed by an "X" in the eighth column of Table 2. A "*" in the
table means that the results given by a method are not clear enough to conclude.
4.3 Third set of data: elongated clusters

Fig. 3: Elongated clusters.


In this example, we have chosen three elongated clusters such that none is linearly
separable from the two others. The results obtained are given in Table 3. A "−" in
the second column of Table 3 means that the classification obtained when we fix the
number of clusters to three is not the expected one.

Elongated clusters      |   | M1 | M2 | M3 | M4 | M5 | M6 | M7
nearest neighbour       | + |  3 |  - |  - |  3 |  2 |  3 | 10
furthest neighbour      | - |  * |  - |  - |  3 |  2 |  2 | 10
centroid                | - |  * |  - |  - |  3 |  2 |  2 | 10
Ward                    | - |  3 |  - |  - |  3 |  2 |  3 | 10
k-means                 | - |  3 |  - |  - |  3 |  - |  - | 10
hypervolume             | + |  3 |  * |  3 |  - |  - |  - | 10

Tab. 3: Elongated clusters.

Only the nearest neighbour and the hypervolume clustering methods reproduce the
"natural" classification when we fix the number of clusters to three. Let us remark
that Wolfe's test (M4) yields the correct number of clusters when applied to the
results of the furthest neighbour, centroid, Ward and k-means procedures ... but the
classifications obtained by these clustering methods are not the "natural" ones.

4.4 Fourth set of data: Ruspini data

Fig. 4: Ruspini data.

This data set comprises 75 points in the plane and is often used to test clustering
procedures. Usually one recognizes four "natural" clusters.

Ruspini data            |   | M1 | M2 | M3 | M4 | M5 | M6 | M7
nearest neighbour       | + |  4 |  - |  - |  4 |  2 |  4 |  5
furthest neighbour      | + |  4 |  - |  - |  4 |  2 |  4 |  5
centroid                | + |  4 |  - |  - |  4 |  2 |  4 |  5
Ward                    | + |  4 |  - |  - |  4 |  2 |  4 |  5
k-means                 | + |  4 |  - |  - |  4 |  - |  - |  5
hypervolume             | + |  4 |  4 |  4 |  - |  - |  - |  5

Tab. 4: Ruspini data.

Here all six clustering procedures yield the "natural" classification of the data
when we fix the number of clusters to four.
Methods M1, M2, M3, M4 and M6 perform well. Method M5 fails to give the
correct result.

5. Conclusions

We have investigated the problem of deciding on the number of clusters present
in a data set, and in particular compared three selection methods based on
the hypervolume criterion with four other well-known methods.
It appears that there is a big problem when we want to choose a good clustering
procedure: some methods give the appropriate number of clusters, but based on a
"bad" classification. So we have to take into account the a priori information we
have on the clustering methods and the underlying hypotheses of each of them: the
chaining effect of the nearest neighbour method; the furthest neighbour, centroid,
Ward and k-means methods favour spherical clusters; the hypervolume method is
based on a hypothesis of convexity of the clusters, ...
However, all three methods based on the hypervolume criterion perform well.
For the graphical elbow method (M1), the decision on whether such plots contain the
necessary "discontinuity in slope" is likely to be very subjective. Nevertheless, M1
yields the most appropriate and clear results when we plot the hypervolume criterion
against the number of groups.
Method M2 usually works very well, but it may fail to obtain good results in the
presence of elongated clusters containing a small number of points. It would be
interesting and useful to find a more appropriate value for the dilation constant c
when we use it to detect the optimal number of clusters.
The likelihood ratio test (M3) seems to be one of the most interesting methods for
obtaining the appropriate number of natural clusters. Unfortunately, the distribution
of the test statistic is not known, but in practice it gives clear and relevant results.
Looking at all the results we have obtained, we can recommend using simultaneously
several cluster analysis techniques and different methods for determining the optimal
number of clusters, and analysing all the results in order to have more information
about the clusters: size, shape, convexity, tightness, separation, ...; that information
should then be taken into account in order to choose the "best" classification and to
interpret it carefully.

References:
ANDERBERG, M.R. (1973): Cluster Analysis for Applications. Academic Press, New York.
BOCK, H.H. (1985): On some significance tests in cluster analysis. Journal of Classification, 2, 77-108.
DIDAY, E. et Collaborateurs (1979): Optimisation en Classification Automatique. INRIA, Paris.
EVERITT, B. (1980): Cluster Analysis. Halsted Press, London.
GORDON, A.D. (1981): Classification. Chapman and Hall, London.
HARDY, A., and RASSON, J.P. (1982): Une nouvelle approche des problèmes de classification automatique. Statistique et Analyse des Données, 7, 41-56.
HARDY, A. (1983): Une nouvelle approche des problèmes de classification automatique. Un modèle - Un nouveau critère - Des algorithmes - Des applications. Ph.D. Thesis, F.U.N.D.P., Namur, Belgium.
HARDY, A. (1993): Criteria for determining the number of groups in a data set based on the hypervolume criterion. Technical report, FUNDP, Namur, Belgium.
MOORE, M. (1984): On the estimation of a convex set. The Annals of Statistics, 12, 1090-1099.
NEVEU, J. (1974): Processus ponctuels. Technical report, Laboratoire de Calcul des Probabilités, Université Paris VI.
RIPLEY, B.D., and RASSON, J.P. (1977): Finding the edge of a Poisson forest. Journal of Applied Probability, 14, 483-491.
WISHART, D. (1978): CLUSTAN User Manual, 3rd ed., Program Library Unit, University of Edinburgh.
