Clustering Technique
ISRAEL GITMAN AND
I. INTRODUCTION
The generated partition is optimal in the sense that the program detects all of the existing unimodal fuzzy sets and
realizes the maximum separation [21] among them. The
algorithm attempts to solve problems 1), 2), and 3) mentioned above; that is, it is economical in memory space and
computational time requirements and also detects groups
which are fairly generally distributed in the feature space
[Fig. 1(c)]. The algorithm is a systematic procedure (as opposed to an iterative technique) which always terminates
and the computation time is reasonable.
An important distinction between this procedure and the methods reported in the literature is that the latter use a distance measure (or certain average distances) as the only means of clustering. We have introduced another "dimension," the order of "importance" of every point, as an aid in the clustering process. This is accomplished by associating with every point in the set a grade of membership or characteristic value [21]. Thus the order of the points according to their grade of membership, as well as their order according to distance, is used in the algorithm. The latter partitions a sample from a multimodal fuzzy set into unimodal fuzzy sets.
In Section II the concept of a fuzzy set is extended in order
to define both symmetric and unimodal fuzzy sets. The
basic algorithm consists of the two procedures, F and S,
which are described in detail in Sections III and IV, respectively. Section V deals with the application of the algorithm
to the clustering of data and the various practical implications. Section VI discusses the experimental results. Possible
extensions of the algorithm to handle very large data sets
(say greater than 30 000 points) are presented in Section
VII. The conclusions are given in Section VIII.
[Section II defines symmetric and unimodal fuzzy sets in terms of the sets {x | f(x) ≥ f(x_i)} and {x | d(μ, x) < d(μ, x_i)}.]
This so-called "prime mode" will be the centroid of the data set, rather than a "mode" of a cluster. A measure of inhomogeneity is used to detect clusters one at a time.
The grade of membership is defined in [21], where the characteristic value of x_i is f(x_i).
Fig. 3. The points in S_i are denoted by x. The point x_k1 is on the boundary of S_i, since Γ_1 includes no sample points in S_i. The point x_k2 is an interior point in S_i, since Γ_2 includes points in S_i.
III. PROCEDURE F
Given a sample S= {(xi, fi)N} from a multimodal fuzzy set,
subject to certain conditions on f and S (see Theorem 1),
procedure F detects all the local maxima of f. It is divided
into two parts: in the first part, the sample is partitioned into
symmetric subsets and in the second, a search for the local
maxima in the generated subsets is performed.
In order to make the steps in the procedure clear, some
preliminary explanations are given below. An example
which demonstrates the procedure is presented later.
The number of groups (subsets) into which the sample is
partitioned is not known beforehand. The procedure is
initialized by the construction of two sequences: a sequence
A in which the points are ordered according to their grade
of membership, and a sequence A1 in which they are ordered
according to their distance to the mode of A (the first point
in A). The order of the points in the sequence A is the order
in which the points are considered for assignment into
groups. This process will initiate new groups when certain
conditions are satisfied. Whenever a group, say n, is initiated,
a sequence of points An is formed of all the points in S which
might be considered for assignment into group n. The first point in A_n is its mode, and the remaining points are its ordered candidates.
Part 1 of Procedure F
Let S = {(xi, fi)N} be a sample from a fuzzy set (assume, for simplicity, that f_i ≠ f_j for i ≠ j).
1) Initially it is required to generate the following two
sequences.
a) A = (y_1, y_2, ..., y_N) is a descending sequence of the points in the sample ordered according to their grade of membership; that is, f_j ≥ f_t for j ≤ t, where f_j and f_t are the grades of membership of y_j and y_t, respectively.
b) A_1 = (y_1^1, y_2^1, y_3^1, ..., y_N^1), where y_1^1 = y_1, is the sequence of the points ordered according to their distance to y_1^1; that is, d(y_1^1, y_j^1) ≤ d(y_1^1, y_t^1) for j ≤ t.
We will also refer to A_1 as the sequence of ordered "candidate" points to be assigned into group 1. Thus y_2^1 is the first candidate, and if it is assigned into group 1, then y_3^1 becomes the next candidate, and so on. We can therefore state that the current candidate point for group 1, y_j^1, is the nearest point to its mode y_1^1 (= μ_1), except for points that have already been assigned to group 1. This will hold true for any sequence A_i; that is, y_1^i = μ_i is the mode for group i, and y_j^i is its current candidate point.
2) If y_i^1 = y_i for i = 2, 3, ..., r−1, and y_r^1 ≠ y_r, then y_i, i = 1, 2, ..., r−1, are assigned into group 1 and a new group is initiated with y_r (= μ_2) as its mode. That is, the sequence A_2 = (y_1^2, y_2^2, y_3^2, ..., y_{N_2}^2) is generated. The latter includes, from among the points that have not yet been assigned, those points which are closer to y_r than the shortest distance from y_r to the points that have already been assigned; this is shown for one dimension in Fig. 4. The points in A_2 are now ordered according to their distance to y_r; that is, d(y_r, y_j^2) ≤ d(y_r, y_t^2) for j ≤ t.
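The assignment step of part 1 can be sketched as follows. This is a simplified reading of the procedure, not the authors' program; the helper name `procedure_f_part1`, its interface, and the tie-breaking are assumptions. A point that matches the current candidate of an existing group joins that group; otherwise it initiates a new group whose candidate sequence A_n contains the unassigned points closer to it than any already-assigned point.

```python
import numpy as np

def procedure_f_part1(X, f):
    """Sketch of part 1 of procedure F (hypothetical helper, simplified).

    X: list of points, f: their grades of membership.
    Returns (group_of, modes): a group id per point and the mode of each group.
    """
    X = [np.asarray(x, dtype=float) for x in X]
    N = len(X)
    order = [int(i) for i in np.argsort(-np.asarray(f, dtype=float))]  # sequence A
    group_of = [-1] * N
    modes, queues = [], []                      # mode index and candidate sequence A_i per group

    def d(i, j):
        return float(np.linalg.norm(X[i] - X[j]))

    for y in order:
        if group_of[y] != -1:
            continue
        # is y the current candidate of an existing group?
        target = next((g for g, q in enumerate(queues) if q and q[0] == y), None)
        if target is not None:
            group_of[y] = target
            queues[target].pop(0)               # the next candidate moves up
        else:
            # initiate a new group with y as its mode
            assigned = [i for i in range(N) if group_of[i] != -1]
            if assigned:
                R = min(d(y, i) for i in assigned)   # shortest distance to assigned points
                members = [i for i in range(N)
                           if group_of[i] == -1 and i != y and d(y, i) < R]
            else:                               # first group: every other point is a candidate
                members = [i for i in range(N) if i != y]
            members.sort(key=lambda i: d(y, i)) # order A_n by distance to the mode
            group_of[y] = len(queues)
            modes.append(y)
            queues.append(members)
    return group_of, modes
```

On a toy one-dimensional sample with two well-separated clusters and grades of membership decreasing away from each cluster center, the sketch initiates one group per cluster.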
Fig. 4. At the stage where the point x_i (= μ_i) initiates a new group, all the sample points that have already been assigned are in the domain Γ_p. Thus the nearest point in Γ_p to x_i is at a distance R_i, which defines the domain Γ_i of all the points which are at a shorter distance to x_i than R_i. The sample points in Γ_i will be ordered as candidate points to be assigned into the group in which x_i is the mode.
Fig. 5. The characteristic function f and the 30-point sample for the example are shown. The dotted lines indicate the partition (the sets S_i) resulting from the application of part 1 of procedure F. We can observe that x_15 and x_25 are the only interior modes in the partition and thus will be recognized as the local maxima points (v_i) of f.
Part 2

Let {S_i, μ_i} be the partition of S generated by part 1 of procedure F. For every mode μ_i and set S_i, a point x_t and a distance R_t are found in order to detect which of the modes are interior points. This is done according to the definition given in the previous section.

In the example of Fig. 5, the sequences A = (y_1, y_2, ..., y_30) = (x_15, x_14, ..., x_30) and A_1 are first constructed; the first point in A that is not the current candidate of an existing group will initiate a new group, and a new sequence A_2 will be generated.
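The interior-point test of part 2 can be sketched as follows, based on the definition illustrated in Fig. 3; the helper `is_interior` and its interface are assumptions, not the authors' code.

```python
import numpy as np

def is_interior(k, members, X):
    """Test whether point k of the set S_i (index list `members`) is interior:
    the ball around x_k reaching to the nearest sample point outside S_i must
    contain at least one other point of S_i (cf. Fig. 3)."""
    X = [np.asarray(x, dtype=float) for x in X]
    inside = [i for i in members if i != k]
    outside = [i for i in range(len(X)) if i not in members]
    if not outside:
        return True                                   # no outside point bounds the ball
    R = min(float(np.linalg.norm(X[k] - X[i])) for i in outside)
    return any(float(np.linalg.norm(X[k] - X[i])) < R for i in inside)
```

A mode that passes this test is recognized as a local maximum v_i of f; a mode that fails lies on the boundary of its set.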
After the first four groups are initiated, the sequences A_i and the resulting partition to this point are as follows:

A = (y_1, y_2, ..., y_30) = (x_15, x_14, x_16, x_13, x_12, x_11, x_17, x_25, x_24, ..., x_26, ..., x_21)

A_1 = (y_1^1, y_2^1, ..., y_30^1) = (x_15, x_14, x_16, x_13, x_17, x_12, ...)
S_1 = (x_15, x_14, x_16, x_13, x_17, ...)

A_2 = (y_1^2) = (x_12)
S_2 = (x_12)

A_3 = (y_1^3, y_2^3, y_3^3) = (x_11, x_10, x_9)
S_3 = (x_11)

A_4 = (y_1^4, y_2^4, ..., y_13^4) = (x_25, x_24, x_26, x_23, x_27, x_28, x_22, x_29, x_21, x_30, x_20, x_19, x_18)
S_4 = (x_25, x_24)

In relation to the procedure described above, we note the following.

1) The sequence A_2 includes only one point (its mode), since the nearest point to x_12 in S has already been assigned. Therefore there are no sample points in S to generate a symmetric fuzzy set whose mode is x_12.
2) At the stage shown, x_10 in the sequence A is to be assigned. The candidate points for the four groups that have already been initiated are x_12, no candidate, x_10, and x_26, respectively. Thus x_10 will be assigned to group 3, since it is identical with the latter's candidate point.
3) No more points will be assigned into group 1, since its candidate x_12 has already been assigned to another group, and thus cannot be replaced as a candidate for group 1.

The resulting symmetric fuzzy sets generated by the application of part 1 of procedure F are shown in Fig. 5. Part 2 of the procedure is now applied to each of the 13 modes to detect which of these are interior points. In Fig. 5 we can see that only the modes x_15 and x_25 are interior points in their corresponding sets, and therefore only two local maxima are discovered. Based on this partial result, the example will be continued at the end of the next section in order to demonstrate procedure S.

IV. PROCEDURE S

Procedure S partitions a sample from a fuzzy set into unimodal fuzzy sets, provided the local maxima of f are known. Thus this procedure uses the information obtained from the application of procedure F; that is, the number, location, and characteristic values of the local maxima of f. The rule for assigning the points differs from the known classification rules appearing in the pattern recognition literature. Rather than an arbitrary order, which is the usual case, the points are finally assigned in the order in which they appear in the sequence A.

Specifically, let S = {(x_i, f_i)N} be a sample from a fuzzy set, and {(v_i, f(v_i))K} ⊂ S be the sample of the K local maxima of f. Assume that f(x_i) ≠ f(x_j) for i ≠ j, and f(v_i) > f(v_j) for i < j. Let A be the sequence of the points ordered according to their grade of membership, and suppose that the K local maxima of f are in locations p_i, i = 1, ..., K, in A. We can infer the following proposition.

Proposition: The point x_j in location j in the sequence A, p_M < j < p_{M+1}, M < K, can only be assigned into one of the groups i ∈ I_M = {1, 2, ..., M}.

If f(x_{p_r}), r = M+1, M+2, ..., K, is the local maximum of group r, then only points with a lower grade of membership can be assigned into group r. Since all the points that precede location p_r in A have higher grades of membership, none of them can be assigned into group r, r = M+1, M+2, ..., K.

This proposition implies that all the points in A which are found in the locations p_1 < j < p_2 will automatically be assigned into group 1; the points in locations p_2 < j < p_3 will be divided between group 1 and group 2, and so on.

Procedure S uses the following rule: assign the point x_j in location j in the sequence A into the group in which its nearest neighbor with a higher grade of membership (all the points preceding x_j in A) has been assigned. This rule applies to all the points with the exception of the local maxima, which initiate new groups. Note that the rule is different from the "nearest neighbor classification rule" [5] because of the particular order in which the points are introduced.

Theorem 2: Let f be a piecewise continuous characteristic function of a fuzzy set. Let S = {(x_i, f_i)N} be an infinite sample from f, such that

1) for every x_i in the domain of f and for an α > 0, the set Γ = {x | d(x_i, x) ≤ α/2} includes at least one sample point in S.

If α → 0, then procedure S partitions the given sample into unimodal fuzzy sets.

Theorem 3: Let S be a sample from a fuzzy set with a characteristic function f. Let f and S be constrained as in Theorem 2. If α → 0, then every final set is a union of the sets S_i generated in part 1 of procedure F.

If X = E^n, a more powerful result than Theorem 2 can be stated; for simplicity we will state it for the case of two local maxima.

Theorem 4: Let f be a piecewise continuous characteristic function of a fuzzy set and d̄ the distance between its two local maxima. Let S be a sample from f, such that

1) for every point x_i in the domain of f and for a finite α > 0, α ≤ d̄, the set Γ = {x | d(x_i, x) ≤ α} includes at least one point in S, and
2) the local maxima, (v_1, f(v_1)), (v_2, f(v_2)), are in S.
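The assignment rule of procedure S can be sketched as follows. This is a simplified reading of the rule, not the authors' program; the function name `procedure_s` and its interface are assumptions.

```python
import numpy as np

def procedure_s(X, f, maxima):
    """Sketch of procedure S: process points in order of decreasing grade of
    membership (the sequence A). A local maximum initiates its own group; any
    other point joins the group of its nearest neighbor among the points that
    precede it in A, i.e., its nearest neighbor with a higher grade."""
    X = [np.asarray(x, dtype=float) for x in X]
    order = [int(i) for i in np.argsort(-np.asarray(f, dtype=float))]  # sequence A
    group_of = {}
    for j in order:
        if j in maxima:
            group_of[j] = maxima.index(j)        # the local maximum v_i starts group i
        else:
            prev = list(group_of)                # points preceding j in A
            nn = min(prev, key=lambda i: float(np.linalg.norm(X[i] - X[j])))
            group_of[j] = group_of[nn]
    return group_of
```

Because each point is introduced only after all points of higher grade, the rule differs from ordinary nearest-neighbor classification even though the distance computation is the same.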
TABLE I*
[Sizes of the groups (1-20) generated for the spherical and ellipsoidal data sets at σ = 15, 20, 25; the alignment of the entries into columns is not reliably recoverable from the scan.]
* n(n1, n2, n3) indicates that there is a total of n points in the corresponding group of which n1, n2, and n3 are from different categories.
† In this case 5 additional groups of 4, 2, 2, 2, 2 points, respectively, were generated.
TABLE II

Data         σ    T      ΔJ(v) (percent)   Em (percent)   Et (percent)   CPU (minute)
Spherical    15   4000        92               0.1            9.1           3.40
Spherical    20   2500        17              10.3           11.6           5.29
Spherical    25   4000        18              10.5           12.8           5.10
Ellipsoidal  15   3500        65               0.1            9.7           3.33
Ellipsoidal  20   4000        31               0.3           16.8           5.59
Ellipsoidal  25   4500        15              18.7           23.9           5.04

The prototype vectors which have been used for the data sets are listed in Appendix II.
of equal probability density have different shapes and orientations (see [7]). The reference partition that we have used is the partition into the original ten categories of 100 points each. It is appreciated that this partition cannot be achieved by any clustering technique because of overlapping among the categories, in particular for the case of σ = 25. Two types of errors have been used to grade the partitions.

1) Em, the mixing error, defines the error caused by some of the points of category i being assigned to category j, i ≠ j; it is therefore a result of the possible overlapping among the categories or the linking of several categories.
2) Et, the total error, consists of Em plus the error produced by the generation of small clusters not in the original set of ten. These small clusters are the result of the fact that a finite sample from a Gaussian distribution can be made up of several modes.
TABLE III
[Group sizes for the ellipsoidal data set at σ = 15 (one configuration with T = 3500); the alignment of the entries into columns is not reliably recoverable from the scan.]
TABLE IV
Ellipsoidal, σ = 15, T2 = 3500

f(v1)   Em (percent)   Et (percent)   CPU
65          0.1             9.7       3.33
70          0              11.2       3.29
70          5.8             9.5       3.33
Threshold Filtering
In this process we reduce the sample size before applying
procedures F and S. A small threshold T1 is employed for
filtering purposes while a large value T2 (equivalent to T in
the previous section) is used to evaluate the final grade of
membership.
The first point, say x_1, is introduced. Then all the other points are introduced sequentially and the distance from x_1 to every point is measured. If d(x_1, x_i) ≤ T_1, then the grade of membership of x_1 is increased by 1; the corresponding point x_i is finally assigned into the group into which x_1 will later be assigned. Thus x_i is not considered further in the application of procedures F and S. On the other hand, if d(x_1, x_i) > T_1, then x_i will be introduced again in a later pass, until every point has been assigned. When this process of filtering is terminated, there remains a smaller set of points, x_1, x_2, ..., x_N', with the temporary grades of membership n_1, n_2, ..., n_N', where
Σ_{i=1}^{N'} n_i = N;

then set f(x_i) = n_i.
If N' is of a size that can be handled by the available computer, then the algorithm can be employed; if not, a further filtering stage can be imposed in the same manner. Although threshold filtering has been used before, it has a particular significance here: the points which are filtered out still contribute to the partition of the entire set, since they are represented in the grades of membership of the points which are retained for clustering.
It is suggested that the clustering algorithm reported in this paper possesses three advantages over the ones discussed in the literature.

1) It does not require a great amount of fast core memory and therefore can be applied to large data sets. The storage requirement is (20N + CN + S) bytes, where N is the number of points to be partitioned, 20N and CN are required for the fixed portion of the program and the variable-length data sequences (A, A_i), respectively, and S is the number of storage locations required for the given set of data points. Obviously, S depends on the particular resolution of the magnitude of the components of the data vectors.
2) The amount of computing time is relatively small.
3) The shape of the distribution of the points in a group (category) can be quite general because of the distributions that the unimodal fuzzy sets include. This can be an advantage, especially in practical problems in which the categories are not distributed in "round" clusters.

APPENDIX I

Proof of Theorem 1

Lemma: The sets S_i are disjoint symmetric fuzzy sets.

Proof: Let A_ii define a subsequence of A_i of the points that have been assigned into group i, arranged in the order that they stand in A. Let A_i be the sequence of candidate points to be assigned into group i. Bearing in mind procedure F, any two points x_p and x_q can be assigned to the same set S_i if and only if their order in A_ii corresponds to their order in A_i. Suppose their order does not correspond; that is,

A_ii = (..., x_p, ..., x_q, ...)
and
A_i = (..., x_q, ..., x_p, ...).

Then x_p in A_ii must be assigned first. But x_q precedes x_p as a candidate to be assigned into group i; thus x_q will prevent x_p from being assigned into S_i, since it is not replaced as a candidate point unless it is assigned to S_i. Thus if N_i is the number of points that have been assigned into group i, then for every n ≤ N_i,

d(μ_i, x_n) ≥ d(μ_i, x_j)   for j = 1, 2, ..., n−1, x_j ∈ S_i,
and
f(x_n) > f(x_r)   for r = n+1, ..., N_i, x_r ∈ S_i,

so that each S_i is a symmetric fuzzy set. The sets are also disjoint: suppose x_q has been assigned to group j, j ≠ i. Then there exists a subset S_ii ⊂ S_i which satisfies the condition d(μ_i, x_r) > d(μ_i, x_q) for x_r ∈ S_ii. Thus x_q precedes all the points in S_ii in the sequence A_i. Since x_q is assigned to group j, j ≠ i, it will not be replaced as the candidate point in group i, and thus will block all the points in S_ii from being assigned into group i.

The lemma implies that if μ_i is not a local maximum, then it must be on the boundary of S_i. It remains to be shown that if it is a local maximum of f, then it is an interior point. If μ_i is a local maximum, then assumption 1 of Theorem 1 implies that the subset S_i is

S_i = {x | d(μ_i, x) ≤ η},  where η ≥ ε/2.   (1)

Now let x_t be the sample point such that

R_t = d(μ_i, x_t) = min_{x_k ∈ (S − S_i)} [d(μ_i, x_k)].

Assumption 1 implies that R_t ≥ ε. To show that the set Γ_t = {x | d(x_t, x) < R_t} includes at least one sample point in S_i, we may consider the line segment joining x_t and μ_i, and the point x_in on this line such that d(x_in, μ_i) = ε/2. Defining the set Γ = {x | d(x_in, x) < ε/2}, assumption 2 assures that Γ includes at least one sample point, and (1) shows that this point is in S_i.

Proof of Theorem 2

Without loss of generality, let us assume that f has only two local maxima. Let H be the optimal hypersurface separating f into the two unimodal fuzzy sets, and S_1 and S_2 the optimal partition of S. Suppose that (n−1) points have already been assigned correctly, thus generating the sets S_1^(n−1) ⊂ S_1 and S_2^(n−1) ⊂ S_2, and that x_n ∈ S_1 is the point to be assigned next. It is sufficient to show that there exists a sample point x_v ∈ S_1^(n−1) such that

d(x_n, x_v) < min_{x_p ∈ S_2^(n−1)} [d(x_n, x_p)].

Let

Γ_1 = {x | d(x_n, x) ≤ α/2},   f(u) = sup_{x ∈ Γ_1} [f(x)],

and Γ = {x | d(u, x) ≤ α/2}. Clearly, f(u) ≥ f(x_n), since x_n is not a local maximum. In the limit when α → 0, f(x_n) < f(x) for every x ∈ Γ_1. Let x_v be the point such that

d(x_n, x_v) = min_{x_j ∈ (Γ ∩ S)} [d(x_n, x_j)];

in the limit, x_v ∈ S_1^(n−1), and the required inequality follows.
Proof of Theorem 3

In this proof we make use of the lemma to Theorem 1. Note that in the proof of this lemma none of the constraints of Theorem 1 were applied; thus the sets S_i generated by part 1 of procedure F are always symmetric and disjoint fuzzy sets.

Let us assume that f has only two maxima and let H be the optimal hypersurface separating f into the two unimodal fuzzy sets. It is sufficient to show that if S_i is a set generated by the above procedure, then it is on one side (either inside or outside) of H. Then the application of Theorem 2 will complete the proof.

Suppose that S_i includes points on both sides of H, say the sets S_i1 and S_i2 (S_i1 ∪ S_i2 = S_i), and suppose that μ_i (the mode of S_i) is in S_i1. Then there exist points x_2 ∈ S_i2 and x_r ∈ S_i1 with f(x_2) > f(x_r) and d(μ_i, x_r) < d(μ_i, x_2), which implies that S_i is not symmetric. This contradicts the above assumption. An application of Theorem 2 completes the proof, since if S_1 and S_2 is the optimal partition, then S_1 is on one side of H and S_2 is on its other side.
Proof of Theorem 4

Let S_1 and S_2 denote the optimal partition of S. Suppose that (n−1) points have already been assigned correctly, thus generating S_1^(n−1) ⊂ S_1 and S_2^(n−1) ⊂ S_2, and suppose x_n ∈ S_1 is the next point to be assigned. Let x_u be the point such that

d(x_n, x_u) = min_{x_j ∈ (S_1^(n−1) ∪ S_2^(n−1))} [d(x_n, x_j)].
APPENDIX II

The vectors are the first ten of the eighty prototype vectors given in [8].
[The components of the ten prototype vectors v_1, v_2, ..., v_10 are not reliably recoverable from the scan.]
REFERENCES
[1] G. H. Ball, "Data analysis in the social sciences: What about the
details?" 1965 Fall Joint Computer Conf. AFIPS Proc., vol. 27, pt. 1.
Washington, D. C.: Spartan, 1965, pp. 533-559.
[2] G. H. Ball and D. J. Hall, "ISODATA, A novel method of data analysis
and pattern classification," Stanford Research Institute, Menlo Park,
Calif., April 1965.
[3] R. E. Bonner, "On some clustering techniques," IBM J. Res. and Develop., vol. 8, pp. 22-32, January 1964.
[4] R. G. Casey and G. Nagy, "An autonomous reading machine,"
IEEE Trans. Computers, vol. C-17, pp. 492-503, May 1968; also IBM
Corp., Yorktown Heights, N. Y., Research Rept. RC-1768, February 1967.
[5] T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Trans. Information Theory, vol. IT-13, pp. 21-27,
January 1967.
[6] A. A. Dorofeyuk, "Teaching algorithms for a pattern recognition
machine without a teacher based on the method of potential functions," Automation and Remote Control, vol. 27, pp. 1728-1737,
December 1966.
[7] R. O. Duda and H. Fossum, "Pattern classification by iteratively
determined linear and piecewise linear discriminant functions," IEEE
Trans. Electronic Computers, vol. EC-15, pp. 220-232, April 1966.
[8]
, "Computer-generated data for pattern recognition experiments," available from C. A. Rosen, Stanford Research Institute,
Menlo Park, Calif., 1966.
[9] W. D. Fisher, "On grouping for maximum homogeneity," Amer. Stat. Assoc. J., vol. 53, pp. 789-798, 1958.
[10] O. Firschein and M. Fischler, "Automatic subclass determination for
pattern-recognition applications," IEEE Trans. Electronic Computers
(Correspondence), vol. EC-12, pp. 137-141, April 1963.
[11] E. W. Forgy, "Detecting natural clusters of individuals," presented
at the 1964 Western Psych. Assoc. Meeting, Santa Monica, Calif.,
September 1964.
[12] J. A. Gengerelli, "A method for detecting subgroups in a population
and specifying their membership," J. Psych., vol. 55, pp. 457-468,
1963.
[13] T. Kaminuma, T. Takekawa, and S. Watanabe, "Reduction of
clustering problem to pattern recognition," Pattern Recognition, vol.
1, pp. 195-205, 1969.
[14] J. MacQueen, "Some methods for classification and analysis of
multivariate observations," Proc. 5th Berkeley Symp. on Math.
Statist. and Prob. Berkeley, Calif.: University of California Press,
1967, pp. 281-297.
[15] R. L. Mattson and J. E. Dammann, "A technique for determining
and coding subclasses in pattern recognition problems," IBM J. Res.
and Develop., vol. 9, pp. 294-302, July 1965.
[16] G. Nagy, "State of the art in pattern recognition," Proc. IEEE, vol.
56, pp. 836-862, May 1968.
[17] D. J. Rogers and T. T. Tanimoto, "A computer program for classifying plants," Science, vol. 132, pp. 115-118, October 1960.
[18] C. A. Rosen and D. J. Hall, "A pattern recognition experiment with near-optimum results," IEEE Trans. Electronic Computers (Correspondence), vol. EC-15, pp. 666-667, August 1966.
[19] J. Rubin, "Optimal classification into groups: An approach for solving the taxonomy problem," IBM Rept. 320-2915, December 1966.
[20] J. H. Ward, "Hierarchical grouping to optimize an objective function," Amer. Stat. Assoc. J., vol. 58, pp. 236-244, 1963.
[21] L. A. Zadeh, "Fuzzy sets," Information and Control, vol. 8, pp. 338-353, 1965.