For some other clustering algorithms, the overall or global result is not easily interpreted in terms of a global objective function.
Finally, some clustering algorithms operate by mapping the clustering problem to a different domain and solving a related problem in that domain. For instance, a proximity matrix defines a weighted graph, where the nodes are the points being clustered and the weighted edges represent the proximities between points, i.e., the entries of the proximity matrix. Clustering is then equivalent to breaking the graph into connected components, one for each cluster.
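As a simple illustration of this graph-based view, a proximity (similarity) matrix can be thresholded and the connected components of the resulting graph taken as clusters. The sketch below is only a minimal example of this idea; it assumes SciPy and that larger entries mean greater similarity.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def graph_clusters(proximity, threshold):
    """Keep edges whose similarity is at least `threshold` and return the
    connected components of the resulting graph as cluster labels."""
    adjacency = (np.asarray(proximity) >= threshold).astype(int)
    np.fill_diagonal(adjacency, 0)  # ignore self-similarities
    n_clusters, labels = connected_components(adjacency, directed=False)
    return n_clusters, labels
```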
While some approaches have been proposed for organizing clustering algorithms into groups, all
of these approaches are somewhat arbitrary since, as the previous discussion indicates, there are
many different ways in which clustering algorithms can be similar and different. However, in the
upcoming discussion of specific clustering algorithms, we shall use the following groupings and order.
1. Center-based Clustering
2. Hierarchical Clustering
3. Density-Based Clustering
4. Graph-based Clustering
5. Subspace Clustering
6. Mixture Models
7. Selected Clustering Techniques
In the section on Selected Clustering Techniques, we will cover a variety of clustering techniques that were not covered in the preceding sections and that illustrate some important issues, e.g., scalability to large data sets.
5.6.1 K-means
Basic Algorithm
The K-means clustering technique is very simple, and we begin immediately with a description of the basic algorithm (see Algorithm 1). Note that K is a user-specified parameter, i.e., the number of clusters desired.
The initial centroids, or seeds, as they are often called, are typically chosen randomly from the
set of all data points. Because of this, the set of clusters produced by the K-means algorithm can
vary from one run to another. This raises the issue of how to evaluate which clustering is best, a
topic that will be discussed shortly.
The proximity measure used to evaluate closeness varies, depending on the type of data. For
instance, if we have points in two or three dimensional space, Euclidean distance is appropriate, while
for documents, the cosine measure is typically used. While any proximity measure could potentially be employed, termination of the algorithm may not be guaranteed for an arbitrary measure, as it is for Euclidean distance and the cosine measure.
The centroid associated with a cluster is typically the mean (average) of the points in the cluster. (This is the reason the algorithm is called K-means.) In particular, if $C_i$ is the $i$th cluster, $|C_i|$ is the number of points in the $i$th cluster, $x$ is a point in that cluster, and $x_j$ is the $j$th attribute of that point, then the $j$th attribute (component) of $m_i$, the mean of the $i$th cluster, is defined by $m_{ij} = \frac{1}{|C_i|} \sum_{x \in C_i} x_j$. In other words, the mean of a set of points is obtained by taking the mean of each attribute for the points in the cluster.
In the absence of numerical problems, this algorithm always converges to a solution, i.e., reaches
a state where no points are shifting from one cluster to another, and hence, the centroids don't change. However, since most of the convergence occurs in the early steps, the condition on line 5 is
often replaced by a weaker condition, e.g., repeat until only 1% of the points change clusters.
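To make the basic algorithm concrete, a minimal NumPy sketch is given below. The function name, signature, and the exact stopping rule (centroids no longer changing) are our own choices rather than part of the original Algorithm 1.

```python
import numpy as np

def kmeans(points, k, max_iter=100, rng=None):
    """Basic K-means: assign each point to its closest centroid, recompute
    centroids as cluster means, and repeat until the centroids stop changing."""
    rng = np.random.default_rng(rng)
    # Choose K initial centroids (seeds) randomly from the data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # Assignment step: Euclidean distance to every centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # no change: converged
            break
        centroids = new_centroids
    return centroids, labels
```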
To measure the quality of a clustering, we use the sum of the squared error (SSE):

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \sum_{j=1}^{n} (m_{ij} - x_j)^2 \qquad (5.12)$$
In other words, for each data point, we calculate its error, i.e., its Euclidean distance to the closest centroid, and then, as a measure of the overall goodness of the clustering, we calculate the total sum of the squared errors.
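For reference, the SSE of a clustering can be computed directly from the points, their cluster labels, and the centroids; the short function below (name and signature our own) does exactly the computation in Equation 5.12.

```python
import numpy as np

def total_sse(points, labels, centroids):
    """Sum of squared Euclidean distances of each point to its assigned centroid."""
    return float(((points - centroids[labels]) ** 2).sum())
```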
Even though the initial centroids appear to be better distributed, a suboptimal clustering, i.e., one with higher squared error, results. For both figures, the positions of the cluster centroids in the various iterations are indicated by large crosses.
One technique that is commonly used to address the problem of choosing initial centroids is to
perform multiple runs, each with a different set of randomly chosen initial centroids, and then select
the set of clusters with the minimum SSE. However, this strategy may not work very well, depending
on the data set and the number of clusters sought. We constructed an example, shown in Figure
5.16, to show what can go wrong. The data consists of 5 pairs of clusters, where the clusters in a
pair of clusters are closer to each other than to any other cluster. The probability that an initial
centroid will come from any given cluster is 0.10, but the probability that each cluster will have
exactly one initial centroid is much lower. (It should be clear that having one initial centroid in each
cluster is usually a very good starting situation.) In general, if there are K clusters and each cluster
has n points, then the probability, P , of selecting one initial centroid from each cluster is given by
the following equation. (This assumes sampling with replacement.)
$$P = \frac{K!\,n^{K}}{(Kn)^{K}} = \frac{K!}{K^{K}} \qquad (5.13)$$
From this formula we can calculate that the chance of having one initial centroid from each cluster is $10!/10^{10} \approx 0.00036$.
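As a quick numerical check of Equation 5.13, the value for K = 10 can be computed directly:

```python
from math import factorial

K = 10  # ten clusters (five pairs of clusters in the example)
p = factorial(K) / K**K  # probability that each cluster receives exactly one seed
print(f"P = {p:.5f}")    # prints P = 0.00036
```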
Despite this, an optimal clustering will be obtained as long as two initial centroids fall anywhere in
a pair of clusters, since the centroids will redistribute themselves, one to each cluster. Unfortunately,
it is relatively likely that at least one pair of clusters will have only one initial centroid. In this latter
case, because the pairs of clusters are far apart, the K-means algorithm will not redistribute the
centroids between pairs of clusters, and thus, only a local minimum will be achieved.
Figure 5.17 shows that if we start with two initial centroids per pair of clusters, then, even when both initial centroids are in a single cluster, the centroids will redistribute themselves so that the
true clusters are found. However, Figure 5.18 shows that if a pair of clusters has only one initial
centroid, and another pair has three, then two of the true clusters will be combined and one cluster
will be split.
Because of the problems with using random seeds for initialization, problems that even repeated
runs may not overcome, other techniques are often used for finding the initial centroids. One effective
technique is to take a sample of points and cluster it using a hierarchical clustering technique. K
clusters are extracted from the hierarchical clustering and the centroids of those clusters are used as
the initial centroids. This approach often works well, but is practical only if the sample is relatively
small, e.g., a few hundred to a few thousand (hierarchical clustering is expensive), and if K is
relatively small compared to the sample size. We will illustrate this approach in the section on
hierarchical clustering.
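A minimal sketch of this initialization strategy, assuming SciPy's hierarchical clustering routines and a sample size chosen by the caller, might look as follows; Ward's method is used here purely as one reasonable choice of hierarchical technique.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hierarchical_seeds(points, k, sample_size=500, rng=None):
    """Cluster a small random sample hierarchically and return the means of the
    resulting K clusters for use as initial centroids for K-means."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(points), size=min(sample_size, len(points)), replace=False)
    sample = points[idx]
    tree = linkage(sample, method="ward")               # agglomerative clustering
    labels = fcluster(tree, t=k, criterion="maxclust")  # cut the tree into K clusters
    return np.array([sample[labels == c].mean(axis=0) for c in range(1, k + 1)])
```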
Later on we will discuss two other approaches that are useful for producing better quality (lower
SSE) clusterings: using a variant of K-means that is less susceptible to initialization problems, e.g.,
bisecting K-means, and using post-processing to fix up the set of clusters produced.
[Figure: K-means clustering iterations 1–6, panels (a)–(f); plots not reproduced.]
[Figure: K-means clustering iterations 1–5, panels (a)–(e); plots not reproduced.]
Figure 5.17. Five Pairs of Clusters with a Pair of Initial Centroids Within Each Pair of Clusters. [Plots not reproduced.]
Figure 5.18. Five Pairs of Clusters with More or Fewer than Two Initial Centroids Within a Pair of Clusters. [Plots not reproduced.]
In some applications, e.g., financial analysis, apparent outliers can be the most interesting points, e.g., unusually profitable customers.
Another common approach uses post-processing to try to fix up the clusters that have been
found. For example, small clusters are often eliminated since they frequently represent groups of
outliers. Also, two small clusters that have relatively low SSE and are close together can be merged,
while a cluster with relatively high SSE can be split into smaller clusters.
ISODATA is a version of K-means that uses the previously discussed postprocessing techniques to
produce high quality clusters. ISODATA first chooses initial centroids in a manner that guarantees
that they will be well separated. More specifically, the first initial centroid is the mean of all data
points, and each succeeding centroid is selected by considering each point in turn, and selecting any
point that is at least a distance, d (a user-specified parameter), from any initial centroid selected so
far. ISODATA then repeatedly a) finds a set of clusters using the basic K-means algorithm and b)
applies postprocessing techniques. Execution halts when a stopping criterion is satisfied. Note that the
K-means algorithm is run after postprocessing because the clusters that result from postprocessing
do not represent a local minimum with respect to SSE. The ISODATA algorithm is shown in more
detail below. The reader should note that ISODATA has a large number of parameters.
Algorithm 2 ISODATA. [Detailed step-by-step listing not reproduced.]
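The seed-selection step described above, i.e., start with the overall mean and then accept any point that is at least a distance d from every centroid chosen so far, can be sketched in a few lines; the function name and array-based representation are our own.

```python
import numpy as np

def isodata_seeds(points, d):
    """Well-separated initial centroids: the overall mean first, then any point
    at least distance d from all centroids selected so far."""
    centroids = [points.mean(axis=0)]
    for p in points:
        if all(np.linalg.norm(p - c) >= d for c in centroids):
            centroids.append(p)
    return np.array(centroids)
```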
The splitting phase of ISODATA is an example of strategies that decrease the total SSE by
increasing the number of clusters. Two general strategies are given below.
Split a cluster. Often the cluster with the largest SSE is chosen, but in the case of ISODATA the
cluster with the largest standard deviation for one particular attribute is chosen.
Introduce a new cluster centroid. Often the point that is farthest from any cluster center is
chosen.
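For instance, the second strategy, introducing a new centroid at the point farthest from any existing cluster center, might be implemented as follows (the function name is our own):

```python
import numpy as np

def farthest_point(points, centroids):
    """Return the point whose distance to its closest centroid is largest."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return points[dists.min(axis=1).argmax()]
```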
Likewise, the merging phase of ISODATA is just one example of strategies that decrease the
number of clusters. (This typically increases SSE.) Two general strategies are given below.
Disperse a cluster. In other words, remove the centroid that corresponds to the cluster and reassign the points to other clusters. Ideally the cluster which is dispersed should be the one that
increases the total SSE the least.
Merge two clusters. In ISODATA, the clusters with the closest centroids are chosen, although another, perhaps better, approach is to merge the two clusters that result in the smallest increase in total SSE. These two merging strategies are, respectively, the same strategies that are used in the hierarchical clustering techniques known as the centroid method and Ward's method. Both methods are discussed later in Section 5.7.
Finally, note that while pre- and post-processing techniques can be effective, they are based on
heuristics that may not always work. Furthermore, they often require that the user choose values
for a number of parameters.
Bisecting K-means
The bisecting K-means algorithm is a straightforward extension of the basic K-means algorithm that
is based on a simple idea: To obtain K clusters, split the set of all points into two clusters, select one
of these clusters, and split it, and so on, until K clusters have been produced. In detail, bisecting
K-means works as follows.
Algorithm 3 Bisecting K-means Algorithm.
1: Initialize the list of clusters to contain the cluster containing all points.
2: repeat
3:    Select a cluster from the list of clusters
4:    for i = 1 to number of iterations do
5:       Bisect the selected cluster using basic K-means
6:    end for
7:    Add the two clusters from the bisection with the lowest SSE to the list of clusters.
8: until The list of clusters contains K clusters
There are a number of different ways to choose which cluster is split. For example, we can choose
the largest cluster at each step, the one with the least overall similarity, or use a criterion based on
both size and overall similarity.
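A compact Python sketch of bisecting K-means is given below; it reuses the kmeans function sketched earlier, chooses the cluster with the largest SSE to split, and keeps the best of several trial bisections. These particular choices are ours and are only one way to instantiate the algorithm.

```python
import numpy as np

def cluster_sse(points):
    """SSE of a single cluster around its own mean."""
    return float(((points - points.mean(axis=0)) ** 2).sum())

def bisecting_kmeans(points, k, n_trials=10):
    """Repeatedly bisect the cluster with the largest SSE until K clusters remain."""
    clusters = [points]
    while len(clusters) < k:
        # Select the cluster to split (here: the one with the largest SSE).
        worst = max(range(len(clusters)), key=lambda i: cluster_sse(clusters[i]))
        target = clusters.pop(worst)
        best_pair, best_sse = None, np.inf
        for _ in range(n_trials):
            _, labels = kmeans(target, 2)          # trial bisection with basic K-means
            pair = [target[labels == 0], target[labels == 1]]
            if min(len(c) for c in pair) == 0:
                continue                           # degenerate bisection; try again
            total = sum(cluster_sse(c) for c in pair)
            if total < best_sse:
                best_pair, best_sse = pair, total
        clusters.extend(best_pair)
    return clusters
```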
Bisecting K-means tends to be less susceptible to initialization problems. Partly this is because
bisecting K-means performs several trial bisections and takes the best one in terms of SSE, and
partly this is because there are only two centroids at each step.
We often refine the resulting clusters by using their centroids as the initial centroids for the
basic K-means algorithm. This is necessary, because while the K-means algorithm is guaranteed
to find a clustering that represents a local minimum with respect to SSE, in bisecting K-means we
are using the K-means algorithm locally, i.e., to bisect individual clusters. Thus, the final set of
clusters does not represent a clustering that is a local minimum with respect to SSE.
To illustrate the operation of bisecting K-means and how it is less susceptible to initialization problems, we show, in Figure 5.19, how bisecting K-means finds 10 clusters in the
data set originally shown in Figure 5.16.
Finally, note that by recording the sequence of clusterings produced as K-means bisects clusters,
we can also use bisecting K-means to produce a hierarchical clustering.
[Figure 5.19: bisecting K-means finding 10 clusters, iterations 1–10, panels (a)–(j); plots not reproduced.]
We now examine more formally why the mean is the appropriate centroid when our goal is to minimize the SSE, which we restate here:

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \sum_{j=1}^{n} (m_{ij} - x_j)^2 \qquad (5.14)$$
where $C_i$ is the $i$th cluster, $|C_i|$ is the number of points in the $i$th cluster, $x$ is a point in that cluster, $x_j$ is the $j$th attribute of that point, and $m_{ij}$ is the $j$th attribute (component) of $m_i$, the mean of the $i$th cluster.
We can solve for $m_{pk}$, the $k$th component ($1 \le k \le n$) of the $p$th centroid $m_p$, by differentiating the SSE, setting the derivative to 0, and solving, as indicated below.
$$\frac{\partial}{\partial m_{pk}} \mathrm{SSE}
  = \frac{\partial}{\partial m_{pk}} \sum_{i=1}^{K} \sum_{x \in C_i} \sum_{j=1}^{n} (m_{ij} - x_j)^2
  = \sum_{i=1}^{K} \sum_{x \in C_i} \sum_{j=1}^{n} \frac{\partial}{\partial m_{pk}} (m_{ij} - x_j)^2
  = \sum_{x \in C_p} 2\,(m_{pk} - x_k) = 0$$

$$\sum_{x \in C_p} 2\,(m_{pk} - x_k) = 0 \;\Rightarrow\; |C_p|\, m_{pk} = \sum_{x \in C_p} x_k \;\Rightarrow\; m_{pk} = \frac{1}{|C_p|} \sum_{x \in C_p} x_k$$
Thus, we find that $m_p = \frac{1}{|C_p|}\sum_{x \in C_p} x$, the mean of the points in the cluster. This is pleasing since it agrees with our intuition of what the center of a cluster should be.
However, we can instead ask how we can obtain a partition of the data into K clusters such
that the sum of the distances (not squared distance) of points from the centroid of their clusters is
minimized. In mathematical terms, we now seek to minimize the sum of the absolute error, SAE,
as shown in the following equation.
$$\mathrm{SAE} = \sum_{i=1}^{K} \sum_{x \in C_i} \sqrt{\sum_{j=1}^{n} (m_{ij} - x_j)^2} \qquad (5.15)$$
This problem has been extensively studied in operations research under the name of the multisource Weber problem and has applications to the location of warehouses and other facilities that
need to be centrally located. We note that the optimal center in this case is the geometric median of the points in the cluster, which generally differs from the mean and, unlike the mean, has no simple closed-form expression.
Finally, we can instead ask how we can obtain a partition of the data into K clusters such that the sum of the L1 distances of points from the center of their clusters is minimized. That is, we seek to minimize the sum of the L1 absolute errors, SAE$_1$, as given by the following equation.
$$\mathrm{SAE}_1 = \sum_{i=1}^{K} \sum_{x \in C_i} \sum_{j=1}^{n} |m_{ij} - x_j| \qquad (5.16)$$
We can solve for $m_{pk}$, the $k$th component ($1 \le k \le n$) of the $p$th centroid $m_p$, by differentiating the SAE$_1$, setting the derivative to 0, and solving, as indicated below.
$$\frac{\partial}{\partial m_{pk}} \mathrm{SAE}_1
  = \frac{\partial}{\partial m_{pk}} \sum_{i=1}^{K} \sum_{x \in C_i} \sum_{j=1}^{n} |m_{ij} - x_j|
  = \sum_{i=1}^{K} \sum_{x \in C_i} \sum_{j=1}^{n} \frac{\partial}{\partial m_{pk}} |m_{ij} - x_j|
  = \sum_{x \in C_p} \frac{\partial}{\partial m_{pk}} |m_{pk} - x_k|$$

$$\sum_{x \in C_p} \frac{\partial}{\partial m_{pk}} |m_{pk} - x_k| = \sum_{x \in C_p} \operatorname{sign}(m_{pk} - x_k) = 0$$
If we solve for $m_p$, we find that $m_p = \operatorname{median}\{x \in C_p\}$, the point whose coordinates are the median values of the corresponding coordinates of the points in the cluster. While the median of a group of points is straightforward to compute, this computation is not as efficient as the calculation of a mean.
In the first case we say that we are attempting to minimize the within-cluster squared error, while in the second and third cases we say that we are attempting to minimize the absolute within-cluster error, where the error is measured by the L2 or L1 distance, respectively.
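A quick numerical illustration of the difference, using an arbitrary one-dimensional sample that contains an outlier, is shown below: the mean gives the smaller squared error, while the median gives the smaller absolute error.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # arbitrary 1-D "cluster" with an outlier
for name, center in (("mean", x.mean()), ("median", float(np.median(x)))):
    sse = ((x - center) ** 2).sum()  # squared (L2) error
    sae = np.abs(x - center).sum()   # absolute (L1) error
    print(f"{name:6s} center = {center:5.1f}   SSE = {sse:7.1f}   SAE = {sae:6.1f}")
```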
A related technique, the K-medoid algorithm, uses medoids, i.e., representative data points, as cluster centers. Initial medoids are chosen in a simple way. (Again we illustrate using similarities.) The first initial medoid is chosen by finding the point that has the highest sum of similarities to all other points. Each of the remaining initial medoids is chosen by selecting the non-medoid point which has the highest sum of similarities to all non-medoid points, excluding those points that are more similar to one of the currently chosen initial medoids.
While the K-medoid algorithm is relatively simple, it should be clear that it is expensive compared
to K-means. More recent improvements of the K-medoids algorithm have better efficiency than the
basic algorithm, but are still relatively expensive and will not be discussed here. Relevant references
may be found in the bibliographic remarks.
This approach is the divisive version of the single link agglomerative technique that we will
see shortly.