LESSON 33
Clustering Algorithms
Algorithms for Clustering
Clustering algorithms can be organized into the following taxonomy:

Clustering
    Hard Clustering
        Hierarchical: Divisive, Agglomerative
        Partitional: Squared Error, Others
    Soft Clustering
The soft clustering algorithms are based on fuzzy sets, rough sets, artificial neural nets (ANNs), or evolutionary algorithms, specifically genetic algorithms (GAs).
Hierarchical Algorithms
Hierarchical algorithms produce a nested sequence of partitions of the data, which can be depicted using a tree structure popularly called a dendrogram.
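As an illustration, such a dendrogram can be produced with standard tools. The sketch below assumes SciPy and Matplotlib are available; the small 2-D dataset is made up purely for this example.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Small made-up 2-D dataset, purely for illustration.
data = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.5, 5.0], [9.0, 1.0]])

# linkage() performs agglomerative clustering; method="single" uses the
# single-link (minimum pairwise) distance between clusters.
Z = linkage(data, method="single")

# The dendrogram depicts the nested sequence of partitions as a tree.
dendrogram(Z, labels=["x1", "x2", "x3", "x4", "x5"])
plt.show()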
Hierarchical algorithms are either divisive or agglomerative. In the former, starting with a single cluster containing all the patterns, a cluster is split at each successive step; this process continues until each pattern is in its own cluster, that is, until we have a collection of singleton clusters.
Agglomerative Clustering
1. Compute the proximity matrix containing the distance between each pair of patterns, and treat each pattern as a singleton cluster.
2. Find the closest pair of clusters and merge them. Update the proximity matrix to reflect the merge.
3. If all the patterns are in one cluster, stop; otherwise, go to step 2.
Note that step 1 in the above algorithm requires O(n^2) time to compute the pairwise similarities and O(n^2) space to store them, when there are n patterns in the given collection.
However, these algorithms are not feasible for large data sets because of these non-linear (quadratic) time and space demands.
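A from-scratch sketch of the above steps in Python follows, assuming Euclidean distances and the single-link cluster distance (the minimum pairwise distance between cluster members). The function name single_link_clusters and its stopping rule (stop at k clusters rather than merging down to one) are illustrative choices, not part of the lesson.

import numpy as np

def single_link_clusters(data, k):
    # Step 1: O(n^2) proximity matrix over all pairs of patterns;
    # every pattern starts in its own singleton cluster.
    n = len(data)
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    clusters = [[i] for i in range(n)]
    # Steps 2-3: repeatedly merge the closest pair of clusters
    # (here we stop at k clusters instead of merging down to one).
    while len(clusters) > k:
        best_pair, best_d = (0, 1), float("inf")
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single-link distance: minimum over all cross-cluster pairs.
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_d, best_pair = d, (a, b)
        a, b = best_pair
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

data = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.5, 5.0], [9.0, 1.0]])
print(single_link_clusters(data, 2))   # lists of pattern indices

The nested loops over cluster pairs make the quadratic (and worse) cost discussed above explicit; practical implementations update the proximity matrix incrementally instead of rescanning all pairs.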
Partitional Clustering
Partitional clustering algorithms generate a hard or a soft partition of
the data.
K-Means Algorithm
Example 1
Consider the following ten three-dimensional patterns:
(1, 1, 1)^t (1, 1, 2)^t (1, 3, 2)^t (2, 1, 1)^t (6, 3, 1)^t
(6, 4, 1)^t (6, 6, 6)^t (6, 6, 7)^t (6, 7, 6)^t (7, 7, 7)^t
Considering the first three points as the initial seed points of a 3-partition, the three cluster representatives and the assignment of the remaining seven points to the nearest seed, using the L1 metric as the distance measure, are shown in Table 1.

Table 1: Assignment of the points to the nearest seed (L1 metric)
Cluster1: (1, 1, 1)^t, (2, 1, 1)^t
Cluster2: (1, 1, 2)^t
Cluster3: (1, 3, 2)^t, (6, 3, 1)^t, (6, 4, 1)^t, (6, 6, 6)^t, (6, 6, 7)^t, (6, 7, 6)^t, (7, 7, 7)^t

The centroids of the clusters are (1.5, 1, 1)^t, (1, 1, 2)^t and, to two decimal places, (5.43, 5.14, 4.29)^t.
Subsequent iterations do not lead to any changes in the assignment of points or in the centroids obtained. The K-Means algorithm minimizes the sum of the squared deviations of the patterns in a cluster from the cluster center. If m_i is the centroid of the ith cluster, then the function minimized by the K-Means algorithm is

\sum_{i=1}^{K} \sum_{x \in \text{cluster } i} (x - m_i)^t (x - m_i).
Note that the value of this function is 55.5 for the resulting 3-partition. However, if we choose the initial seeds so that there is one seed from each of the three natural groups in the data (for instance, (1, 1, 1)^t, (6, 3, 1)^t and (6, 6, 6)^t), further iterations do not change the assignment of patterns. The value of the squared-error criterion for this partition is 8. This shows that the K-Means algorithm is not guaranteed to find the globally optimal partition.
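This seed sensitivity can be checked with a small K-Means sketch. The version below assumes Euclidean distances in the assignment step (the worked example above used the L1 metric, so intermediate assignments and the exact squared-error values can differ), and kmeans is a hypothetical helper written only for this illustration.

import numpy as np

def kmeans(data, seeds, max_iter=100):
    # Minimal K-Means: assign to nearest centroid, recompute means, repeat.
    # (Empty-cluster handling is omitted for brevity.)
    centroids = np.asarray(seeds, dtype=float).copy()
    for _ in range(max_iter):
        d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = np.array([data[labels == i].mean(axis=0)
                        for i in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    # Squared-error criterion: sum over clusters of (x - m_i)^t (x - m_i).
    J = sum(((data[labels == i] - centroids[i]) ** 2).sum()
            for i in range(len(centroids)))
    return labels, centroids, J

data = np.array([[1, 1, 1], [1, 1, 2], [1, 3, 2], [2, 1, 1], [6, 3, 1],
                 [6, 4, 1], [6, 6, 6], [6, 6, 7], [6, 7, 6], [7, 7, 7]],
                dtype=float)

# First three points as seeds (as in Example 1) versus one seed per
# natural group: the final squared errors differ.
for seeds in (data[:3], data[[0, 4, 6]]):
    labels, _, J = kmeans(data, seeds)
    print(labels, round(J, 2))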
It is best to use the K-Means algorithm when the clusters are hyperspherical.
It does not generate the intended partition if that partition has non-spherical clusters; in such a case, a pattern may not belong to the same cluster as its nearest centroid.
Even when the clusters are spherical, it may not find the globally optimal partition, as discussed in Example 1.
Several schemes have been used to find the globally optimal partition
corresponding to the minimum value of the squared-error criterion.
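One common scheme, shown here only as an illustration and reusing the kmeans sketch above, is to restart from several random seed sets and keep the partition with the smallest squared error; such restarts reduce, but do not eliminate, the chance of ending in a poor local minimum.

import numpy as np

def kmeans_restarts(data, k, restarts=20, seed=0):
    # Run K-Means from several random seed sets; keep the lowest-J result.
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(restarts):
        seeds = data[rng.choice(len(data), size=k, replace=False)]
        labels, centroids, J = kmeans(data, seeds)  # kmeans() from the sketch above
        if best is None or J < best[2]:
            best = (labels, centroids, J)
    return best  # (labels, centroids, squared-error J)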
Assignment
Let us say that any two points belong to the same cluster if the Euclidean distance between them is less than a threshold of 5 units. Form the clusters for the patterns given below.
Pattern   f1   f2   f3   f4
   1       1    0    0    1
   2       0    1    1    0
   3       0    1    1    0
   4       1    0    0    1
3. Consider the two-dimensional dataset given below.
A = (0.5, 0.5); B = (2, 1.5); C = (2, 0.5); D = (5, 1);
E = (5.75, 1); F = (5, 3); G = (5.5, 3); H = (2, 3).
Obtain 3 clusters using the single-link algorithm.
4. Obtain three clusters of the data given in problem 3 using the K-Means
algorithm.