Introduction
Clustering
Definition of Clusters
Clustering Procedures
Clustering Concepts
Choosing Variables:
Similarity and Dissimilarity Measurement:
Similarity and dissimilarity refer to the likeness of two objects.
A proximity measure can be used to describe either similarity or dissimilarity.
Several techniques are in widespread use for determining the proximity of one object relative to another.
The most intuitive of these is a distance measure.
$D(a, b) = \left( \sum_{i=1}^{k} (a_i - b_i)^2 \right)^{1/2}$
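Reading the distance formula above as the standard Euclidean distance over k variables, it can be sketched in Python:

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two k-dimensional points a and b."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Illustrative: two schools described by (students, teachers)
print(euclidean_distance((100, 10), (103, 14)))  # -> 5.0
```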
Standardization of variables:
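The slide body for this heading is not in the text; a common standardization before clustering (an assumption here, not necessarily the one the slides use) is the z-score transform:

```python
def standardize(values):
    """Z-score standardization: subtract the mean, divide by the std. dev."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

# Standardized values are centered at zero with unit spread
print(standardize([2, 4, 6]))
```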
Weights and Threshold values:
Clustering Algorithms
Partitioning algorithms
Hierarchical algorithms
Density-based algorithms
1. Place k points into the space represented by the objects being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the k centroids.
4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
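The four steps above can be sketched as a minimal k-means in Python (the distance metric and the sample data are illustrative assumptions, not the book's table):

```python
import math
import random

def kmeans(points, k, max_iters=100):
    """Minimal k-means following the four steps above."""
    # Step 1: pick k initial centroids from the objects being clustered.
    centroids = random.sample(points, k)
    for _ in range(max_iters):
        # Step 2: assign each object to the group with the closest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[i].append(p)
        # Step 3: recalculate the positions of the k centroids.
        new_centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        # Step 4: stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, groups

random.seed(0)
points = [(1, 1), (1, 2), (8, 8), (9, 8)]
centroids, groups = kmeans(points, 2)
print(groups)
```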
The table shows five schools with the number of students and teachers, the results of TAS (Texas Academic Scores) exams, and the existence of a PTA (Parents and Teachers Association).
The first operation begins by selecting the initial k means for the k clusters.
Partition Algorithms: k-means Algorithm
Hierarchical Algorithms
The similarities between the newly merged cluster and each of the other clusters must then be recalculated.
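For example, under the single-link criterion the recalculated distance is the minimum over the merged members; complete-link swaps `min` for `max` (a sketch; the helper name and the cluster labels are ours):

```python
def update_distance(dist, merged, other, linkage=min):
    """Distance from a newly merged cluster to another cluster.

    `dist` maps frozenset pairs of cluster labels to distances,
    `merged` is the pair of clusters just merged.
    Single-link uses min; complete-link uses max.
    """
    a, b = merged
    return linkage(dist[frozenset((a, other))], dist[frozenset((b, other))])

# Illustrative distances between clusters C1, C2, C3
dist = {frozenset(("C1", "C3")): 4.0, frozenset(("C2", "C3")): 7.0}
print(update_distance(dist, ("C1", "C2"), "C3"))       # single-link -> 4.0
print(update_distance(dist, ("C1", "C2"), "C3", max))  # complete-link -> 7.0
```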
Single-link Method
Because C3, C4, and C5 have the shortest distances, they are merged.
After the first merger, we still have six clusters. Because the number of clusters is not one, we repeat step 4 until only one cluster remains.
It is easy to see that the shortest distances involve two pairs of clusters, (C1, C2) and (C345, C6).
When C12 and C3456 are merged to form C123456, the only other cluster left to be merged is C78.
Merging it with C123456, we finally have a single cluster that contains all the objects.
Complete-link Method
Group-average Method
Using the same table used for the single-link and the complete-link methods, we can illustrate this method as follows.
Because the first three steps are the same as in the two previous methods, we omit them here.
D(C1, C345) = 10 and D(C2, C345) = 7. Since the merging criterion of the group-average method is to use the mean of the individual distances, the mean distance is (10 + 7)/2 = 8.5.
The next clusters to be merged are C345 and C6 (Table 7.13).
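The mean-distance computation above can be written out directly (the helper name is ours; the values are those in the text):

```python
def group_average_distance(distances):
    """Group-average linkage: mean of the pairwise inter-cluster distances."""
    return sum(distances) / len(distances)

# D(C1, C345) = 10 and D(C2, C345) = 7, as in the text
print(group_average_distance([10, 7]))  # -> 8.5
```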
From Table 7.13, it is easy to see which clusters are merged next. Continuing until only one cluster is left, we get the final result shown in Figure 7.3.
Centroid Method
From Table 7.16, we can easily see that the clusters to be merged are C4 and C5. The intermediate matrices are omitted; the final result is shown in Figure 7.5.
If we set the critical value to be less than or equal to 5, as indicated by the dotted line in the figure, we have five clusters.
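Under the centroid method, the distance between two clusters is the distance between their centroids; a minimal sketch (the example coordinates are assumptions, not the values of Table 7.16):

```python
import math

def centroid(cluster):
    """Mean point of a cluster of k-dimensional points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def centroid_distance(c1, c2):
    """Centroid-method distance: Euclidean distance between centroids."""
    return math.dist(centroid(c1), centroid(c2))

c4 = [(2.0, 3.0), (4.0, 5.0)]   # hypothetical member points
c5 = [(3.0, 4.0)]
print(centroid_distance(c4, c5))  # -> 0.0, since the centroid of c4 is (3, 4)
```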
Density-search Algorithms
The goal of this algorithm is to search for the sparse spaces, the regions with low density, and separate them from the high-density regions.
$S_{ij} = \dfrac{\sum_{k=1}^{p} W_{ijk}\, S_{ijk}}{\sum_{k=1}^{p} W_{ijk}}$
$S_{ijk} = 1 - \dfrac{|X_{ik} - X_{jk}|}{R_k}$, where $R_k$ is the range of the $k$th variable.
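The two similarity measures above can be read as Gower-style formulas (an interpretation, since the slides do not spell out every symbol) and sketched together:

```python
def component_similarity(xi, xj, rk):
    """S_ijk = 1 - |X_ik - X_jk| / R_k, where R_k is the variable's range."""
    return 1 - abs(xi - xj) / rk

def overall_similarity(x_i, x_j, ranges, weights=None):
    """S_ij: weighted average of the per-variable similarities."""
    p = len(x_i)
    if weights is None:
        weights = [1] * p
    num = sum(w * component_similarity(a, b, r)
              for w, a, b, r in zip(weights, x_i, x_j, ranges))
    return num / sum(weights)

# Illustrative: two schools compared on (students, teachers),
# with ranges taken over all schools
print(overall_similarity((100, 10), (120, 12), ranges=(40, 8)))  # -> 0.625
```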
To search for the sparse regions that should be separated from the high-density regions, the algorithm must first define how far apart two points can be while still being considered part of a sparse region.
3. Identify a point that has the nearest similarity measure with the
points that are already in the cluster. Add the point to the cluster.
4. Adjust the similarity measure of the cluster by averaging and
determine the discontinuity.
5. Compare the discontinuity against a threshold value to determine whether the point should be placed into the cluster. If it is not allowed into the cluster, initialize a new cluster.
6. Repeat steps 3-5 until all points are examined and classified into
the clusters.
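Steps 3-6 above can be sketched as a simple sequential procedure. This is one possible reading of the algorithm, not a definitive implementation: the similarity function, the threshold, and the seed pair are placeholders supplied by the caller.

```python
def density_search(points, similarity, threshold, seed_pair):
    """Grow clusters by adding the most similar unassigned point;
    start a new cluster when similarity drops below the threshold."""
    clusters = [list(seed_pair)]                  # steps 1-2: initial cluster
    remaining = [p for p in points if p not in seed_pair]
    while remaining:
        cluster = clusters[-1]
        # Step 3: find the point nearest in similarity to the cluster.
        # Step 4: cluster similarity taken as the average over members.
        def avg_sim(p):
            return sum(similarity(p, q) for q in cluster) / len(cluster)
        best = max(remaining, key=avg_sim)
        # Step 5: test against the threshold; otherwise start a new cluster.
        if avg_sim(best) >= threshold:
            cluster.append(best)
        else:
            clusters.append([best])
        remaining.remove(best)                    # step 6: repeat until done
    return clusters

# Illustrative 1-D data with a toy similarity measure
sim = lambda a, b: 1 - abs(a - b) / 10
print(density_search([1, 2, 8, 9], sim, threshold=0.7, seed_pair=(1, 2)))
```

The sparse gap between 2 and 8 falls below the threshold, so the points split into two clusters.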
The school database in Table 7.21 is chosen to illustrate how the density-search algorithm produces the clusters.
In step 2, we form an initial cluster with schools A and E because they show the nearest similarity measure (0.857) of all the comparisons.
During step 3, we find three identical similarity measures (0.714) that are the second nearest to 0.857.
They are the measures between schools (B, C), (A, D), and (C, E).