
2009 Fourth International Conference on Innovative Computing, Information and Control

A Hierarchical Clustering Algorithm based on K-means with Constraints*

GuoYan Hang, DongMei Zhang, JiaDong Ren
College of Information Science and Engineering,
Yanshan University,
Qinhuangdao, China
hgy@ysu.edu.cn

JiaDong Ren, ChangZhen Hu
School of Computer Science and Technology,
Beijing Institute of Technology,
Beijing, China
jdren@ysu.edu.cn

Abstract—Hierarchical clustering is one of the most important tasks in data mining. However, the existing hierarchical clustering algorithms are time-consuming and have low clustering quality because they ignore the existing constraints. In this paper, a Hierarchical Clustering Algorithm based on K-means with Constraints (HCAKC) is proposed. In HCAKC, in order to improve the clustering efficiency, the Improved Silhouette is defined to determine the optimal number of clusters. In addition, to improve the hierarchical clustering quality, the existing pairwise must-link and cannot-link constraints are adopted to update the cohesion matrix between clusters. A penalty factor is introduced to modify the similarity metric to address constraint violation. The experimental results show that HCAKC has lower computational complexity and better clustering quality compared with the existing algorithm CSM.

Keywords- Hierarchical clustering; Improved Silhouette; K-means; constraints
I. INTRODUCTION
Clustering is an important analysis tool in many fields, such as pattern recognition, image classification, biological sciences, marketing, city planning, document retrieval, etc. Hierarchical clustering is one of the most widely used clustering methods.

At present, several existing clustering algorithms focus on combining the advantages of hierarchical and partitioning clustering algorithms [1-2]. K-means, one of the representative partitioning methods, obtains the clusters by minimizing an objective function. K-means has higher efficiency compared with the hierarchical methods. However, the number of clusters K needs to be fixed iteratively. Thus, K-means is often required to be run many times and is computationally expensive. How to determine the number of clusters therefore becomes an increasingly important problem. The common trial-and-error method [3] generally depends on the particular clustering algorithm and is inefficient when the dataset is large.

Besides, the existing algorithms that combine hierarchical clustering and K-means [4] ignore the existing constraints [5]. Clustering is traditionally considered an unsupervised method for data analysis. However, in some cases background knowledge is available in addition to the data instances, typically in the form of pairwise constraints (must-link and cannot-link). Clustering quality can be improved by utilizing these constraints. Carlos Ruiz enhanced the density-based algorithm DBSCAN with constraints upon data points to obtain the new algorithm C-DBSCAN [6]. C-DBSCAN has superior performance to DBSCAN even with a small number of constraints. However, the efficiency of C-DBSCAN is not good. How the K-means clustering algorithm can be profitably modified to make use of constraints is demonstrated in cop-kmeans [7]. Although the clustering accuracy of cop-kmeans is improved, the constraint violation problem has not been well addressed. I. Davidson [8] incorporated pairwise constraints into agglomerative hierarchical clustering to improve the clustering quality. However, the problem of constraint violation is still not solved.

In this paper, we propose HCAKC, a new method for hierarchical clustering based on K-means with existing pairwise constraints. In HCAKC, the Improved Silhouette and CUCMC (Constraint-based Update of Cohesion Matrix between Clusters) are defined. The optimal number of clusters is determined by computing the average Improved Silhouette of the dataset, so that the time complexity can be reduced. The initial clusters of HCAKC are obtained by running K-means. In order to improve the quality of the hierarchical clustering, the existing pairwise must-link and cannot-link constraints are incorporated into the agglomerative hierarchical clustering. CUCMC is performed based on the existing constraints, and a penalty factor [9] is introduced into the cohesion [2] similarity metric to address constraint violation.

This paper is organized as follows: In section II, we give the basic concepts and definitions. In section III, we present our hierarchical clustering algorithm called HCAKC. Section IV shows the experimental results of the clustering algorithm. Finally, we conclude the paper in section V.

* This work is supported by the National High Technology Research and Development Program ("863" Program) of China (No. 2009AA01Z433) and the Natural Science Foundation of Hebei Province P.R. China (No. F2008000888).

II. BASIC CONCEPTS AND DEFINITIONS

A silhouette [4] is a function that measures the similarity of an object with the objects of its own cluster compared with the objects of other clusters.

For a cluster C consisting of data points p1, p2, ..., pn, the radius r of C is defined as formula (1), where c is the centroid of C and d(pi, c) is the Euclidean distance between pi and c.

r = \left( \frac{1}{n} \sum_{i=1}^{n} d(p_i, c)^2 \right)^{1/2}    (1)

join(p_i, c_j) = \exp\left( -\left| d(p_i, c_i) - d(p_i, c_j) \right| / r_i \right)    (2)

where join(pi, cj) is the intention of pi to be joined into Cj. The cohesion [2] of Ci and Cj is calculated as formula (3).

Chs(C_i, C_j) = \frac{ \sum_{p \in C_i} join(p, C_j) + \sum_{p \in C_j} join(p, C_i) }{ |C_i| + |C_j| }    (3)
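To make formulas (1)-(3) concrete, the following self-contained C++ sketch computes the radius, the join intention and the cohesion of two small clusters. It is a simplified reading of the formulas, not the CSM or HCAKC implementation; the two example clusters and the 2-dimensional point type are assumptions made for illustration.

#include <cmath>
#include <cstdio>
#include <vector>

struct Point { double x, y; };

// Euclidean distance between two points.
static double dist(const Point& a, const Point& b) {
    return std::sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y));
}

// Centroid of a cluster (the point c in formula (1)).
static Point centroid(const std::vector<Point>& c) {
    Point m{0.0, 0.0};
    for (const Point& p : c) { m.x += p.x; m.y += p.y; }
    m.x /= c.size(); m.y /= c.size();
    return m;
}

// Radius of a cluster, formula (1): r = sqrt( (1/n) * sum d(p_i, c)^2 ).
static double radius(const std::vector<Point>& c) {
    Point ctr = centroid(c);
    double s = 0.0;
    for (const Point& p : c) { double d = dist(p, ctr); s += d * d; }
    return std::sqrt(s / c.size());
}

// Join intention of a point p of cluster Ci towards cluster Cj,
// formula (2): join(p, c_j) = exp(-|d(p, c_i) - d(p, c_j)| / r_i).
static double join(const Point& p, const std::vector<Point>& ci,
                   const std::vector<Point>& cj) {
    double di = dist(p, centroid(ci));
    double dj = dist(p, centroid(cj));
    return std::exp(-std::fabs(di - dj) / radius(ci));
}

// Cohesion of two clusters, formula (3).
static double cohesion(const std::vector<Point>& ci, const std::vector<Point>& cj) {
    double s = 0.0;
    for (const Point& p : ci) s += join(p, ci, cj);
    for (const Point& p : cj) s += join(p, cj, ci);
    return s / (ci.size() + cj.size());
}

int main() {
    // Two small example clusters (coordinates assumed for illustration).
    std::vector<Point> c1 = {{1, 0}, {1, 2}, {2, 1}};
    std::vector<Point> c2 = {{5, 5}, {4, 5}, {4, 4}};
    std::printf("Chs(C1, C2) = %.4f\n", cohesion(c1, c2));
    return 0;
}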
Definition 1 IS (Improved Silhouette)

Let S be a dataset consisting of clusters C1, C2, ..., Ct. The distance between each object oi (oi ∈ Cj, j ∈ [1, t]) and the centroid of its own cluster is denoted as ai, and bi is the minimum distance between oi and the centroids of the other t-1 clusters. IS(oi) is defined as formula (4).

IS(o_i) = (b_i - a_i) / \max(a_i, b_i)    (4)

In formula (4), the meanings of ai and bi have changed compared with the traditional silhouette computation: both ai and bi denote distances to cluster centroids.

The average IS of the dataset corresponding to each different partition is calculated. The maximal average IS of the dataset corresponds to the optimal partition of the dataset.

We take point A in Fig. 1 as an example to show the IS computation of a data point.

Figure 1. The IS computation of a data point (point A = (1, 0) in cluster C1, with clusters C2 and C3, plotted in the x-y plane).

(1) Obtain the centroids of clusters C1, C2, C3 respectively: Centroid1 = (1.4, 1.2), Centroid2 = (4.4, 4.8), Centroid3 = (6.5, 0.8333);

(2) Calculate aA, the distance between A and the centroid of its own cluster: aA = \sqrt{(1-1.4)^2 + (0-1.2)^2} = 1.2649. The distances between A and the centroids of C2 and C3 can be obtained similarly; they are 5.8822 and 5.5628 respectively. Since bA denotes the minimum of these distances according to the definition of IS, let bA = 5.5628;

(3) The IS of A can be obtained based on formula (4): IS(A) = (bA - aA) / max(aA, bA) = 0.7726.
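The worked example can be reproduced directly in code. The following C++ sketch evaluates formula (4) for point A = (1, 0) with the three centroids listed above; it is a minimal illustration under the assumption of 2-dimensional points, not the paper's implementation.

#include <algorithm>
#include <cmath>
#include <cstdio>

struct Point { double x, y; };

static double dist(const Point& a, const Point& b) {
    return std::sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y));
}

// Improved Silhouette of a point, formula (4): IS(o) = (b - a) / max(a, b),
// where a is the distance to the centroid of o's own cluster and
// b is the minimum distance to the centroids of the other clusters.
static double improvedSilhouette(const Point& o, const Point& own,
                                 const Point* others, int numOthers) {
    double a = dist(o, own);
    double b = dist(o, others[0]);
    for (int i = 1; i < numOthers; ++i) b = std::min(b, dist(o, others[i]));
    return (b - a) / std::max(a, b);
}

int main() {
    Point A{1.0, 0.0};                              // point A from Fig. 1
    Point c1{1.4, 1.2};                             // centroid of A's own cluster C1
    Point others[] = {{4.4, 4.8}, {6.5, 0.8333}};   // centroids of C2 and C3
    // aA = 1.2649, bA = min(5.8822, 5.5628) = 5.5628, so IS(A) is about 0.7726.
    std::printf("IS(A) = %.4f\n", improvedSilhouette(A, c1, others, 2));
    return 0;
}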
Definition 2 CUCMC (Constraint-based Update of Cohesion Matrix between Clusters)

Suppose that \{C_k\}_{k=1}^{n} is the set of given clusters and X = [Chs(C_s, C_t)]_{n \times n} is the existing cohesion matrix between any two clusters. Let M = {(Ci, Cj)} be the set of must-link constraints, indicating that clusters Ci and Cj should be in the same class, and C = {(Ci, Cj)} be the set of cannot-link constraints, indicating that Ci and Cj should be in different classes. For clusters Cp, Cq, Cr (p, q, r ∈ [1, n]), in order to satisfy M, Chs(Cp, Cq) in X is updated to 1, and Chs(Cp, Cr) and Chs(Cq, Cr) are updated to max(Chs(Cp, Cr), Chs(Cq, Cr)). To satisfy C, Chs(Cp, Cq) in X is updated to 0.

In Fig. 2, we give an example to show the process of CUCMC, where the must-link constraint M(C1, C2) is known. From Fig. 2, we obtain Chs(C1, C2) = 0.4, Chs(C1, C3) = 0.2 and Chs(C2, C3) = 0.1 without considering the existing constraint. Since C1 and C2 need to satisfy M(C1, C2), Chs(C1, C2) in X, the cohesion matrix, is updated to 1, and both Chs(C1, C3) and Chs(C2, C3) are updated to max(0.2, 0.1) = 0.2.

The CUCMC of the example is as follows:

X = \begin{bmatrix} 1 & 0.4 & 0.2 \\ & 1 & 0.1 \\ & & 1 \end{bmatrix} \rightarrow X = \begin{bmatrix} 1 & 1 & 0.2 \\ & 1 & 0.2 \\ & & 1 \end{bmatrix}

Figure 2. The update of the cohesion matrix between clusters with constraints.

In general, the existing constraints are given in the form m = {(xi, xj)} and c = {(xi, xj)}, where m indicates that points xi and xj should be in the same cluster and c indicates that points xi and xj should be in different clusters. The cluster-level sets M = {(Ci, Cj)} and C = {(Ci, Cj)} can be obtained by propagating these point-level constraints.

Penalty factors w_{\neq M} and w_{\neq C} are introduced in order to address constraint violation.

Sim(C_i, C_j) = Chs(C_i, C_j) - w_{\neq M(C_i, C_j)} - w_{\neq C(C_i, C_j)}    (5)

w_{\neq M(C_i, C_j)} takes effect when must-link (Ci, Cj) is violated, and w_{\neq C(C_i, C_j)} takes effect when cannot-link (Ci, Cj) is violated.
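A minimal C++ sketch of the CUCMC update of Definition 2 and of the penalty-adjusted similarity of formula (5) is given below. It reproduces the three-cluster example of Fig. 2 under the must-link constraint M(C1, C2); the symmetric matrix representation and the concrete penalty weights are assumptions made for illustration, not values from the paper.

#include <algorithm>
#include <cstdio>
#include <vector>

// Cohesion matrix stored as a full symmetric matrix; chs[i][j] = Chs(Ci, Cj).
typedef std::vector<std::vector<double> > Matrix;

// Must-link update of Definition 2: Chs(Cp, Cq) := 1, and for every other
// cluster Cr, Chs(Cp, Cr) and Chs(Cq, Cr) := max(Chs(Cp, Cr), Chs(Cq, Cr)).
static void applyMustLink(Matrix& chs, int p, int q) {
    int n = (int)chs.size();
    chs[p][q] = chs[q][p] = 1.0;
    for (int r = 0; r < n; ++r) {
        if (r == p || r == q) continue;
        double m = std::max(chs[p][r], chs[q][r]);
        chs[p][r] = chs[r][p] = m;
        chs[q][r] = chs[r][q] = m;
    }
}

// Cannot-link update of Definition 2: Chs(Cp, Cq) := 0.
static void applyCannotLink(Matrix& chs, int p, int q) {
    chs[p][q] = chs[q][p] = 0.0;
}

// Penalty-adjusted similarity, formula (5): each penalty term is subtracted
// only when the corresponding constraint on (Ci, Cj) is violated.
static double sim(const Matrix& chs, int i, int j,
                  bool violatesMustLink, bool violatesCannotLink,
                  double wM, double wC) {
    double s = chs[i][j];
    if (violatesMustLink)   s -= wM;
    if (violatesCannotLink) s -= wC;
    return s;
}

int main() {
    // Cohesion matrix of the Fig. 2 example (clusters C1, C2, C3).
    Matrix chs = {{1.0, 0.4, 0.2},
                  {0.4, 1.0, 0.1},
                  {0.2, 0.1, 1.0}};
    applyMustLink(chs, 0, 1);   // must-link constraint M(C1, C2)
    // Expected result: Chs(C1,C2) = 1, Chs(C1,C3) = Chs(C2,C3) = 0.2.
    for (int i = 0; i < 3; ++i)
        std::printf("%.1f %.1f %.1f\n", chs[i][0], chs[i][1], chs[i][2]);
    // Example penalty weights (assumed values, only for illustration).
    std::printf("Sim(C1,C3) with a violated cannot-link = %.2f\n",
                sim(chs, 0, 2, false, true, 0.3, 0.3));
    return 0;
}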

III. HIERARCHICAL CLUSTERING ALGORITHM BASED ON K-MEANS WITH CONSTRAINTS

CSM [2] needs K to be specified, and different values of K lead to different clustering results. Thus, how to determine an appropriate K becomes especially important. Besides, the existing constraints are not considered in CSM, so the accuracy of the clustering results will not be high.

In HCAKC, we plot the curve of the average IS of the dataset to be clustered against the number of partitions. The optimal number of clusters is determined by the maximum of the curve, since the average IS of a dataset reflects not only the density of the clusters but also the dissimilarity between clusters. The cohesion matrix X is constructed according to the cohesion between any two clusters. The existing pairwise constraints are incorporated into the hierarchical clustering, and CUCMC is implemented based on these constraints. Thus, the clustering results are greatly optimized. In our algorithms, S is the dataset to be clustered; K is the optimal number of clusters; n is the size of S; m is the number of sub-clusters; M = {(Ci, Cj)}, (i, j ∈ [1, t]) is the set of existing must-link constraints; C = {(Ci, Cj)}, (i, j ∈ [1, t]) is the set of existing cannot-link constraints.

Algorithm Find-K
Input: S
Output: K
begin
1: partition S into t clusters C1, C2, ..., Ct according to the geometric distribution of S;
2: repeat{
3:   for (i = 1; i <= t; i++)
4:     for (each object x in Ci)
5:       calculate ISi(x), the Improved Silhouette of x;
6:   calculate \overline{IS}_t, the average Improved Silhouette of S, where \overline{IS}_t = \frac{1}{n} \sum_{i=1}^{t} \sum_{x \in C_i} IS_i(x);
7:   plot the point (t, \overline{IS}_t) on the curve in the 2-dimensional coordinate system;
8:   t := t + 1;
9: } until (the curve reaches its maximum)
10: K := t;
end

In algorithm Find-K, dataset S is first partitioned into t clusters C1, C2, ..., Ct according to the geometric distribution of S. IS is introduced into algorithm Find-K: the closer the Improved Silhouette of a cluster is to 1, the more similar the objects of that cluster are. The curve of the average IS against the cluster number t is plotted, and the number of clusters corresponding to the maximum of the curve is the optimal number of clusters.
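To illustrate Find-K, the following self-contained C++ sketch partitions a toy 2-dimensional dataset with a small K-means for increasing t and keeps the partition number with the largest average IS. The toy data, the seeding rule and the fixed iteration count are assumptions made for illustration; the algorithm above stops as soon as the IS curve reaches its maximum, whereas the sketch simply scans a small range of t.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct Point { double x, y; };

static double dist(const Point& a, const Point& b) {
    return std::sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y));
}

// Small Lloyd-style K-means used only to obtain a partition into t clusters.
static std::vector<int> kmeans(const std::vector<Point>& pts, int t,
                               std::vector<Point>& ctr) {
    ctr.clear();
    for (int k = 0; k < t; ++k)                       // spread the seeds over the data
        ctr.push_back(pts[(size_t)k * pts.size() / t]);
    std::vector<int> label(pts.size(), 0);
    for (int iter = 0; iter < 50; ++iter) {
        for (size_t i = 0; i < pts.size(); ++i) {     // assignment step
            int best = 0;
            for (int k = 1; k < t; ++k)
                if (dist(pts[i], ctr[k]) < dist(pts[i], ctr[best])) best = k;
            label[i] = best;
        }
        std::vector<Point> sum(t, Point{0.0, 0.0});   // centroid update step
        std::vector<int> cnt(t, 0);
        for (size_t i = 0; i < pts.size(); ++i) {
            sum[label[i]].x += pts[i].x;
            sum[label[i]].y += pts[i].y;
            ++cnt[label[i]];
        }
        for (int k = 0; k < t; ++k)
            if (cnt[k] > 0) {
                ctr[k].x = sum[k].x / cnt[k];
                ctr[k].y = sum[k].y / cnt[k];
            }
    }
    return label;
}

// Average Improved Silhouette of the dataset for a partition into t clusters.
static double averageIS(const std::vector<Point>& pts, const std::vector<int>& label,
                        const std::vector<Point>& ctr) {
    double total = 0.0;
    for (size_t i = 0; i < pts.size(); ++i) {
        double a = dist(pts[i], ctr[label[i]]);       // distance to own centroid
        double b = 1e300;                             // minimum distance to other centroids
        for (size_t k = 0; k < ctr.size(); ++k)
            if ((int)k != label[i]) b = std::min(b, dist(pts[i], ctr[k]));
        total += (b - a) / std::max(a, b);            // formula (4)
    }
    return total / pts.size();
}

int main() {
    // Toy 2-D dataset with three visible groups (coordinates assumed).
    std::vector<Point> pts = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {8, 9}, {9, 8},
                              {1, 8}, {2, 8}, {1, 9}};
    // Find-K: increase t and keep the t with the largest average IS.
    int bestT = 2; double bestIS = -1.0;
    for (int t = 2; t <= 5; ++t) {
        std::vector<Point> ctr;
        std::vector<int> label = kmeans(pts, t, ctr);
        double is = averageIS(pts, label, ctr);
        std::printf("t = %d  average IS = %.4f\n", t, is);
        if (is > bestIS) { bestIS = is; bestT = t; }
    }
    std::printf("K = %d\n", bestT);
    return 0;
}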
Algorithm HCAKC
Input: S, n, m, M, C
Output: the K clusters
begin
1: call algorithm Find-K;
2: initially select t points as the centroids of sub-clusters arbitrarily (t >> K);
3: repeat{
4:   for (each point x in S)
5:     assign x to the closest sub-cluster based on the distance to its centroid;
6:   update the centroid of each sub-cluster;
7: } until (no points change between the t sub-clusters)  // utilize K-means on S, where K equals t
8: compute the cohesion matrix X between the t sub-clusters;
9: if ((Ci, Cj) ∈ M or (Ci, Cj) ∈ C)
10:   implement CUCMC;
11: if ((Ci, Cj) violates M (or C))
12:   enforce w_{\neq M} (or w_{\neq C}) on the cohesion matrix;
13: do{ extract the maximal Chs(Ci, Cj);
14:   if (Ci and Cj do not belong to the same sub-cluster)
15:     merge the two sub-clusters to which they belong into a new sub-cluster;
16:   t := t - 1; } while (t > K)
end

In HCAKC, Find-K is first run in order to determine the optimal cluster number K of the dataset to be clustered. Then K-means is adopted to form t clusters initially, where t is larger than K. The cohesion matrix X between the t clusters is obtained based on formula (3). Afterwards, the existing constraint sets M = {(Ci, Cj)} and C = {(Ci, Cj)} are considered to implement CUCMC. The penalty factor is introduced to address constraint violation: when must-link (Ci, Cj) is violated, w_{\neq M(C_i, C_j)} is enforced on the similarity metric according to formula (5), and the entry in row i, column j of X is set to Sim(Ci, Cj).
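The merge phase of HCAKC (steps 13-16) can be sketched as follows: the pair of sub-clusters with the largest constraint-adjusted similarity is extracted repeatedly, and the corresponding groups are merged until only K remain. The union-find bookkeeping and the example similarity matrix below are assumptions made for illustration and are not taken from the paper.

#include <cstdio>
#include <vector>

// Minimal union-find used to track which initial sub-clusters have been merged.
struct UnionFind {
    std::vector<int> parent;
    explicit UnionFind(int n) : parent(n) { for (int i = 0; i < n; ++i) parent[i] = i; }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    bool merge(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return false;
        parent[b] = a;
        return true;
    }
};

// Agglomerate t initial sub-clusters down to K groups by repeatedly extracting
// the pair with the maximal (constraint-adjusted) similarity, in the spirit of
// HCAKC steps 13-16. `sim` is the cohesion matrix after CUCMC and penalties.
static std::vector<int> agglomerate(const std::vector<std::vector<double> >& sim, int K) {
    int t = (int)sim.size();
    UnionFind uf(t);
    while (t > K) {
        int bi = -1, bj = -1;
        double best = -1e300;
        for (int i = 0; i < (int)sim.size(); ++i)
            for (int j = i + 1; j < (int)sim.size(); ++j)
                if (uf.find(i) != uf.find(j) && sim[i][j] > best) {
                    best = sim[i][j]; bi = i; bj = j;
                }
        if (bi < 0) break;            // no mergeable pair left
        uf.merge(bi, bj);             // merge the two sub-clusters
        --t;
    }
    std::vector<int> label(sim.size());
    for (int i = 0; i < (int)sim.size(); ++i) label[i] = uf.find(i);
    return label;
}

int main() {
    // Example similarity matrix for 4 sub-clusters (values assumed); the entry
    // for (0,1) has already been raised to 1 by a must-link constraint.
    std::vector<std::vector<double> > sim = {
        {1.0, 1.0, 0.2, 0.1},
        {1.0, 1.0, 0.3, 0.1},
        {0.2, 0.3, 1.0, 0.6},
        {0.1, 0.1, 0.6, 1.0}};
    std::vector<int> label = agglomerate(sim, 2);  // reduce 4 sub-clusters to K = 2
    for (size_t i = 0; i < label.size(); ++i)
        std::printf("sub-cluster %d -> group %d\n", (int)i, label[i]);
    return 0;
}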

IV. EXPERIMENTAL RESULTS

All of our experiments have been conducted on a computer with a 2.4 GHz Intel CPU and 512 MB main memory, running Microsoft Windows XP. HCAKC is compared with CSM to evaluate the clustering quality and the time performance of HCAKC. The algorithms are all implemented in Microsoft Visual C++ 6.0.

We performed our experiments on the UCI datasets Ionosphere, iris, breast-cancer, credit-g and page-blocks. The must-link and cannot-link constraints are generated artificially with the same method as [7]. The details of the datasets are shown in Table 1. For instance, D1 is the Ionosphere dataset consisting of 355 instances from two clusters. Accuracy [7], one of the clustering quality measures, is computed to compare the clustering results of HCAKC and CSM. We averaged the measures over 100 trials on each dataset. Fig. 3 and Fig. 4 show the experimental results comparing HCAKC with CSM.

HCAKC and CSM are run on D1 (i.e. the Ionosphere dataset) with constraints respectively. Fig. 3 shows the accuracy results comparing HCAKC with CSM on the Ionosphere dataset. From Fig. 3, we can conclude that CSM has lower accuracy than HCAKC across varying numbers of constraints.

In CSM, the constraints are not considered. In HCAKC, the constraints are incorporated into the hierarchical clustering to update the cohesion matrix, and constraint violation is addressed as well. Thus, HCAKC is better in terms of clustering accuracy.

The experiments have also been conducted on the datasets iris, breast-cancer, credit-g and page-blocks respectively to compare the time efficiency of HCAKC with that of CSM. From Fig. 4, we can conclude that HCAKC is better than CSM in CPU running time on the different datasets.

The cluster number K needs to be specified as a parameter before the CSM algorithm runs. The time cost of this parameter setting is expensive, since K-means needs to be run iteratively. HCAKC finds the optimal K by computing the average IS of the points in the dataset, and the time cost of this process is insignificant. The time advantage of HCAKC is obvious even when the scale of the dataset is large.

TABLE 1. PARAMETERS OF THE TESTING DATASETS

Dataset | Name          | Size | Clusters
D1      | Ionosphere    | 355  | 2
D2      | Iris          | 150  | 3
D3      | breast-cancer | 277  | 2
D4      | credit-g      | 1000 | 2
D5      | page-blocks   | 5473 | 5

Figure 3. HCAKC and CSM comparison in terms of accuracy (accuracy versus constraints ratio/size (%) on the Ionosphere dataset).

Figure 4. HCAKC and CSM comparison in terms of running time (running time on datasets D2-D5).

V. CONCLUSION

In order to improve the time efficiency and clustering quality of CSM, a new method named HCAKC is proposed in this paper. In our proposed algorithm, the curve of the average IS of the dataset against the partition number is plotted, and the optimal number of clusters is determined by locating the maximum of this curve. As a result, the complexity of determining the number of clusters is reduced. Thereafter, the existing constraints are incorporated to complete CUCMC during the hierarchical clustering process, and the penalty factor is introduced to address constraint violation. Hence, the clustering quality is improved. The experimental results demonstrate that the HCAKC algorithm is effective in reducing the time complexity and increasing the clustering quality.

REFERENCES

[1] L. Sun, T. C. Lin, H. C. Huang, B. Y. Liao, and J. S. Pan, "An optimized approach on applying genetic algorithm to adaptive cluster validity index", 3rd International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Kaohsiung, Taiwan, Nov. 2007, vol. 2, pp. 582-585.
[2] C. R. Lin and M. S. Chen, "Combining partitional and hierarchical algorithms for robust and efficient data clustering with cohesion self-merging", IEEE Transactions on Knowledge and Data Engineering, 2005, 17(2), pp. 145-159.
[3] H. J. Sun, S. R. Wang and Q. S. Jiang, "FCM-based model selection algorithms for determining the number of clusters", Pattern Recognition, 2004, vol. 37(10), pp. 2027-2037.
[4] S. Lamrous and M. Taileb, "Divisive hierarchical K-means", CIMCA 2006: International Conference on Computational Intelligence for Modelling, Control and Automation, jointly with IAWTIC 2006: International Conference on Intelligent Agents, Web Technologies and Internet Commerce, Sydney, NSW, Australia, 2006, pp. 18-23.
[5] S. C. Chu, J. F. Roddick, C. J. Su and J. S. Pan, "Constrained ant colony optimization for data clustering", 8th Pacific Rim International Conference on Artificial Intelligence, PRICAI 2004: Trends in Artificial Intelligence, Auckland, New Zealand, 2004, vol. 3157, pp. 534-543.
[6] C. Ruiz, M. Spiliopoulou and E. Menasalvas, "C-DBSCAN: Density-based clustering with constraints", 11th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, Toronto, Canada, 2007, pp. 216-223.
[7] K. Wagstaff, C. Cardie, S. Rogers and S. Schroedl, "Constrained K-means clustering with background knowledge", Proceedings of the 18th International Conference on Machine Learning, 2001, pp. 577-584.
[8] I. Davidson and S. S. Ravi, "Agglomerative hierarchical clustering with constraints: Theoretical and empirical results", 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, 2005, pp. 59-70.
[9] M. Bilenko, S. Basu and R. J. Mooney, "Integrating constraints and metric learning in semi-supervised clustering", Proceedings of the 21st International Conference on Machine Learning, New York, ACM Press, 2004, pp. 81-88.

