
Incremental Clustering and Dynamic Information Retrieval

Moses Charikar    Chandra Chekuri    Tomás Feder    Rajeev Motwani

Moses Charikar: Department of Computer Science, Stanford University, Stanford, CA 94305-9045. E-mail: moses@cs.stanford.edu. Supported by NSF Award CCR-9357849, with matching funds from IBM, Mitsubishi, Schlumberger Foundation, Shell Foundation, and Xerox Corporation.
Chandra Chekuri: Department of Computer Science, Stanford University, Stanford, CA 94305-9045. E-mail: chekuri@cs.stanford.edu. Supported by NSF Award CCR-9357849, with matching funds from IBM, Mitsubishi, Schlumberger Foundation, Shell Foundation, and Xerox Corporation.
Tomás Feder: E-mail: tomas@theory.stanford.edu.
Rajeev Motwani: Department of Computer Science, Stanford University, Stanford, CA 94305-9045. E-mail: rajeev@cs.stanford.edu. Supported by an Alfred P. Sloan Research Fellowship, an IBM Faculty Partnership Award, an ARO MURI Grant DAAH04-96-1-0007, and NSF Young Investigator Award CCR-9357849, with matching funds from IBM, Mitsubishi, Schlumberger Foundation, Shell Foundation, and Xerox Corporation.

Abstract

Motivated by applications such as document and image classification in information retrieval, we consider the problem of clustering dynamic point sets in a metric space. We propose a model called incremental clustering which is based on a careful analysis of the requirements of the information retrieval application, and which should also be useful in other applications. The goal is to efficiently maintain clusters of small diameter as new points are inserted. We analyze several natural greedy algorithms and demonstrate that they perform poorly. We propose new deterministic and randomized incremental clustering algorithms which have a provably good performance. We complement our positive results with lower bounds on the performance of incremental algorithms. Finally, we consider the dual clustering problem where the clusters are of fixed diameter, and the goal is to minimize the number of clusters.

1 Introduction

We consider the following problem: as a sequence of points from a metric space is presented, efficiently maintain a clustering of the points so as to minimize the maximum cluster diameter. Such problems arise in a variety of applications, in particular in document and image classification for information retrieval. We propose a model called incremental clustering based primarily on the requirements of the information retrieval application, although our model should also be relevant to other applications. We begin by analyzing several natural greedy algorithms and discover that they perform rather poorly in this setting. We then identify some new deterministic and randomized algorithms with provably good performance. We complement our positive results with lower bounds on the performance of incremental algorithms. We also consider the dual clustering problem where the clusters are of fixed diameter, and the goal is to minimize the number of clusters. Before describing our results in any greater detail, we motivate and formalize our new model.

Clustering is used for data analysis and classification in a wide variety of applications [1, 12, 20, 27, 34]. It has proved to be a particularly important tool in information retrieval for constructing a taxonomy of a corpus of documents by forming groups of closely-related documents [13, 16, 34, 35, 37, 38]. For this purpose, a distance metric is imposed over documents, enabling us to view them as points in a metric space. The central role of clustering in this application is captured by the so-called cluster hypothesis: documents relevant to a query tend to be more similar to each other than to irrelevant documents and hence are likely to be clustered together. Typically, clustering is used to accelerate query processing by considering only a small number of representatives of the clusters, rather than the entire corpus. In addition, it is used for classification [11] and has been suggested as a method for facilitating browsing [9, 10].

The current information explosion, fueled by the availability of hypermedia and the World-Wide Web, has led to the generation of an ever-increasing volume of data, posing a growing challenge for information retrieval systems to efficiently store and retrieve this information [40]. A major issue that document databases are now facing is the extremely high rate of update. Several practitioners have complained that existing clustering algorithms are not suitable for maintaining clusters in such a dynamic environment, and they have been struggling with the problem of updating clusters without frequently performing complete reclustering [4, 5, 6, 8, 35]. From a theoretical perspective, many different formulations are possible for this dynamic clustering problem, and it is not clear a priori which of these best addresses the concerns of the practitioners. After a careful study of the requirements, we propose the model described below.
Hierarchical Agglomerative Clustering. The clustering strategy employed almost universally in information retrieval is Hierarchical Agglomerative Clustering (HAC) [12, 34, 35, 37, 38, 39]. This is also popular in other applications such as biology, medicine, image processing, and geographical information systems. The basic idea is: initially assign the input points to distinct clusters; repeatedly merge pairs of clusters until their number is sufficiently small. Many instantiations have been proposed and implemented, differing mainly in the rule for deciding which clusters to merge at each step. Note that HAC computes hierarchy trees of clusters (also called dendograms) whose leaves are individual points and whose internal nodes correspond to clusters formed by merging the clusters at their children. A key advantage of these trees is that they permit refinement of responses to queries by moving down the hierarchy. Typically, the internal nodes are labeled with indexing information (sometimes called conceptual information) used for processing queries and in associating semantics with clusters (e.g., for browsing). Experience shows that HAC performs extremely well both in terms of efficiency and cluster quality. In the dynamic setting, it is desirable to retain the hierarchical structure while ensuring efficient update and high-quality clustering. An important goal is to avoid any major modifications in the clustering while processing updates, since any extensive recomputation of the index information will swamp the cost of clustering itself. The input size in typical applications is such that super-quadratic time is impractical, and in fact it is desirable to obtain close to linear time.

A Model for Incremental Clustering. Various measures of distance between documents have been proposed in the literature, but we will not concern ourselves with the details thereof; for our purposes, it suffices to note that these distance measures induce a metric space. Since documents are usually represented as high-dimensional vectors, we cannot make any stronger assumption than that of an arbitrary metric space, although, as we will see, our results improve in geometric spaces.
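As a concrete illustration of the HAC scheme described above, the following is a minimal static sketch in Python. The merge rule shown (minimize the diameter of the merged pair) is just one possible instantiation, the quadratic scan is purely illustrative rather than an efficient implementation, and all names are ours; for brevity the sketch returns only the final clusters rather than the dendogram.

```python
# Illustrative sketch of Hierarchical Agglomerative Clustering (HAC).
# Assumptions (ours, not the paper's): points come with a metric `dist`,
# and we merge the pair of clusters whose union has the smallest diameter,
# stopping once k clusters remain.

def diameter(cluster, dist):
    """Maximum inter-point distance within a cluster."""
    return max((dist(p, q) for p in cluster for q in cluster), default=0.0)

def hac(points, k, dist):
    """Return k clusters (lists of points) built bottom-up."""
    clusters = [[p] for p in points]              # every point starts alone
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = diameter(clusters[i] + clusters[j], dist)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
        del clusters[j]
    return clusters
```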

Formally, the clustering problem is: given n points in a metric space M, partition the points into k clusters so as to minimize the maximum cluster diameter. The diameter of a cluster is defined to be the maximum inter-point distance in it. Sometimes the objective function is chosen to be the maximum cluster radius. In Euclidean spaces, the radius denotes the radius of the minimum ball enclosing all points in the cluster. To extend the notion of radius to arbitrary metric spaces, we first select a center point in each cluster, whereupon the radius is defined as the maximum distance from the center to any point in the cluster. We will assume the diameter measure as the default.

We define the incremental clustering problem as follows: for an update sequence of n points in M, maintain a collection of k clusters such that as each input point is presented, either it is assigned to one of the current k clusters, or it starts off a new cluster while two existing clusters are merged into one. We define the performance ratio of an incremental clustering algorithm as the maximum over all update sequences of the ratio of its maximum cluster diameter (or, radius) to that of the optimal clustering for the input points.

Our model enforces the requirement that at all times an incremental algorithm should maintain a HAC for the points presented up to that time. As before, an algorithm is free to use any rule for choosing the two clusters to merge at each step. This model preserves all the desirable properties of HAC while providing a clean extension to the dynamic case. In addition, it has been observed that such incremental algorithms exhibit good paging performance when the clusters themselves are stored in secondary storage, while cluster representatives are preserved in main memory [32].

We have avoided labeling this model as the online clustering problem or referring to the performance ratio as a competitive ratio [25] for the following reasons. Recall that in an online setting, we would compare the performance of an algorithm to that of an adversary which knows the update sequence in advance but must process the points in the order of arrival. Our model has a stronger requirement for incremental algorithms, in that they are compared to adversaries which do not need to respect the input ordering, i.e., we compare our algorithms to optimal clusterings of the final point set, and no intermediate clusterings need be maintained. Also, online algorithms are permitted super-polynomial running times. In contrast, our model essentially requires polynomial-time approximation algorithms which are constrained to incrementally maintain a HAC. It may be interesting to explore the several different formulations of online clustering; for example, when the newly inserted point starts off a new cluster, we could allow the points of one old cluster to be redistributed among the remaining clusters, rather than requiring that two clusters be merged together. The problem with such formulations is that they do not lead to HACs; moreover, they entail the recomputation of the index structures for all clusters, which renders the algorithms useless from the point of view of the applications under consideration here.

Previous Work in Static Clustering. The closely-related problems of clustering to minimize the diameter and the radius are also called pairwise clustering and the k-center problem, respectively [2, 21]. Both are NP-hard [17, 28], and in fact hard to approximate to within factor 2 for arbitrary metric spaces [2, 21]. For Euclidean spaces, clustering on the line is easy [3], but in higher dimensions it is NP-hard to approximate to within factors close to 2, regardless of the metric used [14, 15, 19, 29, 30]. The furthest point heuristic due to Gonzalez [19] (see also Hochbaum and Shmoys [23, 24]) gives a 2-approximation in all metric spaces. This algorithm requires O(kn) distance computations, and when the metric space is induced by shortest-path distances in weighted graphs, the running time is dominated by the cost of computing these distances. Feder and Greene [14] gave an implementation for Euclidean spaces with running time O(n log k).
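For reference, here is a minimal sketch of the furthest-point heuristic mentioned above: repeatedly pick as the next center the point furthest from the centers chosen so far, then assign each point to its nearest center. The variable names and tie-breaking are ours; the sketch uses O(kn) distance computations, matching the count cited above.

```python
# Sketch of Gonzalez's furthest-point heuristic (static 2-approximation).
# `dist` is assumed to be a metric; `points` is a non-empty list.

def furthest_point_clustering(points, k, dist):
    centers = [points[0]]                          # arbitrary first center
    # d[i] = distance from points[i] to its closest chosen center
    d = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=lambda i: d[i])
        centers.append(points[i])                  # furthest point becomes a center
        d = [min(d[i], dist(points[i], centers[-1])) for i in range(len(points))]
    # assign every point to its nearest center
    clusters = {id(c): [] for c in centers}
    for p in points:
        c = min(centers, key=lambda c: dist(p, c))
        clusters[id(c)].append(p)
    return list(clusters.values())
```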
Overview of Results. Our results for incremental clustering show that it is possible to obtain algorithms that are comparable to the best possible in the static setting, both in terms of efficiency and performance ratio. We begin in Section 2 by considering natural greedy algorithms that choose the clusters to merge based on some measure of the resulting cluster. We establish that greedy algorithms behave poorly by proving that the Center-Greedy algorithm has a tight performance ratio of 2k − 1, and that the Diameter-Greedy algorithm has a lower bound of Ω(log k). It seems likely that greedy algorithms behave better in geometric spaces, and we discover some evidence in the case of the line. We show that Diameter-Greedy has performance ratio 2 for k = 2 on the line. This analysis suggests a variant of Diameter-Greedy, and this variant is shown to achieve ratio 3 for all k on the line. In Section 3 we present the Doubling Algorithm and show that its performance ratio is 8, and that a randomized version has ratio 2e ≈ 5.44. While the obvious implementation of these algorithms is expensive, we show that they can be implemented so as to achieve O(k log k) amortized time per update. These results for the Doubling Algorithm carry over to the radius measure. Then, in Section 4, we present the Clique Algorithm and show that it has performance ratio 6, and that a randomized version achieves an improved ratio. While the Clique Algorithm may appear to dominate the Doubling Algorithm, this is not the case, since the former requires computing clique partitions, an NP-hard problem, although it must be said in its defense that the clique partitions need only be computed in graphs with k + 1 vertices. While the performance ratio of the Clique Algorithm is 8 for the radius measure, improved bounds are possible for d-dimensional Euclidean spaces; specifically, we show that the radius performance ratio of the Clique Algorithm in R^d improves to 4(1 + r_d), where r_d = sqrt(d/(2(d+1))), which is 6 for d = 1 and is asymptotic to 6.83 for large d. In Section 5, we provide lower bounds for incremental clustering algorithms. We show that even for k = 2 and on the line, no deterministic or randomized algorithm can achieve a ratio better than 2. We improve this lower bound to 1 + sqrt(2) ≈ 2.414 for deterministic algorithms in general metric spaces. Finally, in Section 6 we consider the dual clustering problem of minimizing the number of clusters of a fixed radius. Since it is impossible to achieve bounded ratios for general metric spaces, we focus on d-dimensional Euclidean spaces. We present an incremental algorithm that has performance ratio O(2^d d log d), and also provide a lower bound of Ω(log d / log log log d).

Many interesting directions for future research are suggested by our work. There are the obvious questions of improving our upper and lower bounds, particularly for the dual clustering problem. An important theoretical question is whether the geometric setting permits better ratios than do general metric spaces. Our model can be generalized in many different ways. Depending on the exact application, we may wish to consider other measures of clustering quality, such as the minimum variance in cluster diameter, or the sum of squares of the inter-point distances within a cluster. Then, there is the issue of handling deletions which, though not important for our motivating application of information retrieval, may be relevant in other applications. Finally, there is the question of formulating a model for adaptive clustering, wherein the clustering may be modified as a result of queries and user feedback, even without any updates.

2 Greedy Algorithms

We begin by examining some natural greedy algorithms. A greedy incremental clustering algorithm always merges clusters so as to minimize some fixed measure. Our results indicate that such algorithms perform poorly.

Definition 1 The Center-Greedy Algorithm associates a center with each cluster and merges the two clusters whose centers are closest. The center of the old cluster with the larger radius becomes the new center. It is possible to define variants of Center-Greedy based on how the centers of the clusters are picked, but we restrict ourselves to this definition for reasons of simplicity and intuitiveness.

Definition 2 The Diameter-Greedy Algorithm always merges those two clusters which minimize the diameter of the resulting merged cluster.
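To make Definitions 1 and 2 concrete, here is a small illustrative skeleton that keeps at most k clusters and applies either greedy merge rule whenever a new point would create a (k+1)-st cluster. The class, method names, and data structures are ours and are not meant to be efficient.

```python
# Illustrative skeleton for greedy incremental clustering (Definitions 1 and 2).
# Each new point starts its own cluster; once k+1 clusters exist, the two
# clusters chosen by the greedy rule are merged. `dist` is assumed to be a metric.

class GreedyIncremental:
    def __init__(self, k, dist, rule="diameter"):
        self.k, self.dist, self.rule = k, dist, rule
        self.clusters = []                       # list of (center, list-of-points)

    def _merge_cost(self, a, b):
        (ca, pa), (cb, pb) = a, b
        if self.rule == "center":                # Center-Greedy: inter-center distance
            return self.dist(ca, cb)
        pts = pa + pb                            # Diameter-Greedy: diameter of the union
        return max(self.dist(p, q) for p in pts for q in pts)

    def _radius(self, c, pts):
        return max(self.dist(c, p) for p in pts)

    def insert(self, p):
        self.clusters.append((p, [p]))           # new point opens a new cluster
        if len(self.clusters) <= self.k:
            return
        pairs = [(i, j) for i in range(len(self.clusters))
                 for j in range(i + 1, len(self.clusters))]
        i, j = min(pairs, key=lambda ij: self._merge_cost(self.clusters[ij[0]],
                                                          self.clusters[ij[1]]))
        (ci, pi), (cj, pj) = self.clusters[i], self.clusters[j]
        # Center-Greedy keeps the center of the cluster with the larger radius.
        center = ci if self._radius(ci, pi) >= self._radius(cj, pj) else cj
        merged = (center, pi + pj)
        self.clusters = [c for t, c in enumerate(self.clusters) if t not in (i, j)]
        self.clusters.append(merged)
```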
We can establish the following lower bounds on the performance ratio of these two greedy algorithms. We omit the proofs in this extended abstract.

Theorem 1 The Center-Greedy Algorithm has performance ratio at least 2k − 1.

Theorem 2 The Diameter-Greedy Algorithm has performance ratio Ω(log k), even on the line.

We now give a tight upper bound for the Center-Greedy Algorithm. Note that for k = 3 it has ratio 5, but for larger k its performance is worse than that of the algorithms to be presented later.

Theorem 3 The Center-Greedy Algorithm has performance ratio 2k − 1 in any metric space.

Proof: Suppose that a set S of n points is inserted. Let their optimal clustering be the partition P = {C*_1, ..., C*_k}, with d as the optimal diameter. We will show that the diameter of any cluster produced by Center-Greedy is at most (2k − 1)d.

We define a graph G on the set P of the optimal clusters, where two clusters are connected by an edge if the minimum distance between them is at most d, where the distance between two clusters is the minimum distance between points in them. Consider the connected components of G. Note that two clusters in different connected components have minimum distance strictly greater than d. We say that a cluster C intersects a connected component consisting of the optimal clusters C*_{i_1}, ..., C*_{i_l} if C intersects their union.

We claim that at all times, any cluster produced by Center-Greedy intersects exactly one connected component of G. We prove this claim by induction over the sequence of insertions. Suppose the claim is true before a new point p arrives. Initially, p is in a cluster of its own and Center-Greedy has k + 1 clusters, each of which intersects exactly one connected component of G. Since there are k + 1 cluster centers, two of them must be in the same optimal cluster. This implies that the distance between the two closest centers is at most d. If C_1 and C_2 are the clusters that Center-Greedy merges at this stage, the centers of C_1 and C_2 must be at most d apart. Hence, both clusters' centers must lie in the same connected component of G, say H. By the inductive hypothesis, all points in C_1 and C_2 must be in H. Hence, all points in the new cluster C_1 ∪ C_2 must lie in H, establishing the inductive hypothesis.

Since each cluster produced by Center-Greedy lies in exactly one connected component of G, its diameter is bounded by the maximum diameter of a connected component, which is at most (2k − 1)d: a component consists of at most k optimal clusters, each of diameter at most d, linked by gaps of length at most d.

For Diameter-Greedy in general metric spaces, we only have the following weak upper bound; the proof is deferred to the final version.

Theorem 4 For k = 2, the Diameter-Greedy Algorithm has a constant performance ratio in any metric space.

In spite of the lower bounds for greedy algorithms, they may not be entirely useless, since some variant may perform well in geometric spaces. We obtain some positive evidence in this regard via the following analysis for the line. The upper bounds given here should be contrasted with the lower bound of 2 for the line shown in Section 5. The following definitions underlie the analysis.

Definition 3 Given a set S of points on the line, a t-partition subdivides the interval between the first and last points of S into t subintervals whose endpoints are in S. The t-diameter of S is the minimum over all t-partitions of the maximum interval length in a t-partition of S. The 1-diameter is the diameter, while the 2-diameter is the radius of S where the center is constrained to be a point of S.

We define the following family of algorithms based on the notion of the t-diameter.

Definition 4 The t-Diameter Greedy Algorithm merges those two clusters which minimize the t-diameter of the merged cluster. Note that 1-Diameter Greedy is the same as Diameter-Greedy.

While Diameter-Greedy has ratio 2 for k = 2 and ratio 3 for k = 3, we can show a lower bound of Ω(log k) on its performance ratio on the line.

Theorem 5 The Diameter-Greedy Algorithm for the line has performance ratio 2 for k = 2 and performance ratio 3 for k = 3.

Unlike Diameter-Greedy, we can show that 3-Diameter Greedy has a bounded performance ratio on the line.

Theorem 6 The 3-Diameter Greedy Algorithm has performance ratio 3 on the line.

Proof: In fact, we show that it produces a clustering with 3-diameter at most the optimal diameter, and the factor of 3 follows. Assume this holds before the last two clusters are merged. Let I_1, I_2, ..., I_k be the intervals in the optimal clustering, with maximum diameter d. Let C_1, C_2, ..., C_{k+1} be the current clusters, each with 3-diameter at most d, of which two must be merged. If C_i starts in interval I_{a_i} and ends in interval I_{b_i}, write φ_i = (a_i, b_i); notice that a_1 = 1 and b_{k+1} = k. We assume that if C_i ends in I_j then C_{i+1} starts in I_j; otherwise, we could replace the argument over the intervals I_1, ..., I_k by an argument either over the first j intervals I_1, ..., I_j, if there are at least j + 1 clusters C_i in this region, or over the last k − j intervals I_{j+1}, ..., I_k, if there are at least k − j + 1 current clusters C_i in this region. Now, the bounds imply that for some i, we have b_i = a_{i+1} or b_i + 1 = a_{i+1}. If b_i = a_{i+1} = j, then the merge of C_i and C_{i+1} is contained in a single interval I_j and has diameter at most d. If, say, b_i = j and a_{i+1} = j + 1, then the gap G between the two consecutive intervals I_j and I_{j+1} involved is at most d, since a current cluster with 3-diameter at most d spans it, so the merge of C_i and C_{i+1} has 3-diameter at most d, as given by the partition into I_j, G, and I_{j+1}. This completes the proof.

We comment briefly on the running time of this algorithm. In the above proof, the 3-diameter of an interval may be replaced by an easily-computed upper bound: at the time of creation of an interval [x, y], let (u, v) be the gap containing the midpoint (x + y)/2, and let the upper bound be max(u − x, v − u, y − v). Maintaining the points sorted in a balanced tree, the running time is O(log n) for each of the n points inserted.
3 The Doubling Algorithm

We now describe the Doubling Algorithm, which has performance ratio 8 for incremental clustering in general metric spaces. The algorithm works in phases and uses two parameters α and β to be specified later, such that α/(α − 1) ≤ β. At the start of phase i, it has a collection of k + 1 clusters C_1, C_2, ..., C_{k+1} and a lower bound d_i on the optimal clustering's diameter (denoted by OPT). Each cluster C_j has a center c_j which is one of the points of the cluster. The following invariants are assumed at the start of phase i: (a) for each cluster C_j, the radius of C_j, defined as max over p in C_j of dist(c_j, p), is at most α d_i; (b) for each pair of clusters C_j and C_l, the inter-center distance dist(c_j, c_l) is at least d_i; and (c) d_i ≤ OPT.

Each phase consists of two stages: the first is a merging stage, in which the algorithm reduces the number of clusters by merging certain pairs; the second is the update stage, in which the algorithm accepts new updates and tries to maintain at most k clusters without increasing the radius of the clusters or violating the invariants (clearly, it can always do so by making the new points into separate clusters). A phase ends when the number of clusters again exceeds k.

Definition 5 The δ-threshold graph on a set of points P = {p_1, p_2, ..., p_m} is the graph G = (P, E) such that (p_i, p_j) ∈ E if and only if dist(p_i, p_j) ≤ δ.

The merging stage works as follows. Define d_{i+1} = β d_i, and let G be the d_{i+1}-threshold graph on the k + 1 cluster centers c_1, c_2, ..., c_{k+1}. The graph G is used to merge clusters by repeatedly performing the following steps while the graph is non-empty: pick an arbitrary cluster C_j in G and merge all its neighbors into it; make c_j the new cluster's center; and remove C_j and its neighbors from G. Let C'_1, C'_2, ..., C'_m be the new clusters at the end of the merging stage. Note that it is possible that m = k + 1, when the graph G has no edges, in which case the algorithm will be forced to declare the end of phase i without going through the update stage.

Lemma 1 The pairwise distance between cluster centers after the merging stage of phase i is at least d_{i+1}.

Lemma 2 The radius of the clusters after the merging stage of phase i is at most d_{i+1} + α d_i ≤ α d_{i+1}.

Proof: Prior to the merging, the distance between two clusters which are adjacent in the threshold graph is at most d_{i+1}, and their radius is at most α d_i. Therefore, the radius of the merged clusters is at most

    α d_i + d_{i+1} ≤ (α/β + 1) d_{i+1} ≤ α d_{i+1},

where the last inequality follows from the choice that α/β + 1 ≤ α, i.e., α/(α − 1) ≤ β.

The update stage continues while the number of clusters is at most k. When a new point arrives, the algorithm attempts to place it in one of the current clusters without exceeding the radius bound α d_{i+1}; otherwise, a new cluster is formed with the update as the cluster center. When the number of clusters reaches k + 1, phase i ends, and the current set of k + 1 clusters along with d_{i+1} are used as the input for the (i + 1)st phase.

All that remains to be specified about the algorithm is the initialization. The algorithm waits until k + 1 points have arrived and then enters phase 1 with each point as the center of a cluster containing just itself, and with d_1 set to the distance between the closest pair of points. It is easily verified that the invariants hold at the start of phase 1.
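The following is a compact sketch of the phase structure just described, instantiated with α = β = 2 (the choice adopted later in this section). The class and method names are ours, and the quadratic-time merging is purely illustrative; the heap-based implementation discussed below achieves O(k log k) amortized time per update.

```python
# Illustrative sketch of the Doubling Algorithm with alpha = beta = 2.
# Clusters are stored as [center, points]; `dist` is assumed to be a metric.

class DoublingClustering:
    def __init__(self, k, dist):
        self.k, self.dist = k, dist
        self.d = None                                # current lower bound d_i on OPT
        self.clusters = []                           # list of [center, points]

    def insert(self, p):
        if self.d is None:                           # initialization: first k+1 points are singletons
            self.clusters.append([p, [p]])
            if len(self.clusters) == self.k + 1:
                self.d = min(self.dist(a[0], b[0])
                             for i, a in enumerate(self.clusters)
                             for b in self.clusters[i + 1:])
                while len(self.clusters) > self.k:
                    self._merge_stage()
            return
        for center, pts in self.clusters:            # update stage
            if self.dist(center, p) <= 2 * self.d:   # radius bound alpha * d_{i+1}
                pts.append(p)
                return
        self.clusters.append([p, [p]])               # point opens a new cluster
        while len(self.clusters) > self.k:           # phase ends: merge (repeat if no merge helped)
            self._merge_stage()

    def _merge_stage(self):
        self.d *= 2                                  # d_{i+1} = beta * d_i
        remaining, merged = list(self.clusters), []
        while remaining:
            center, pts = remaining.pop(0)           # pick an arbitrary cluster
            keep = []
            for c2, pts2 in remaining:               # absorb its threshold-graph neighbors
                if self.dist(center, c2) <= self.d:
                    pts = pts + pts2
                else:
                    keep.append([c2, pts2])
            merged.append([center, pts])
            remaining = keep
        self.clusters = merged
```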

The following lemma shows that the clusters at the end of the ith phase satisfy the invariants for the (i + 1)st phase.

Lemma 3 The k + 1 clusters at the end of the ith phase satisfy the following conditions:

1. The radius of the clusters is at most α d_{i+1}.

2. The pairwise distance between the cluster centers is at least d_{i+1}.

3. d_{i+1} ≤ OPT, where OPT is the diameter of the optimal clustering for the current set of points.

Proof: We have k + 1 clusters at the end of the phase, since that is the terminating condition. From Lemma 2, the radius of the clusters after the merging stage is at most α d_{i+1}, and from the description of the update stage this bound is not violated by the insertion of new points. The distance between the cluster centers after the merging stage is at least d_{i+1}, and a new cluster is created only if a request point is at least d_{i+1} away from all current cluster centers. Therefore, the cluster centers have pairwise distance at least d_{i+1}. Since at the end of the phase we have k + 1 cluster centers that are at least d_{i+1} apart, the optimal clustering is forced to put at least two of them in the same cluster. It follows that OPT ≥ d_{i+1}.

Based on these lemmas, we make the following observations. The algorithm ensures the invariant that d_i ≤ OPT at the start of phase i. The radius of the clusters during phase i is at most α d_{i+1}. Therefore, the performance ratio at any time during the phase is at most 2 α d_{i+1} / OPT ≤ 2 α β d_i / OPT ≤ 2 α β. We choose α = β = 2; note that this choice satisfies the condition that α/(α − 1) ≤ β. This leads to the following performance bound, which can be shown to be tight.

Theorem 7 The Doubling Algorithm has performance ratio 8 in any metric space, and this is tight.

An examination of the proof of the preceding theorem shows that the radius of the clusters is within a factor 4 of the diameter of the optimal clustering, leading to the following result.

Corollary 1 The Doubling Algorithm has performance ratio 8 for the radius measure.

A simple modification of the Doubling Algorithm, in which we pick the new cluster centers by a simple left-to-right scan, improves the ratio for the case of the line.

While the obvious implementation of this algorithm appears to be inefficient, we can establish the following time bound, which is close to the best possible.

Theorem 8 The Doubling Algorithm can be implemented to run in O(k log k) amortized time per update.

Proof: First of all, we assume that there is a black-box for computing the distance between two points in the metric space in unit time. This is a reasonable assumption in most applications, and in any case even the static algorithms' analysis requires such an assumption. In the information retrieval application, the documents are represented as vectors and the black-box implementation will depend on the vector length as well as the exact definition of distance.

We now show how the Doubling Algorithm may be implemented so that the amortized time required for processing each new update is bounded by O(k log k). We maintain the edge lengths of the complete graph induced by the current cluster centers in a heap. Since there are at most k + 1 clusters, the space requirement is O(k^2). When a new point arrives, we compute the distance of this point to each of the current
cluster centers, which requires   
time. If the point is ad- the expected value of
D is bounded by
ded to one of the current clusters, we are done. If, on the
 % &  ( 1 

#<   <  
other hand the new point initiates a new cluster, we insert into
D  D
 
D d
 
1) 
   
the heap edges labeled with the distances between this new
( 1
center and the existing cluster centers which takes
  OPT
 <

 
 D  < D d


 
time. For accounting purposes in the amortized analysis, we

  1
) 
associate credits with each inserted edge. We will show
 OPT 


   
that it is possible to charge the cost of implementing the mer-  
    * OPT


ging stage of the algorithm to the credits associated with the
edges. This implies the desired time bound.
Therefore, the expected diameter is at most  OPT .
We can assume, without loss of generality, that the mer-
ging stage merges at least two clusters. Let  be the threshold
4 The Clique Partition Algorithm
used during the phase. The algorithm extracts all the edges
from the heap which have length less than  . Let be the
We now describe the Clique Algorithm which has per-
    
number of edges deleted from the heap. The deletion from
formance ratio 6. This does not totally improve upon the
the heap costs time. The  -threshhold graph on
Doubling Algorithm since the new algorithm involves solv-
the cluster centers is exactly the graph induced by these
ing the NP-hard clique partition problem, even though it is
edges. It is easy to see that the procedure described to find
the new cluster centers using the threshold graph takes time

only on a graph with  vertices. Finding a minimum
clique partition is NP-hard even for graphs induced by points
linear in the number of edges of the graph, assuming that the
in the Euclidean plane [17], although it is in polynomial time
edges are given in the form of an adjacency list. Forming the
for points on the line. Since the algorithm needs to solve the
adjacency list from the edges takes linear time. Therefore, the
      clique partition problem on graphs with  vertices, this 
     
total cost of the merging phase is bounded by
time. The credit of   
placed with each
may not be too inefficient for small . 
edge when it is inserted in to the heap accounts for this cost, Definition 6 Given an undirected unweighted graph ; 
completing the proof. 
+ 3 
 , an , -clique partition is a partition of + + 1 C+ C

Finally, we describe a Randomized Doubling Algorithm
$
5
C'+ 8 such that the the induced graphs ;%-+ = & ’s are cliques.
with significantly better performance ratio. The algorithm A minimum clique partition is an , -clique partition with the
is essentially the same as before, the main change being in minimum possible value of , .
the value of 1 which is the lower bound for phase  . In the
deterministic case we chose 1 to be the minimum pairwise The Clique Algorithm is similar to the Doubling algorithm

distance of the first '  points, say . We now choose in that it also operates in phases which have a merging stage
a random value from %I5 3$& according to the probability followed by an update stage. The invariants maintained by
density function $ , set 1 to , and redefine 1 and the algorithm are different though. At the start of the  th
0 5  
     . Similar randomization of doubling algorithms 
phase we have   clusters /1$3A/ 3$
$
5
737/98 1 and a value =
such that: (a) the radius of each cluster / F is at most = ; the
diameter of each cluster / F is at most  = ; and, (c)  =; OPT .
was used earlier in scheduling [31], and later in other applic-
ations [7, 18].
The merging works as follows. Let  =" 1   = . We form
Theorem 9 The Randomized Doubling Algorithm has ex- the minimum clique partition of the  = 1 -threshold graph ;

pected performance ratio 
 in any metric space. of the cluster centers. The new clusters are then formed by
The same bound is also achieved for the radius measure. merging the clusters in each clique of the clique partition.
We arbitrarily choose one cluster from each clique and make
its center the cluster center of the new merged cluster. Let
Proof: Let be the sequence of updates and let the op-
timal cluster diameter for be
, where is the minimum
/1 37/ 3$
5
$
637/8& . be the resulting clusters. In the rest of the

pairwise distance of the first   points. The optimal value phase we also need to know which old clusters merged to
is at least , hence
:  . Suppose we choose  1
form each of the new clusters.
for some  6$    35
& . Let D be the maximum radius of the Lemma 4 The radius of the clusters after the merging stage
clusters created for with this value of . Using arguments is at most = 1 and the diameter is at most = 1 .
D is at most = 1 0 =>  =" 1  1A    , where  is the
similar to those in the proof of Theorem 7, we can show that
  Proof: Let / F ?53A/ F $ 35
$
$
A37/ F/ H be the clusters whose
largest integer such that =  = 1  1  = 1  OPT  union is the new cluster /F  and without loss of generality as-

. Let  be the integer such that  = 1 


 = $ and sume that the center of / F ? was chosen to be the center of


 =  .  Then, D   D1 ! when #"  , and D   D   1%& /F  . Since the centers of / F ?537/ F $ 3$
5
$
73A/ F / H induce a clique
. Let <  and <   be indicator variables for the in the = 1 -threshold graph, the distance from the new center
when 
the events % 
D '" D & , respectively. We claim that
& and % to each of the old cluster centers is at most  = 1 . The radius
of each of / F ?53A/ F $ 35
$
$
A37/ F/ H is at most = . Therefore it fol- Theorem 10 The Clique Algorithm has performance ratio

lows that the new radius is at most  =" 1   =   =" 1 and in any metric space, and this is tight.
the diameter is at most == 1  =   = 1 .
During the update phase, a new point is handled as fol- Since the radius of the clusters is within of the optimal
lows. Let the current number of clusters be , where , =  diameter, we obtain the following corollary.

 . Recall that /1 37/ 3$
5
$
73A/8& . are the clusters formed Corollary 2 The Clique Algorithm has performance ratio 
during the merging stage. If there exists such that " , =
   
and  32 F    = 1 , or if  , = and  32 F  =" 1 where
in any metric space for the radius measure.
/ F is a cluster which merged to form /F  , add to the cluster
/F  . If no such exists, make a new cluster with as the cen-
As in the case of the Doubling Algorithm, we can use ran-
domization to improve the bound. Let be the minimum dis-
ter. The phase ends when the number of clusters exceeds ,  
tance among the first   points. The randomized algorithm

or if there are   clusters at the end of the merging phase. sets  1  in phase  of the deterministic algorithm, where
The intuition behind the new algorithm is the following. is chosen from %I5 3$& according to the probability density
At the beginning of the phase we have   clusters and  1 . The analysis is similar to that of Theorem 9

a lower bound on the optimal. We use the lower bound to
function
D
 
and we omit the details.
increase the radius of our existing clusters and merge some
of them. To maintain the invariant for the lower bound in Theorem 11 The Randomized Clique Algorithm has per-
the next phase we need to ensure during this merging that formance ratio 
 in any metric space.

the number of clusters we have after the merging is no more 
that what the optimal algorithm can achieve using the lower Corollary 3 The Randomized Clique Algorithm has per-

bound for the next phase. The doubling algorithm achieved formance ratio

for the radius measure in any


this by picking an independent set as the new cluster cen- metric space.
ters in the distance threshold graph. The weakness of this
approach is that we have a bound on the diameter, only as The special structure of the clusters in the Clique Al-
a function of the radius of the new cluster. We get the im- gorithm can be used to show that the performance ratio for
provement by observing that a better bound on the number the radius measure is better than  for the geometric case.
of clusters achievable by the optimal with diameter bounded This is based on the following result in geometry; we defer
by  = is the size of the minimum clique partition of the dis- the proofs of the proposition and its consequence.
tance threshold graph. We still need a condition on the radius
in order to do the doubling, but now, since we use cliques, we Proposition 12 Any convex set in  of diameter at most 
can bound the diameter of the new clusters better than twice can be circumscribed by a sphere of radius  , where  sat-
the radius. isfies the following recurrence with the base case 1!5 ,
The following lemmas show that the clusters at the end of 
phase  satisfy the invariants for phase    .
9   1

 
Lemma 5 The radius of the clusters at the end of the phase
The solution to this recurrence is !     .
 is at most = 1 and the diameter of the clusters is at most
= 1 . 
Lemma 6 At the end of phase  ,  =" 1  
Theorem 13 The Clique Algorithm has performance ratio

OPT .
  for the radius measure in  . This implies perform-
Proof: Suppose  = 1 " OPT . Let +' -2 1 32 35
$
$
A32 8 1 : ance ratio
for  ! ,

 for   , and

 asymptotically
be the cluster centers at the beginning of the phase. Note that for large  .
the centers 2  1 3$
5
$
732 8&. belong to + . Let + >-2 F
  " , =:
be the set of cluster centers of the clusters which are formed 5 Lower Bounds
2 F  in +  started a new cluster  2 F  3
 
in phase  after the merging stage. Since each of the centers
" = 1 for all 
F
+9C+   -2  : . Therefore in the optimal solution each center in
We present some lower bounds on the performance of in-
cremental clustering. The lower bounds apply to both dia-
+  is in a cluster which contains no center in + . This implies meter and radius measures but our proofs are given for the
that the centers in + are contained in at most , =   clusters diameter case. The following theorem shows that even for
of diameter  = 1 . This is a contradiction since , = was the size the simplest geometric space, we cannot expect a ratio better
of the minimum clique partition of the  = 1 -threshold graph than 2; the proof is omitted.
on + .
The diameter of the clusters during phase  is at most Theorem 14 For 

 5 8 ) on the performance ratio of deterministic and ran-
, there is a lower bound of and
= 1 and we maintain the invariant that =  OPT at the
start of the phase. Therefore, the performance ratio of this domized algorithms, respectively, for incremental clustering
algorithm is at most =" 1  = 
. on the line.
In the case of general metric spaces, we can establish a clustering. The distribution on inputs is as follows. Initially,
stronger lower bound. the adversary provides points * 1 37*
$
5
6* such that the
distance between any two of them is  . Then the adversary
Theorem 15 There is a lower bound of   
# on
the performance ratio of any deterministic incremental clus-

partitions the points into disjoint sets + 1 37+
5
$
6+ 8 at
random, such that all partitions are equally likely. Finally
tering algorithm for arbitrary metric spaces.
the adversary provides points 1 3
    
3$
5
$
8 , such that 
 =3A* F   if * F  + = ,  =73A* F  if * F  + = ,
  
 
 
Proof: Consider a metric space consisting of the points
= F ,   3  ,   . The distances between the points
  =3 F  . Now, the diameter of the optimal solution
for any input in the distribution is  , obtained by construct-
are the shortest path distances in the graph with the following
distances specified:  = F 3 F =   , and  = F ? 3 = F $ 
 .     
ing the clusters + = C - = : . However, the incremental al-
Let =  - = F   
  3  : . Note that the sets  gorithm can produce a clustering with diameter  only if the
 = ,     , partition the metric space into clusters clusters it produces after it sees points *9143A*
5
$
6* are pre-
 cisely the sets + = (selected at random by the adversary). Let
 
of diameter . Let be any deterministic algorithm for
the incremental clustering problem. Let 
. Consider  <8 be the number of ways to partition the points into
the clusters produced by after it is given the  points = F  
 
sets. Then the probability that the incremental algorithm
described above. produces a clustering of diameter  is at most !54< 8 .
With probability at least   , the incremental algorithm pro-
Case 1: Suppose the maximum diameter of ’s clusters is  duces a clustering of diameter at least . Thus the expected
 . Then ’s clusters must be the
sets - = F 3 F = : . Now the

adversary gives a point such that  3 = F !# (any large    value of the diameter of the clustering produced is at least
 . Hence the expected value of the performance ratio is at
number will do) for   3  . The optimal clustering is least  . By chosing suitably large, < 8 can be made  
  
 
- : and the sets 2103 3 3
. The optimal diameter is arbitrarily large, and hence can be made arbitrarily small,
. We claim that the maximum diameter of is at least  in particular smaller than for any fixed " # .  
 . If the cluster that contains contains any other point 
then our claim is clearly true. If on the other hand, the cluster 6 Dual Clustering

that contains does not contain any other point, must 
have merged two of its existing clusters. Then the maximum We now consider the dual clustering problem: for a se-

diameter of ’s resulting clusters is at least  . Thus quence of points 1 3 35
$
$
A3    , cover each point with a

the performance ratio of is at least   .  unit ball in   as it arrives, so as to minimize the total number
of balls used. In the static case this problem is NP-Complete
Case 2: Suppose the maximum diameter of ’s clusters is  and there is a PTAS for any fixed dimension [22]. We note
greater than  . Then some cluster of contains points  that in general metric spaces, it is not possible to achieve any
  
which are at least distance
   
apart. Let these points be
3  3) . Now the adversary gives
points
bounded ratio.
and ,
= F ,      such that  = F 3 = F   = F 3 F =  .     Our algorithm’s analysis is based on a theorem from com-

The optimal clustering consists of the


sets - = F 3 = F 3 F =@: .
binatorial geometry called Rogers' theorem [36] (see also
Theorem 7.17 [33]), which says that  can be covered by
The optimal diameter is  . We claim that the maximum dia- any convex shape with covering density @  . Since   
 
meter of ’s clusters must be at least   . Note that
     the volume of a radius ball is  times the volume of a unit-
 = ? F ?53 = $ F $ : >

  for  3
1 3 1 . Also  = ? F ?53 = $ F $ 9: 

 13 1 ,  3 
for  1 3 1  * 3 .         radius ball, the number of balls needed to cover a ball of ra-
dius using balls of unit radius is    
$  . We     
If puts any two = F in the same cluster, our claim is clearly

true. If all the = F are in separate clusters, each of the
  
first describe an incremental algorithm which has perform-
ance ratio  . We also establish an asymptotic lower bound
clusters must contain one of them. Also one of the
clusters,
   of        
 
   

 ; for    and , our proof yields lower
say / must contain both the points  and . Then / must bounds of and , respectively.
have diameter at least 9 , since the = F in / must be at
distance at least   from one of  and . Hence the    Theorem 17 For the dual clustering problem in  ,
performance ratio of is at least   
  
. there is an incremental algorithm with performance ratio
Finally, we can improve the randomized lower bound  $  .
slightly for the case of arbitrary metric spaces.
Theorem 16 For any " # and : 
, there is a lower  Proof: Our algorithm maintains a set  of centers which
is a subset of the points that have arrived so far; initially,  
bound of  
on the performance ratio of any randomized  . Define the range  
of a center to be the sphere of
incremental algorithm.
radius about . For any two centers 1 and , we ensure

Proof: We use Yao’s technique and show a lower bound that  153  "

. Associated with each center is a set of
 
  
on deterministic algorithms on an appropriately chosen dis- points called the neighbors of . For convenience, we

tribution. Let be a deterministic algorithm for incremental assume that  . We ensure that all neighbors of lie
 . When a new point is received, if    
To determine the value of  is easy, since we have the in-
  
in for
some center , we add to , breaking ties arbitrarily. If equality that   
no such center exists, must be at a distance greater than 0    " 0  
from the requirement that the ball   have volume equal to
  
from all the existing centers. In this case, we make a new
center and set *  - : .
that of  unit balls. Now let  !9  . Substituting in the 
    
From Rogers' theorem on covering density, a sphere of
radius in   can be covered by   $   above equations we obtain that:
spheres of radius  . When a new center is created, we  
    1   
fix a set of spheres + 1 3A+    
3$
5
$
73A+    which cover     
        1  
  . Whenever a point is added to   


, if it is not already  0  
  
  1  " 0 

     6         1  "   
covered by some previously placed sphere, we add the sphere
+ D where is any value such that  + D . Note 
that such a sphere must exist as    
      
and the spheres
+ 1 3A+ 3$
5
$
73A+    cover completely.  Note that       1   . Using the fact that    " 

for   , we see that choosing A = such that
Since no two centers can be covered by a sphere of unit

      1    
radius, any solution must use a separate sphere to cover each
center. Hence, the number of centers is a lower bound for
the number of spheres used by the optimal offline algorithm. 
    
For each center , the incremental algorithm uses at most
 spheres to cover the points in . Hence, the per- will satisfy our requirements. Unfolding the recurrence,
  
    
formance ratio of the incremental algorithm is bounded by
  
5  . #1     1                  
 = =  =
The following theorem gives a lower bound for the dual =G 1  =G 1  =G 1  
clustering problem.
Noting that  1  , we obtain that    1 :
     1  5

Theorem 18 For the dual clustering problem in  , The lower bound is the smallest value of  for which    1 is
negative. Let   
 be the largest value of  for which
      

  
  
any incremental algorithm must have performance ratio
 
 
.  :
#
 1    


Proof: The idea is as follows. At time  , when  points

have been given by the adversary, it will be the case that the
points 1435
$
$
A3  can be covered by a ball of radius   .
   
         

Then, the adversary will find a point   1 lying outside the  This gives the desired lower bound.
unit balls laid down by the algorithm so as to minimize the
radius   1 of the ball required to cover all    points and
present that as a request. The game terminates when at some

time ' , we have for the first time that 8 1 "  . Clearly,
Acknowledgements
 is a lower bound on the performance ratio since the points
143$
5
$
63 8 can be covered by a ball of radius 8   , and We thank Pankaj Agarwal and Leonidas Guibas for help-

the algorithm has used balls up to that point. It remains to
ful discussions, and for suggesting that we consider the dual
analyze the worst-case growth rate of  as a function of  .
clustering problem.
Note that %1 # and $ .
Let 0 denote the volume of a unit ball in   . At time References
 , let 
  be any ball of radius (at most)  that covers the

points 1 35
$
5
73  . For some  to be specified later, define the
[1] M.S. Aldenderfer and R.K. Blashfield. Cluster Analysis.
ball     as a ball with the  same center as   and with radius
Sage, Beverly Hills, 1984.

   . We will choose  such that the volume of    is at [2] M. Bern and D. Eppstein. Approximation Algorithms for
Geometric Problems. In: D.S. Hochbaum, editor, Approx-
least 0  , implying that the current  unit balls placed by the imation Algorithms for NP-Hard Problems. PWS Publishing
algorithm cannot cover the entire volume of '  . This would Company, 1996.
imply that there is a choice of a point   1 inside    which [3] P. Brucker. On the complexity of clustering problems.
is not covered by the current  balls. It is also clear that the In: R. Henn, B. Korte, and W. Oletti, editors, Optimization
new set of '  points now can be covered by a ball of radius and Operations Research, Heidelberg, New York, NY, 1977,
at most     . implying that pp. 45–54.
[4] F. Can. Incremental Clustering for Dynamic Information Pro-
  cessing. ACM Transactions on Information Processing Sys-

  1 
 
tems, 11 (1993), pp. 143–164.
[5] F. Can and E.A. Ozkarahan. A Dynamic Cluster Maintenance [24] D.S. Hochbaum and D.B. Shmoys. A unified approach to ap-
System for Information Retrieval. In Proceedings of the Tenth proximation algorithms for bottleneck problems. Journal of
Annual International ACM SIGIR Conference, 1987, pp. 123- the ACM, 33 (1986), pp. 533–550.
131. [25] S. Irani and A. Karlin. Online Computation. In: D.S. Hoch-
[6] F. Can and N.D. Drochak II. Incremental Clustering for Dy- baum, editor, Approximation Algorithms for NP-Hard Prob-
namic Document Databases. In Proceedings of the 1990 Sym- lems. PWS Publishing Company, 1996.
posium on Applied Computing, 1990, pp. 61–67. [26] N. Jardine and C.J. van Rijsbergen. The Use of Hierarchical
[7] S. Chakrabarti, C. Phillips, A. Schulz, D.B. Shmoys, C. Stein, Clustering in Information Retrieval. Information Storage and
and J. Wein. Improved Scheduling Algorithms for Minsum Retrieval, 7 (1971), pp. 217–240.
Criteria. In Proceedings of the 23rd International Colloquium [27] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data.
on Automata, Languages and Programming, Springer, 1996. Prentice-Hall, NJ, 1988.
[8] B.B. Chaudhri. Dynamic clustering for time incremental data. [28] O. Kariv and S.L. Hakimi. An algorithmic approach to net-
Pattern Recognition Letters, 13 (1994), pp. 27-34. work location problems, part I: the -centers problem. SIAM


Journal of Applied Mathematics, 37 (1979), pp. 513–538.


[9] D.R. Cutting, D.R. Karger, J.O. Pederson, and J.W. Tukey.
[29] N. Megiddo and K.J. Supowit. On the complexity of some
Scatter/Gather: A Cluster-based Approach to Browsing Large
common geometric problems. SIAM Journal on Computing,
Document Collections. In Proceedings of the 15th Annual In-
13 (1984), pp. 182–196.
ternational ACM SIGIR Conference, 1992, pp. 318–329.
[30] S.G. Mentzer. Lower bounds on metric -center problems.
[10] D.R. Cutting, D.R. Karger, and J.O. Pederson. Constant
Manuscript, 1988.
interaction-time Scatter/Gather Browsing of Very Large Doc-
[31] R. Motwani, S. Phillips and E. Torng. Non-Clairvoyant
ument Collections. In Proceedings of the 16th Annual Inter-
Scheduling, In Proceedings of the Fourth ACM-SIAM Sym-
national ACM SIGIR Conference, 1993, pp. 126–135.
posium on Discrete Algorithms, 1993, pp. 422–431. See also:
[11] R.O. Duda and P.E. Hart. Pattern Classification and Scene Theoretical Computer Science, 130 (1994), pp. 17–47.
Analysis. John Wiley & Sons, NY, 1973. [32] E. Omiecinski and P. Scheuermann. A Global Approach to
[12] B. Everitt. Cluster Analysis. Heinemann Educational, Lon- Record Clustering and File Organization. In Proceedings of
don, 1974. the Third BCS-ACM Symposium on Research and Develop-
[13] C. Faloutsos and D.W. Oard. A Survey of Information Re- ment in Information Retrieval, 1984, pp. 201–219.
trieval and Filtering Methods. Technical Report CS-TR-3514, [33] J. Pach and P.K. Agarwal. Combinatorial Geometry. John
Department of Computer Science, University of Maryland, Wiley & Sons, New York, NY, 1995.
College Park, 1995. [34] E. Rasmussen. Clustering Algorithms. Chapter 16 in:
[14] T. Feder and D.H. Greene. Optimal Algorithms for Approx- W. Frakes and R. Baeza-Yates, editors, Information Retrieval:
imate Clustering. In Proceedings of the Twentieth Annual Data Structures and Algorithms, Prentice-Hall, Englewood
Symposium on Theory of Computing, 1988, pp. 434–444. Cliffs, NJ, 1992.
[15] R.J. Fowler, M.S. Paterson, and S.L. Tanimoto. Optimal pack- [35] C.J. van Rijsbergen. Information Retrieval. Butterworths,
ing and covering in the plane are NP-complete. Information London, 1979.
Processing Letters, 12 (1981), pp. 133–137. [36] C. Rogers. A note on coverings. Mathematika, 4 (1957),
[16] W. Frakes and R. Baeza-Yates, editors. Information Retrieval: pp. 11–6.
Data Structures and Algorithms. Prentice-Hall, 1992. [37] G. Salton. Automatic Text Processing. Addison-Wesley,
[17] M.R. Garey and D.S. Johnson, Computers and Intractability: Reading, MA, 1989.
A Guide to the Theory of NP-Completeness, W.H. Freeman [38] G. Salton and M.J. McGill. Introduction to Modern Informa-
and Company, 1979. tion Retrieval. McGraw-Hill Book Company, New York, NY,
[18] M. Goemans and J. Kleinberg. An improved approximation 1983.
ratio for the minimum latency problem. In Proceedings of [39] P. Willett. Recent Trends in Hierarchical Document Cluster-
the Seventh ACM-SIAM Symposium on Discrete Algorithms, ing: A Critical Review. Information Processing & Manage-
1996, pp. 152–157. ment, 24 (1988), pp. 577–597.
[40] I.H. Witten, A. Moffat, and T.C. Bell. Managing Gigabytes:
[19] T.E. Gonzalez. Clustering to minimize the maximum inter-
Compressing and Indexing Documents and Images. Van Nos-
cluster distance. Theoretical Computer Science, 38 (1985),
trand Reinhold, New York, NY, 1994.
pp. 293–306.
[20] J.A. Hartigan. Clustering Algorithms. Wiley, New York, NY,
1975.
[21] D. Hochbaum. Various Notions of Approximations: Good,
Better, Best, and More. In: D.S. Hochbaum, editor, Approx-
imation Algorithms for NP-Hard Problems. PWS Publishing
Company, 1996.
[22] D.S. Hochbaum and W. Maas. Approximation Schemes
for Covering and Packing Problems in Image Processing and
VLSI. Journal of the ACM, 32 (1985), pp. 130–135.
[23] D.S. Hochbaum and D.B. Shmoys. A best possible heuristic
for the -center problem. Mathematics of Operations Re-
search, 10 (1985), pp. 180–184.
