Scatter/Gather: A Cluster-Based Approach To Browsing Large Document Collections

Douglass R. Cutting¹, David R. Karger¹,², Jan O. Pedersen¹, John W. Tukey¹,³
¹ Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA 94304
² Stanford University
³ Princeton University

Abstract

Document clustering has not been well received as an information retrieval tool. Objections to its use fall into two main categories: first, that clustering is too slow for large corpora (with running time often quadratic in the number of documents); and second, that clustering does not appreciably improve retrieval.

We argue that these problems arise only when clustering is used in an attempt to improve conventional search techniques. However, looking at clustering as an information access tool in its own right obviates these objections, and provides a powerful new access paradigm. We present a document browsing technique that employs document clustering as its primary operation. We also present fast (linear time) clustering algorithms which support this interactive browsing paradigm.

1 Introduction

Document clustering has been extensively investigated as a methodology for improving document search and retrieval (see [15] for an excellent review). The general assumption is that mutually similar documents will tend to be relevant to the same queries, and, hence, that automatic determination of groups of such documents can improve recall by effectively broadening a search request (see [11] for a discussion of the cluster hypothesis). Typically a fixed corpus of documents is clustered either into an exhaustive partition, disjoint or otherwise, or into a hierarchical tree structure (see, for example, [8, 13, 2]). In the case of a partition, queries are matched against clusters and the contents of the best scoring clusters are returned as a result, possibly sorted by score. In the case of a hierarchy, queries are processed downward, always taking the highest scoring branch, until some stopping condition is achieved. The subtree at that point is then returned as a result. Hybrid strategies are also available.

These strategies are essentially variations of near-neighbor search¹ where nearness is defined in terms of the pairwise document similarity measure used to generate the clustering. Indeed, cluster search techniques are typically compared to direct near-neighbor search [9], and are evaluated in terms of precision and recall. Various studies indicate that cluster search strategies are not markedly superior to near-neighbor search, and, in some situations, can be inferior (see, for example, [6, 12, 4]). Furthermore, document clustering algorithms are often slow, with quadratic running times. It is therefore unsurprising that cluster search, with its indifferent performance, has not gained wide popularity.

Document clustering has also been studied as a method for accelerating near-neighbor search, but the development of fast algorithms for near-neighbor search has decreased interest in this possibility [1].

In this paper, we take a new approach to document clustering. Rather than dismissing document clustering as a poor tool for enhancing near-neighbor search, we ask how clustering can be effective as an access method in its own right. We describe a document browsing method, called Scatter/Gather, which uses document clustering as its primitive operation. This technique is directed towards information access with non-specific goals and serves as a complement to more focused techniques.

To implement Scatter/Gather, fast document clustering is a necessity. We introduce two new near linear time clustering algorithms which experimentation has shown to be effective, and also discuss reasons for their effectiveness.
1 Also known as vector space or similarity search.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
15th Ann Int'l SIGIR '92/Denmark-6/92 © 1992 ACM 0-89791-523-2/92/0006/0318...$1.50

1.1 Browsing vs Search

The standard formulation of the information access problem presumes a query, the user's expression of an information need. The task is then to search a corpus for documents that match this need. However, it is not difficult to imagine a situation in which it is hard, if not impossible, to formulate such a query precisely. For example, the user may not be familiar with the vocabulary appropriate for describing a topic of interest, or may not wish to commit himself to a particular choice of words. Indeed, the user may not be looking for anything specific at all, but rather may wish to discover the general information content of the corpus. Access to a document collection in fact covers an entire spectrum: at one end is a narrowly specified search for a particular document, given something as specific as its title; at the other end is a browsing session with no well defined goal, satisfying a need to learn more about the document collection. It is common for a session to move across the spectrum, from browsing to search: the user starts with a partially defined goal which is refined as he finds out more about the document collection. Standard information access techniques tend to emphasize the search end of the spectrum. A glaring example of this emphasis is cluster search, where a technology capable of topic extraction, clustering, is submerged from view and used only to assist near-neighbor search.

We propose an alternative application for clustering in information access, taking our inspiration from the access methods typically provided with a conventional textbook. If one has a specific question in mind, and specific terms which define that question, one consults an index, which directs one to passages of interest. However, if one is simply interested in gaining an overview, or has a general question, one peruses the table of contents, which lays out the logical structure of the text. The table of contents gives a sense of what sorts of questions might be answered by a more intensive examination of the text, and may also lead to specific sections of interest. One can easily alternate between browsing the table of contents and searching the index.

By direct analogy, we propose an information access system with two components: our browsing method, Scatter/Gather, which uses a cluster-based, dynamic table-of-contents metaphor for navigating a collection of documents; and one or more word-based, directed text search methods, such as near-neighbor search or snippet search [7]. The browsing component describes groups of similar documents, one or more of which can be selected for further examination. This can be iterated until the user is directly viewing individual documents. Based on documents found in this process, or on terms used to describe document groups, the user may, at any time, switch to a more focused search method. In particular, we anticipate that the browsing tool will not necessarily be used to find particular documents, but may instead help the user formulate a search request, which will then be serviced by some other means. Scatter/Gather may also be used to organize the results of word-based queries that retrieve too many documents.

2 Scatter/Gather Browsing

In the basic iteration of the proposed browsing method, the user is presented with short summaries of a small number of document groups.

Initially the system scatters the collection into a small number of document groups, or clusters, and presents short summaries of them to the user. Based on these summaries, the user selects one or more of the groups for further study. The selected groups are gathered together to form a subcollection. The system then applies clustering again to scatter the new subcollection into a small number of document groups, which are again presented to the user. With each successive iteration the groups become smaller, and therefore more detailed. Ultimately, when the groups become small enough, this process bottoms out by enumerating individual documents.

2.1 An Illustration

We now describe a Scatter/Gather session, where the text collection consists of about 5000 articles posted to the New York Times News Service during the month of August 1990. This session is summarized in figure 1. Here, to simplify the figure, we manually assigned single-word labels based on the full cluster descriptions. The full session is provided as Appendix A.

Suppose the user wants to find out what happened that month. Several issues prevent the application of conventional search techniques:

• The information need is too vague to be described as a single topic.

• Even if a topic were available, the words used to describe it may not be known to the user.

• The words used to describe a topic may not be those used to discuss the topic, and may thus fail to appear in the articles of interest. For example, articles concerning international events need never use the words "international" or "event".

• Even if some words used in discussion of the topic were available, documents may fail to use precisely those words, e.g., synonyms may be used instead.

With Scatter/Gather, rather than being forced to provide terms, the user is presented with a set of clusters, an outline of the corpus. She need only select those clusters which seem potentially relevant to the topic of interest. In the example, the big stories of the month are immediately obvious from the initial scattering: Iraq invades Kuwait, and Germany considers reunification. This leads the user to focus on international issues: she selects the Kuwait, Germany, and oil clusters. These three clusters are gathered together.
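Figure 1: Illustration of Scatter/Gather. [Diagram: the New York Times News Service corpus for August 1990 is scattered into eight clusters (Education, Domestic, Iraq, Arts, Sports, Oil, Germany, Legal). Selected clusters are gathered into "International Stories" and scattered again (Deployment, Politics, Germany, Pakistan, Africa, Markets, Oil, Hostages). A further gather produces "Smaller International Stories", scattered into Trinidad, W. Africa, S. Africa, Security, International, Lebanon, Pakistan, Japan.]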
This reduced corpus is then reclustered on the fly to produce eight new clusters covering the reduced corpus. Since the reduced corpus contains a subset of the articles, these new clusters reveal a finer level of detail than the original eight. The articles on the Iraqi invasion and some of the Oil articles have now been separated into clusters discussing the U.S. military deployment, the effects of the invasion upon the oil market, and one which the user deduces is about hostages in Kuwait.

The user feels her understanding of these large stories is adequate, but wishes to find out what happened in other corners of the world. She selects the Pakistan cluster, which also contains other foreign political stories, and a cluster containing articles about Africa. This reveals a number of specific international situations as well as a small collection of miscellaneous international stories. The user thus learns of a coup in Pakistan, and about hostages being taken in Trinidad, stories otherwise lost among the major stories of that month.

2.2 Requirements

Scatter/Gather depends on the existence of two facilities. First, since clustering and reclustering is an essential part of the basic iteration, we need an algorithm which can appropriately cluster a large number of documents within a time tolerable for user interaction (e.g., less than a minute). Second, given a group of documents, some method for automatically summarizing that group must be specified. This cluster description must be sufficiently revealing to give the user a sense of the topic defined by the group, yet short enough for many descriptions to be appreciated simultaneously.

We present two algorithms which meet the first requirement. Buckshot is a fast clustering algorithm suitable for the online reclustering essential to Scatter/Gather. Fractionation is another, more careful, clustering algorithm whose greater accuracy makes it suitable for the static offline partitioning of the entire corpus which is presented first to the user. We also define the cluster digest, an easy to generate, concise description of a cluster suitable for Scatter/Gather.

3 Document Clustering

Before presenting our document clustering algorithms, we review the terminology established in prior work, and discuss why existing algorithms fail to meet our needs. Throughout this paper, n denotes the number of documents in the collection, and k denotes the desired number of clusters.

In order to cluster documents one must first establish a pairwise measure of document similarity and then define a method for using that measure to partition the collection into clusters of similar documents. Numerous document similarity measures have been proposed, all of which treat each document as a set of words, often with frequency information, and measure the degree of word overlap between documents [11]. The documents are typically represented by sparse vectors of length equal to the number of unique words (or types) in the corpus. Each component of the vector has a value reflecting the occurrence of the corresponding word in the document. We may use a binary scale, in which the value is one or zero, to represent the presence or absence of the word, or the value might be some function of the word's frequency within the document. If a word does not occur in a document, its value is zero. A popular similarity measure, the cosine measure, computes the cosine of the angle between these two sparse vectors. If both document vectors are normalized to unit length, this is, of course, simply the inner product of the two vectors. Other measures include the Dice and Jaccard coefficients, which are normalized word overlap counts. Willett [15] has suggested that the choice of similarity measure has less qualitative impact on clustering results than the choice of clustering algorithm.

Two different types of clusters can be constructed. One is a flat partition of the documents into a collection of subsets. The other is a hierarchical cluster, which can be defined recursively as either an individual document or a partition of the corpus into sets, each of which is then hierarchically clustered. A hierarchical clustering defines a tree, called a dendrogram, on the documents.

Numerous clustering algorithms have been applied to build hierarchical document clusters, including, most prominently, single-linkage hierarchical clustering [5, 2]. These algorithms proceed by iteratively considering all pairs of clusters built so far, and fusing the pair which exhibits the greatest similarity into a single document group (which then becomes a node of the dendrogram). They differ in the procedure used to compute similarity when one of the pair is the product of a previous fusion. Single-linkage clustering defines the similarity as the maximum similarity between any two individuals, one from each of the two groups. Alternative methods consider the minimum similarity (complete-linkage), the average similarity (group-average linkage), as well as other aggregate measures. Although single-linkage clustering is known to have an undesirable chaining behavior, typically forming elongated straggly clusters, it remains popular due to its simplicity and the availability of an optimal space and time algorithm for its computation [10].

These algorithms share certain common characteristics. They are agglomerative, in that they proceed by iteratively choosing two document groups to agglomerate into a single document group. They are greedy in that the pair of document groups chosen for agglomeration is the pair which is considered best or most similar under some criterion. Lastly, they are global in that all pairs of inter-group similarities are considered in the course of selecting an agglomeration. Global algorithms have running times which are intrinsically Ω(n²),² because all pairs of similarities must be considered. This sharply limits their usefulness, even given algorithms that attain the theoretical lower bound on performance.

Partitional strategies, those that strive for a flat decomposition of the collection into sets of documents rather than a hierarchy of nested partitions, have also been studied [8, 13]. Some of these algorithms are global in nature and thus have the same slow performance as the above mentioned greedy, global, agglomerative algorithms. Other partitional algorithms, by contrast, typically have rectangular running times, i.e., O(kn). Generally, these algorithms proceed by choosing, in some manner, a number of seeds equal to the desired size (number of sets) of the final partition. Each document in the collection is then assigned to the closest seed. As a refinement, the procedure can be iterated, with, at each stage, a hopefully improved selection of cluster seeds.

It is noteworthy that any partitional clustering algorithm can be transformed into a hierarchical clustering algorithm by recursively partitioning each of the clusters found in an application of the partitioning algorithm.

One application of a partitional clustering has been to improve the performance of near-neighbor search by including, with each document, some closely related documents that might otherwise be missed. However, to be useful for near-neighbor search, the partition must be fairly fine, since it is desirable for each set to only contain a few documents. For example, Willett generates a partition whose size is related to the number of unique words in the document collection [13]. From this perspective, the potential computational benefits of a seed-based strategy are largely obviated by the large size (relative to the number of documents) of the required partition. For this reason partitional strategies have not been aggressively pursued by the information retrieval community.

We present two partitioning algorithms which use techniques drawn from the hierarchical algorithms, but which achieve rectangular time bounds. For our application, the number of clusters desired is small and thus the speedup over quadratic time algorithms is substantial.

2 Willett [14] discusses an inverted file approach which can ameliorate this quadratic behavior when a large number of small clusters are desired. Unfortunately, when clusters are large enough to contain a large proportion of the terms in the corpus, this approach yields less improvement.
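As a minimal sketch of the ideas in Section 3, the following Python fragment computes the cosine measure on sparse word-count vectors and performs one seed-based assignment pass, which costs k similarity computations per document and hence O(kn) overall. The helper names (`to_vector`, `cosine`, `assign_to_seeds`), the whitespace tokenization, and the toy documents are illustrative assumptions, not part of the paper.

```python
import math
from collections import Counter

def to_vector(text):
    """Sparse word-count vector: only words that occur are stored."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dicts)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(weight * v.get(word, 0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def assign_to_seeds(docs, seeds):
    """One pass of seed-based partitioning: k comparisons per document."""
    groups = [[] for _ in seeds]
    for i, doc in enumerate(docs):
        sims = [cosine(doc, seed) for seed in seeds]
        groups[sims.index(max(sims))].append(i)  # ties go to the lowest index
    return groups

# Hypothetical toy corpus; the first and third documents act as seeds.
docs = [to_vector(t) for t in [
    "iraq invades kuwait",
    "kuwait oil prices rise",
    "germany considers reunification",
    "east germany west germany talks",
]]
seeds = [docs[0], docs[2]]
groups = assign_to_seeds(docs, seeds)
```

Here `groups` holds document indices per seed. A global agglomerative method would instead examine all pairs of documents, which is the quadratic cost the section describes.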
4 Definitions

For each document $\alpha$ in a collection (or corpus) $C$, let the countfile $c(\alpha)$ be the set of words, with their frequencies, that occur in that document.³ Let $V$ be the set of unique words occurring in $C$. Then $c(\alpha)$ can be represented as a vector of length $|V|$:

$$c(\alpha) = \{f(w_i, \alpha)\}_{i=1}^{|V|}$$

where $w_i$ is the $i$th word in $V$ and $f(w_i, \alpha)$ is the frequency of $w_i$ in $\alpha$.

To measure the similarity between pairs of documents, $\alpha$ and $\beta$, let us employ the cosine between monotone element-wise functions of $c(\alpha)$ and $c(\beta)$. In particular, let

$$s(\alpha, \beta) = \frac{\langle g(c(\alpha)), g(c(\beta)) \rangle}{\|g(c(\alpha))\| \, \|g(c(\beta))\|}$$

where $g$ is a monotone damping function, $\langle \cdot, \cdot \rangle$ denotes inner product, and $\|\cdot\|$ denotes vector norm. It has been our experience that taking $g$ to be component-wise square-root produces better results than the traditional component-wise logarithm.

It is useful to consider similarity to be a function of document profiles $p(\alpha)$, where

$$p(\alpha) = \frac{g(c(\alpha))}{\|g(c(\alpha))\|}$$

in which case

$$s(\alpha, \beta) = \langle p(\alpha), p(\beta) \rangle = \sum_{i=1}^{|V|} p(\alpha)_i \, p(\beta)_i .$$

Suppose $\Gamma$ is a set of documents, or a document group. A simple profile can be associated with $\Gamma$ by defining it to be the normalized sum of profiles of the contained individuals. Let

$$\hat{p}(\Gamma) = \sum_{\alpha \in \Gamma} p(\alpha)$$

be the unnormalized sum profile, and then

$$p(\Gamma) = \frac{\hat{p}(\Gamma)}{\|\hat{p}(\Gamma)\|} .$$

Similarly, the cosine measure can be extended to $\Gamma$ by employing this profile definition:

$$s(\Gamma, x) = \langle p(\Gamma), p(x) \rangle .$$

Sometimes the normalized sum profile is not a good measure of the contents of a document group, because it takes into account documents which lie on the outskirts of the group. To solve this problem, we define the trimmed sum profile $p_m(\Gamma)$ for any cluster $\Gamma$ by considering only the $m$ most central documents of the cluster. For every $\alpha$ in $\Gamma$ compute $s(\alpha, \Gamma)$. Let $r_m(\Gamma)$ be the $m$ documents most central to $\Gamma$, namely those whose similarity to $\Gamma$ is largest. Then define

$$\hat{p}_m(\Gamma) = \sum_{\alpha \in r_m(\Gamma)} p(\alpha) \qquad \text{and} \qquad p_m(\Gamma) = \frac{\hat{p}_m(\Gamma)}{\|\hat{p}_m(\Gamma)\|} .$$

This computation can be completed in time proportional to $|\Gamma|$.⁴ The trimming parameter $m$ may be defined adaptively as some percentage of $|\Gamma|$, or may be fixed.

4.1 Cluster Digest

Another description of a cluster, in some sense dual to the trimmed sum profile, is the set of central words (or topical words). Rather than considering the most central documents, we consider those words which appear most frequently in the group as a whole. We thus define $tw(\Gamma)$, the topical words of $\Gamma$, to be the $w$ highest weighted terms in $p(\Gamma)$ (or perhaps in $p_m(\Gamma)$).

Taken together, the two sets $(r_m(\Gamma), tw(\Gamma))$ form the cluster digest of $\Gamma$, with parameters $(m, w)$, a short description of the contents of the cluster. The cluster digest can easily be computed in time $O(|\Gamma| + |V|)$, and is in fact the summary used to describe a cluster to a user of Scatter/Gather.

3 Throughout this paper, lower case Greek letters will be used to denote individual documents, upper case Greek letters will denote sets of documents (document groups), and upper case Roman letters will denote sets of document groups.

4 A full sort of the similarities is not required.

5 Partitional Clustering

Seed-based partitional clustering algorithms have three phases:

1 Find k centers.
2 Assign each document in the collection to a center.
3 Refine the partition so constructed.

The result is a set $P$ of $k$ disjoint document groups such that $\bigcup_{\Pi \in P} \Pi = C$.

The Buckshot and Fractionation algorithms are both designed to find the initial centers. They can be thought of as rough clustering algorithms, however their output is only used to define centers.

Both algorithms assume the existence of some algorithm which clusters well, but which may run slowly. Let us call this procedure the cluster subroutine. For our purposes, we use group average agglomerative clustering for this subroutine (see appendix B). Each algorithm uses this cluster subroutine locally over small sets, and builds on its results to find the k centers.
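The definitions of Section 4 can be illustrated with a short Python sketch: document profiles with square-root damping, the normalized sum profile, and a cluster digest made of the m most central documents and the w topical words of the trimmed sum profile. The function names and toy texts are assumptions made for illustration, documents are assumed non-empty, and a full sort is used for brevity even though the paper notes that one is not required.

```python
import math
from collections import Counter

def profile(counts):
    """p(alpha) = g(c(alpha)) / ||g(c(alpha))|| with g = component-wise sqrt."""
    damped = {w: math.sqrt(f) for w, f in counts.items()}
    norm = math.sqrt(sum(x * x for x in damped.values()))
    return {w: x / norm for w, x in damped.items()}

def dot(p, q):
    """Inner product of two sparse vectors."""
    return sum(x * q.get(w, 0.0) for w, x in p.items())

def sum_profile(profiles):
    """Normalized sum profile p(Gamma) of a document group."""
    total = Counter()
    for p in profiles:
        for w, x in p.items():
            total[w] += x
    norm = math.sqrt(sum(x * x for x in total.values()))
    return {w: x / norm for w, x in total.items()}

def cluster_digest(profiles, m, w):
    """Return (indices of the m most central documents, w topical words)."""
    center = sum_profile(profiles)
    ranked = sorted(range(len(profiles)),
                    key=lambda i: dot(profiles[i], center), reverse=True)
    central = ranked[:m]                                    # r_m(Gamma)
    trimmed = sum_profile([profiles[i] for i in central])   # p_m(Gamma)
    topical = sorted(trimmed, key=trimmed.get, reverse=True)[:w]
    return central, topical

# Hypothetical group: three oil stories and one outlier.
texts = ["oil oil prices kuwait", "oil market kuwait",
         "oil prices opec", "soccer match"]
group = [profile(Counter(t.split())) for t in texts]
central, topical = cluster_digest(group, m=2, w=2)
```

In this toy group the outlying sports document is trimmed away before the topical words are chosen, which is the motivation the text gives for the trimmed sum profile.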
Buckshot applies the cluster subroutine to a random sample to find centers. Fractionation uses successive application of the cluster subroutine over fixed sized groups to find centers. We believe Fractionation is the more accurate center finding procedure. However, Buckshot is significantly faster, and, hence, is more appropriate for the on-the-fly online reclustering required by iterations of Scatter/Gather. Fractionation can be used to establish the primary partitioning of the entire corpus, which is displayed in the first iteration of Scatter/Gather.

We implement Step 2 by assigning each document to the nearest center (in a sense to be defined later).

The refinement algorithms also reflect a time-accuracy tradeoff. Our simplest refinement procedure, iterated move-to-nearest, is fast but limited. A more comprehensive refinement is achieved through repeated application of procedures that attempt to Split, Join, and clarify elements of the partition $P$.

5.1 Finding Initial Centers

Buckshot

The idea of the Buckshot algorithm is quite simple. To achieve a rectangular time clustering algorithm, merely choose a small random sample of the documents (of size $\sqrt{kn}$), and apply the cluster subroutine. Return the centers of the clusters found. This algorithm clearly runs in time $O(kn)$.

Since random sampling is employed, the Buckshot algorithm is not deterministic. That is, repeated calls to this algorithm on the same corpus may produce different partitions, although in our experience repeated trials generally produce qualitatively similar partitions.

Fractionation

The Fractionation algorithm finds k centers by initially breaking the corpus $C$ into $n/m$ buckets of a fixed size $m > k$. The cluster subroutine is then applied to each of these buckets separately to agglomerate individuals into document groups such that the reduction in number (from individuals to groups in each bucket) is roughly a factor of $\rho$. These groups are now treated as if they were individuals, and the entire process is repeated. The iteration terminates when only k groups remain. Fractionation can be viewed as building a $1/\rho$ branching tree bottom up, where the leaves are individual documents, terminating when only k roots remain.

Suppose the individuals in $C$ are enumerated, so that $C = \alpha_1, \alpha_2, \ldots, \alpha_n$. This ordering could reflect an extrinsic ordering on $C$, but a better procedure sorts $C$ based on a key which is the word index of the $j$th most common word in each individual. Typically $j$ is a small number, such as three, which favors medium frequency terms. This procedure thus encourages nearby individuals in the corpus ordering to have at least one word in common.

The initial bucketing creates a partition

$$B = \{\Theta_1, \Theta_2, \ldots, \Theta_{n/m}\}$$

such that

$$\Theta_i = \{\alpha_{m(i-1)+1}, \alpha_{m(i-1)+2}, \ldots, \alpha_{mi}\}.$$

Each $\Theta_i$ is then separately clustered (using the cluster subroutine) into $\rho m$ groups, where $\rho$ is the desired reduction factor. Note that each of these computations occurs in $m^2$ time, and, hence, all $n/m$ occur in $nm$ time. Each application of agglomerative clustering produces an associated partition $R_i = \{\Phi_{i,1}, \Phi_{i,2}, \ldots, \Phi_{i,\rho m}\}$. The union of the documents contained in these groups are then treated as individuals for the next iteration. That is, define

$$C' = \{\Phi_{i,j} : 1 \le i \le n/m, \; 1 \le j \le \rho m\}.$$

$C'$ inherits an enumeration order by taking the $\Phi_{i,j}$ in lexicographic order on $i$ and $j$. The $\rho n$ components of $C'$ are then broken into $\rho n / m$ buckets, which are further reduced to $\rho^2 n$ groups through separate agglomeration. The process is repeated with $C'$ replacing $C$, and terminates at iteration $j$ if $\rho^j n < k$. At this point one final application of the clustering subroutine agglomerates the remaining groups to a partition $P$ of size k.

To determine the running time, observe that iteration $j$ operates on $\rho^j n$ items; the overall running time is thus $O(nm(1 + \rho + \rho^2 + \cdots)) = O(mn)$. Thus if $m = O(k)$ this algorithm has rectangular time $O(kn)$.

5.2 Assigning Documents to Centers

Once k centers have been found, and suitable profiles defined for those centers, each document in $C$ must be assigned to one of those centers based on some criterion. The simplest algorithm, Assign-to-Nearest, assigns each document to the nearest center.

Let $G$ be a partition of the collection into k groups, and let $\Gamma_i$ be the $i$th group in $G$. Let $\alpha \in \Pi_i$ if $i$ maximizes $s(\alpha, \Gamma_i)$. Ties can be broken by assigning $\alpha$ to the group with lowest index. The set $P = \{\Pi_i\}$, $0 \le i < k$, is then the desired partition. $P$ can be efficiently computed by constructing an inverted map for the k centers $p_m(\Gamma_i)$, and for each $\alpha \in C$ simultaneously computing the similarity to all the centers. In any case, the cost of this procedure is proportional to $kn$.

5.3 Refinement

Given an initial clustering, it is now desirable to refine it into a better one. As with our initial clustering algorithms, there is a tradeoff between speed and accuracy.
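The center-finding and assignment phases of Sections 5.1 and 5.2 can be sketched as follows. This is a hedged illustration, not the authors' implementation: `agglomerate` is a deliberately naive stand-in for the cluster subroutine (it greedily merges the two groups whose normalized sum profiles are most similar, rather than the group average method of appendix B), and all function names and the toy corpus are assumptions.

```python
import math
import random

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v.values()))
    return {w: x / norm for w, x in v.items()}

def dot(p, q):
    return sum(x * q.get(w, 0.0) for w, x in p.items())

def add(p, q):
    out = dict(p)
    for w, x in q.items():
        out[w] = out.get(w, 0.0) + x
    return out

def agglomerate(profiles, k):
    """Stand-in for the slow cluster subroutine: greedy, global merging
    of the two groups whose normalized sum profiles are most similar."""
    sums = [dict(p) for p in profiles]  # unnormalized sum profile per group
    while len(sums) > k:
        best = None
        for i in range(len(sums)):
            for j in range(i + 1, len(sums)):
                s = dot(normalize(sums[i]), normalize(sums[j]))
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        sums[i] = add(sums[i], sums[j])
        del sums[j]
    return [normalize(s) for s in sums]

def buckshot_centers(profiles, k, rng):
    """Cluster a random sample of size about sqrt(k*n); return k centers."""
    n = len(profiles)
    size = min(n, max(k, math.ceil(math.sqrt(k * n))))
    sample = rng.sample(profiles, size)
    return agglomerate(sample, k)

def assign_to_nearest(profiles, centers):
    """Each document joins its most similar center (ties: lowest index)."""
    groups = [[] for _ in centers]
    for idx, p in enumerate(profiles):
        sims = [dot(p, c) for c in centers]
        groups[sims.index(max(sims))].append(idx)
    return groups

# Hypothetical corpus with two topics that share no vocabulary.
texts = ["iraq kuwait invasion", "kuwait iraq oil",
         "germany reunification talks", "germany berlin reunification"]
profiles = [normalize({w: 1.0 for w in t.split()}) for t in texts]
centers = buckshot_centers(profiles, k=2, rng=random.Random(0))
groups = assign_to_nearest(profiles, centers)
```

The quadratic subroutine only ever sees the sample, so its cost is on the order of kn, and the assignment pass adds k similarity computations per document. Passing a `random.Random` instance makes the sampling repeatable when that is wanted.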
The rates and
simplest poorly Join
process just defined
is simply discussed. clusters into which
to iterate The two Split well
the
Assign-tosepaparts
Join The purpose of the groups elements documents words may by their Join refinement digests. operator P that they However, Therefore groups Since, is to merge usefully have of be by definition, never their lists
Nearest
process merges
algorithm separated similar.
clusters
are too
document distinguished any two typical
in a partition cluster of P are disjoint, in common. well overlap. two between
are not will
Iterated The also From using each
Assign-to-Nearest procedure first of our mentioned refinement above can
Assign-to-Nearest be seen a given document This as the
topical
the criterion A will
algorithms. cluster centers assign new
of distinguishability T(r, where merge time which A) =
17 and
set of clusters, sum profiles to the process nearest
we generate above, center
ItU(r)
n tu(A)l of w most A) the words number In large is thus topical some for of words each words the for 17. We
W.
the trimmed
and we then so as to form indefinitely, few steps, number
t~ (I) r and
is the list A if I(I, the to then
clusters. it makes is typically
can be iterated gains only in the first a small fixed
though and hence of times.5
> p, for
P, 0 < P < cluster in the
its greatest iterated
Determining proportional pus, and
topical
takes corof and
we must
compute
k2 intersections corpora, O(kn).
to decide number
clusters is typically
to merge. less than of Join time
Split Split two divides new each document This can (without Buckshot J7~} and partition of the group !J in a partition with P into
words the
the number
of documents,
running
groups. clustering
be accomplished refinement) partition let
by applying C = r and the be a G provides
Buckshot two two P new
k = 2. The Let P={
resulting Fl, the

k
6
ment the
Application
procedures course The
to
Scatter/Gather
initial possible clustering complete and refinein clustering method. is comconsideration. one can use of of algo-
groups. Iz, ..., union G, = {17,,1, 1,,2} new Buckshot of 17,. The G, s: partition
Combinations algorithms. initial
of the various give several used We have partition by corpus
element is simply
two the in
of these
combinations
of implementing used the
Scatter/Gather Scatter/Gather under
P = UG,.
Each application of Buckshot requires time proportional to $|\Gamma|$. Hence, the overall computation can be performed in time proportional to N.

A modification of this procedure is to split only those groups that score poorly on some coherency criterion. One simple criterion is the cluster self similarity $s(\Gamma, \Gamma)$. This quantity is in fact proportional to the average similarity of a document in the cluster to the cluster centroid. We thus define:

$$A(\Gamma) = s(\Gamma, \Gamma)$$

Let $r(\Gamma, P)$ be the rank of $A(\Gamma)$ in the set $\{A(\Gamma_1), A(\Gamma_2), \ldots, A(\Gamma_k)\}$. The procedure would split only groups such that $r(\Gamma, P) < pk$ for some $p$, $0 < p \le 1$. This modification does not change the order of the algorithm, since the coherence criterion can be computed in time proportional to N.

When the corpus is available in advance, the initial partition is completely determined and can be computed offline. We can therefore use a slower clustering algorithm to improve accuracy. However, for corpora consisting of tens of thousands of documents, a quadratic time algorithm is likely to be too slow even for offline computation. We thus use the Fractionation algorithm to find initial centers, and then perform a great deal of refinement using the Split, Join, and Assign-to-Nearest operators. Note that the running time is O(kN) for each of the refinement procedures, and thus refinement does not affect the order of the overall running time.

In an interactive session, however, it is vital to run as quickly as possible, even at the expense of some accuracy. We therefore use the Buckshot algorithm for finding centers quickly, and follow it with a bare minimum of refinement. We have found that two iterations of the Assign-to-Nearest procedure yield a reasonably accurate clustering, and that additional refinement produces diminishing returns.

The Buckshot procedure is not deterministic. However, in the contemplated application, Scatter/Gather, it is more important that the partition be computed at high speed than that the algorithm be deterministic. Indeed, the lack of determinism might be interpreted as a feature, since the user then has the option of discarding an unrevealing partition in favor of a fresh reclustering.

Footnote 5: Excessive iteration of the Assign-to-Nearest procedure may in fact worsen the partition rather than improving it, since "fuzzy" elongated clusters can pull documents away from other clusters and become even fuzzier.
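As an illustration of the rank-based split criterion, the following Python sketch selects which clusters of a partition would be split. It is not the authors' implementation: the function names, the toy vectors, the use of average cosine similarity to the normalized centroid as the coherence score, and the convention that rank 0 denotes the least coherent cluster are all assumptions made for this example.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def centroid(cluster):
    """Normalized sum of the unit-length document profiles in a cluster."""
    total = [sum(column) for column in zip(*cluster)]
    return normalize(total)

def coherence(cluster):
    """Assumed coherence score: average cosine similarity of each
    document to the cluster centroid (higher means more coherent)."""
    c = centroid(cluster)
    sims = [sum(a * b for a, b in zip(doc, c)) for doc in cluster]
    return sum(sims) / len(sims)

def clusters_to_split(partition, p):
    """Indices of clusters whose coherence rank r satisfies r < p * k.
    Rank 0 is assigned to the least coherent cluster (an assumption)."""
    k = len(partition)
    scores = [coherence(cluster) for cluster in partition]
    by_score = sorted(range(k), key=lambda i: scores[i])
    return sorted(i for rank, i in enumerate(by_score) if rank < p * k)

def unit_rows(rows):
    """Helper for the toy data: normalize every row."""
    return [normalize(r) for r in rows]

# Hypothetical toy partition with k = 4 clusters in two dimensions.
partition = [
    unit_rows([[1, 0], [1, 0.1], [0.9, 0]]),   # tight cluster
    unit_rows([[1, 0], [0, 1], [1, 1]]),       # loose cluster
    unit_rows([[0, 1], [0.1, 1]]),             # tight cluster
    unit_rows([[1, 0.5], [0.2, 1], [1, 0]]),   # fairly loose cluster
]

selected = clusters_to_split(partition, p=0.5)
print(selected)
```

With p = 0.5 and k = 4, only the two least coherent clusters are selected. Computing every score takes one pass over the documents, which matches the observation that the criterion costs time proportional to N.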
The overall complexity of both clustering procedures described in this section is clearly O(kN). The constant factor for the Buckshot-based procedure is small enough to permit interactive use with large document collections. The Fractionation-based procedure has a somewhat larger constant factor, but one which is still acceptable for offline applications.

6.1 Naturally Clustered Data

It is worth examining the performance of our algorithms when the data set has k natural clusters, i.e., when the smallest intra-cluster document similarity is larger than the largest inter-cluster document similarity. If the input consists of such well separated clusters, then both of our algorithms will find this partition.

For Buckshot, if we have a corpus containing k widely separated clusters of equal size, then so long as $n \gg k \ln k$, a random sample of size $\sqrt{kn}$ will with high probability select some documents from each of the clusters. This will certainly be true in our case, in which k = 20 or so.

To see this, compute the probability that, if we choose a sample of size s, we fail to get any individual from some cluster. The probability that any individual is a member of a given cluster is $1/k$, so the probability that none of the s individuals is a member of that cluster is $(1 - 1/k)^s$. The total probability of failure for some cluster is therefore at most k times this, namely $k(1 - 1/k)^s$. If we now take $s = a k \ln k$ for some a, then the failure probability is at most

$$k(1 - 1/k)^{a k \ln k} < k^{1-a}.$$

Thus in our case, with k = 20, taking a = 5 means that with 400 samples we will find all the clusters at least 999 times in 1000.

Thus we start with at least one element from each cluster. Since any pair of documents in the same cluster will be merged in preference to any other pair, each of the clusters found will be a subset of one of the actual clusters. Thus our resulting set of centers will include a center within each actual cluster.

For Fractionation, we need merely note that if we have more than k documents in a single bucket, some pair of them is necessarily in the same cluster. Therefore, no pair of documents in different clusters will ever be merged. Thus, when we finish, the clusters we have found will each be a subset of some one of the actual clusters.
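The sampling bound in this section can be checked numerically. The Python sketch below is only an illustration: the function names are hypothetical, and it assumes, as the argument does, k clusters of equal size and s samples that each fall in a given cluster independently with probability 1/k.

```python
import math

def failure_bound(k, s):
    """Union bound k * (1 - 1/k)**s on the probability that a sample of
    size s misses at least one of k equal-sized clusters."""
    return k * (1.0 - 1.0 / k) ** s

def simplified_bound(k, a):
    """The looser closed form k**(1 - a), which applies when the sample
    size is s = a * k * ln(k)."""
    return k ** (1.0 - a)

# Values taken from the text: k = 20 clusters and a = 5.
k, a = 20, 5
s = math.ceil(a * k * math.log(k))  # smallest integer sample size >= a*k*ln(k)

print("sample size:", s)
print("union bound:", failure_bound(k, s))
print("closed form:", simplified_bound(k, a))
```

The union bound never exceeds the closed form here because $(1 - 1/k)^k < 1/e$, which is the step that turns $k(1 - 1/k)^{a k \ln k}$ into $k^{1-a}$.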
Scatter/Gather Session

In figures 2 through 5, we present the output of the Scatter/Gather session described in section 2.1. The corpus is the set of articles distributed by the New York Times News Service during the month of August 1990. This consists of roughly 30 megabytes of ASCII text in about 5000 articles. Some articles are repeated due to updates of news stories. Here the goal is to learn about international political events during this month.

To create the initial partition, the Buckshot clustering algorithm is applied (figure 2). Fractionation is recommended for this task, time permitting. Each cluster is described with a two line display, its cluster digest. The first line contains the number of documents in the cluster, and words that are frequent near the cluster center. The second line contains the titles of documents near the centroid.

We select clusters 2 (Iraq's invasion of Kuwait), 5 (Markets, including oil) and 6 (Germany, and probably other international issues) as those which seem likely to contain articles of interest, recluster, and display a new cluster digest (figure 3). Next, we iterate, this time selecting clusters 3 (Pakistan, and probably other international issues) and 4 (African issues). In figure 4, specific incidents have been separated out: we find hostages in Trinidad, war in Liberia, police action in South Africa, and so on. We obtain more detail about the situation in Liberia by viewing the titles of the articles contained in that cluster (figure 5).

Conclusion

Scatter/Gather demonstrates that document clustering can be an effective information access tool in its own right. The table-of-contents metaphor gives the method an intuitive basis, and experience has shown that it is indeed easy to use. Scatter/Gather is particularly helpful in situations in which it is difficult or undesirable to specify a query formally. Claims of improved performance must await evaluation metrics appropriate to the vaguely defined information access goals in which Scatter/Gather excels.

To support Scatter/Gather, fast clustering algorithms are essential. Clustering can be done quickly by working in a local manner on small groups of documents rather than trying to deal with the entire corpus globally. For extremely large corpora, even the linear time clustering achieved by Buckshot or Fractionation may be too slow. We are working to develop variations on Scatter/Gather which will scale to arbitrarily large corpora, under the assumption that linear time preprocessing will always be feasible.

Clearly, the accuracy of the Buckshot and Fractionation algorithms is affected by the quality of the clustering provided by the slow cluster subroutine. This provides further motivation to find highly accurate clustering algorithms, whatever their running time may be.
Figure 2: Initial Scattering. [Session transcript, garbled in extraction. The command (time (setq first (outline (all-dots tdb)))) scatters 4970 items into eight clusters, numbered 0 through 7, of sizes 287, 1731, 749, 275, 481, 844, 310 and 293. Each cluster is shown with its frequent words and sample titles. Reported real time: 131258 msec.]
Figure 3: Second Scatter. [Session transcript, garbled in extraction. The command (time (setq second (outline first 2 5 6))) reclusters the 1903 items of clusters 2, 5 and 6 into eight clusters, numbered 0 through 7, of sizes 650, 66, 57, 117, 59, 242, 586 and 126. Reported real time: 54184 msec.]
Figure 4: Third Scatter. [Session transcript, garbled in extraction. The command (time (setq third (outline second 3 4))) reclusters the 176 items of clusters 3 and 4 into eight clusters, numbered 0 through 7, of sizes 5, 16, 28, 1, 51, 7, 55 and 13. Reported real time: 11140 msec.]
Figure 5: Titles of articles in topic 1 from Figure 4. [Session transcript, garbled in extraction. The command (print-titles (nth 1 third)) lists article numbers and titles for the Liberia cluster, beginning with 3720 REBEL LEADER SEIZES ABOUT A DOZEN FOREIGNERS and including WEST AFRICAN FORCE SENT TO LIBERIA, WAR THREATENS TO WIDEN AS NEIGHBORING COUNTRIES TAKE SIDES, and OUSTER OF LIBERIAN PRESIDENT NOW SEEMS INEVITABLE.]
B Group Average Clustering

Here we present a quadratic time greedy global agglomerative clustering algorithm which has given good results in our implementation. This algorithm is similar to the one presented in [3].

Let $\Gamma$ be a document group. The average similarity between any two documents in $\Gamma$ is defined to be

$$S(\Gamma) = \frac{1}{|\Gamma|(|\Gamma| - 1)} \sum_{\alpha \in \Gamma} \sum_{\beta \in \Gamma,\ \beta \ne \alpha} s(\alpha, \beta).$$

Let G be a set of disjoint document groups. The basic iteration of group average agglomerative clustering finds the two different clusters $\Gamma$ and $\Delta$ which maximize $S(\Gamma \cup \Delta)$ over all choices from G. A new, smaller, partition $G'$ is then constructed by merging $\Gamma$ with $\Delta$:

$$G' = (G - \{\Gamma, \Delta\}) \cup \{\Gamma \cup \Delta\}.$$

Initially, G is simply a set of singleton groups, one for each individual to be clustered. The iteration terminates when $|G'| = k$. Note that the output from this procedure is the final flat partition, rather than a nested hierarchy of partitions, although the latter could be computed by recording each pairwise join as one level in a dendrogram.

If we employ the cosine similarity measure, the inner maximization can be significantly accelerated. Recall that $\hat{p}(\Gamma)$ is the unnormalized sum profile associated with $\Gamma$. Then the average pairwise similarity, $S(\Gamma)$, is simply related to the inner product, $\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle$. That is, since

$$\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle = \sum_{\alpha \in \Gamma} \sum_{\beta \in \Gamma} \langle p(\alpha), p(\beta) \rangle = |\Gamma|(|\Gamma| - 1) S(\Gamma) + \sum_{\alpha \in \Gamma} \langle p(\alpha), p(\alpha) \rangle = |\Gamma|(|\Gamma| - 1) S(\Gamma) + |\Gamma|,$$

we have

$$S(\Gamma) = \frac{\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle - |\Gamma|}{|\Gamma|(|\Gamma| - 1)}.$$

Similarly, for the union of two disjoint groups, $\Lambda = \Gamma \cup \Delta$,

$$S(\Lambda) = \frac{\langle \hat{p}(\Lambda), \hat{p}(\Lambda) \rangle - (|\Gamma| + |\Delta|)}{(|\Gamma| + |\Delta|)((|\Gamma| + |\Delta|) - 1)},$$

where

$$\langle \hat{p}(\Lambda), \hat{p}(\Lambda) \rangle = \langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle + 2 \langle \hat{p}(\Gamma), \hat{p}(\Delta) \rangle + \langle \hat{p}(\Delta), \hat{p}(\Delta) \rangle.$$

Therefore, if for every $\Gamma \in G$, $S(\Gamma)$ and $\hat{p}(\Gamma)$ are known, the pairwise merge that will produce the least decrease in average similarity can be cheaply computed, and these quantities can be updated each time a merge is performed. Further, suppose for every $\Gamma \in G$ a $\Delta$ were known such that

$$S(\Gamma \cup \Delta) = \max_{\Lambda \in G,\ \Lambda \ne \Gamma} S(\Gamma \cup \Lambda);$$

then finding the best pair would simply involve scanning the $|G|$ candidates. Updating these quantities with each iteration is straightforward, since only those involving $\Gamma$ and $\Delta$ need be recomputed. Using techniques such as these, it can be seen that the average time complexity for truncated group average agglomerative clustering is $O(n^2)$, where n is equal to the number of individuals to be clustered.
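The relation between average pairwise similarity and the sum profile can be illustrated with a short Python sketch. It is a simplified illustration rather than the quadratic time algorithm of this appendix: it rescans all pairs of groups on every iteration and does not cache the best partner of each group. The function names and the toy vectors are hypothetical, and the profiles are assumed to be normalized to unit length so that cosine similarity is a plain inner product.

```python
import math
from itertools import combinations

def normalize(v):
    """Scale a nonzero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sum_profile(group):
    """Unnormalized sum of the unit-length profiles in a group."""
    return [sum(column) for column in zip(*group)]

def avg_sim_from_sum(p_hat, size):
    """S = (<p_hat, p_hat> - size) / (size * (size - 1)), for size >= 2."""
    return (dot(p_hat, p_hat) - size) / (size * (size - 1))

def avg_sim_brute(group):
    """Average cosine similarity over all unordered pairs, for comparison."""
    pairs = list(combinations(group, 2))
    return sum(dot(u, v) for u, v in pairs) / len(pairs)

def group_average_cluster(docs, k):
    """Greedy agglomeration: repeatedly merge the two groups whose union
    has the highest average pairwise similarity, until k groups remain.
    Each group is stored as (member indices, sum profile)."""
    groups = [([i], list(d)) for i, d in enumerate(docs)]
    while len(groups) > k:
        best = None
        for i, j in combinations(range(len(groups)), 2):
            merged = [a + b for a, b in zip(groups[i][1], groups[j][1])]
            size = len(groups[i][0]) + len(groups[j][0])
            score = avg_sim_from_sum(merged, size)
            if best is None or score > best[0]:
                best = (score, i, j, merged)
        _, i, j, merged = best
        new_group = (groups[i][0] + groups[j][0], merged)
        groups = [g for n, g in enumerate(groups) if n not in (i, j)]
        groups.append(new_group)
    return sorted(sorted(members) for members, _ in groups)

# Hypothetical toy corpus: five documents in three dimensions.
docs = [normalize(v) for v in
        [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.2], [0, 0, 1]]]

group = docs[:3]
print(avg_sim_from_sum(sum_profile(group), len(group)), avg_sim_brute(group))
print(group_average_cluster(docs, 3))
```

The first line printed shows that the sum-profile formula and the brute-force average agree up to rounding. Because the score of a candidate merge depends only on the two sum profiles and the two group sizes, no pairwise document similarities need to be stored.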
References

[1] Chris Buckley and Alan F. Lewit. Optimizations of inverted vector searches. In Proceedings of the Eighth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 97-110, 1985.

[2] W.B. Croft. Clustering large files of documents using the single-link method. Journal of the American Society for Information Science, 28:341-344, 1977.

[3] A. El-Hamdouchi and P. Willett. Hierarchical document clustering using Ward's method. In Proceedings of the Ninth International Conference on Research and Development in Information Retrieval, pages 149-156, 1986.

[4] A. Griffiths, H.C. Luckhurst, and P. Willett. Using inter-document similarity information in document retrieval systems. Journal of the American Society for Information Science, 37:3-11, 1986.

[5] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, N.J. 07632, 1988.

[6] N. Jardine and C.J. van Rijsbergen. The use of hierarchical clustering in information retrieval. Information Storage and Retrieval, 7:217-240, 1971.

[7] J. O. Pedersen, D. R. Cutting, and J. W. Tukey. Snippet search: a single phrase approach to text access. In Proceedings of the 1991 Joint Statistical Meetings. American Statistical Association, 1991. Also available as Xerox PARC technical report SSL-91-08.

[8] G. Salton. The SMART Retrieval System. Prentice-Hall, Englewood Cliffs, N.J., 1971.

[9] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

[10] R. Sibson. SLINK: an optimally efficient algorithm for the single link cluster method. Computer Journal, 16:30-34, 1973.

[11] C.J. van Rijsbergen. Information Retrieval. Butterworths, London, second edition, 1979.

[12] C.J. van Rijsbergen and W.B. Croft. Document clustering: An evaluation of some experiments with the Cranfield 1400 collection. Information Processing & Management, 11:171-182, 1975.

[13] P. Willett. Document clustering using an inverted file approach. Journal of Information Science, 2:223-231, 1980.

[14] P. Willett. A fast procedure for the calculation of similarity coefficients in automatic classification. Information Processing and Management, 17:53-60, 1981.

[15] P. Willett. Recent trends in hierarchical document clustering: A critical review. Information Processing & Management, 24(5):577-597, 1988.