Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections

Douglass R. Cutting¹, David R. Karger¹,², Jan O. Pedersen¹, John W. Tukey¹,³

¹ Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA 94304
² Stanford University
³ Princeton University

Abstract
Document clustering has not been well received as an information retrieval tool. Objections to its use fall into two main categories: first, that clustering is too slow for large corpora (with running time often quadratic in the number of documents); and second, that clustering does not appreciably improve retrieval.

We argue that these problems arise only when clustering is used in an attempt to improve conventional search techniques. However, looking at clustering as an information access tool in its own right obviates these objections, and provides a powerful new access paradigm. We present a document browsing technique that employs document clustering as its primary operation. We also present fast (linear time) clustering algorithms which support this interactive browsing paradigm.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
15th Ann Int'l SIGIR '92/Denmark-6/92 © 1992 ACM 0-89791-523-2/92/0006/0318...$1.50

1 Introduction
Document clustering has been extensively investigated as a methodology for improving document search and retrieval (see [15] for an excellent review). The general assumption is that mutually similar documents will tend to be relevant to the same queries, and, hence, that automatic determination of groups of such documents can improve recall by effectively broadening a search request (see [11] for a discussion of the cluster hypothesis). Typically a fixed corpus of documents is clustered either into an exhaustive partition, disjoint or otherwise, or into a hierarchical tree structure (see, for example, [8, 13, 2]). In the case of a partition, queries are matched against clusters and the contents of the best scoring clusters are returned as a result, possibly sorted by score. In the case of a hierarchy, queries are processed downward, always taking the highest scoring branch, until some stopping condition is achieved. The subtree at that point is then returned as a result. Hybrid strategies are also available.

These strategies are essentially variations of near-neighbor search¹ where nearness is defined in terms of the pairwise document similarity measure used to generate the clustering. Indeed, cluster search techniques are typically compared to direct near-neighbor search [9], and are evaluated in terms of precision and recall. Various studies indicate that cluster search strategies are not markedly superior to near-neighbor search, and, in some situations, can be inferior (see, for example, [6, 12, 4]). Furthermore, document clustering algorithms are often slow, with quadratic running times. It is therefore unsurprising that cluster search, with its indifferent performance, has not gained wide popularity.

Document clustering has also been studied as a method for accelerating near-neighbor search, but the development of fast algorithms for near-neighbor search has decreased interest in this possibility [1].

In this paper, we take a new approach to document clustering. Rather than dismissing document clustering as a poor tool for enhancing near-neighbor search, we ask how clustering can be effective as an access method in its own right. We describe a document browsing method, called Scatter/Gather, which uses document clustering as its primitive operation. This technique is directed towards information access with non-specific goals and serves as a complement to more focused techniques.

To implement Scatter/Gather, fast document clustering is a necessity. We introduce two new near linear time clustering algorithms which experimentation has shown to be effective, and also discuss reasons for their effectiveness.

¹ Also known as vector space or similarity search.

1.1 Browsing vs Search
The standard formulation of the information access problem presumes a query, the user's expression of an information need. The task is then to search a corpus for documents that match this need. However, it is not difficult to imagine a situation in which it is hard, if not impossible, to formulate such a query precisely. For example, the
user may not be familiar with the vocabulary appropriate for describing a topic of interest, or may not wish to commit himself to a particular choice of words. Indeed, the user may not be looking for anything specific at all, but rather may wish to discover the general information content of the corpus. Access to a document collection in fact covers an entire spectrum: at one end is a narrowly specified search for a particular document, given something as specific as its title; at the other end is a browsing session with no well defined goal, satisfying a need to learn more about the document collection. It is common for a session to move across the spectrum, from browsing to search: the user starts with a partially defined goal which is refined as he finds out more about the document collection. Standard information access techniques tend to emphasize the search end of the spectrum. A glaring example of this emphasis is cluster search, where a technology capable of topic extraction, clustering, is submerged from view and used only to assist near-neighbor search.

We propose an alternative application for clustering in information access, taking our inspiration from the access methods typically provided with a conventional textbook. If one has a specific question in mind, and specific terms which define that question, one consults an index, which directs one to passages of interest. However, if one is simply interested in gaining an overview, or has a general question, one peruses the table of contents, which lays out the logical structure of the text. The table of contents gives a sense of what sorts of questions might be answered by a more intensive examination of the text, and may also lead to specific sections of interest. One can easily alternate between browsing the table of contents and searching the index.

By direct analogy, we propose an information access system with two components: our browsing method, Scatter/Gather, which uses a cluster-based, dynamic table-of-contents metaphor for navigating a collection of documents; and one or more word-based, directed text search methods, such as near-neighbor search or snippet search [7]. The browsing component describes groups of similar documents, one or more of which can be selected for further examination. This can be iterated until the user is directly viewing individual documents. Based on documents found in this process, or on terms used to describe document groups, the user may, at any time, switch to a more focused search method. In particular, we anticipate that the browsing tool will not necessarily be used to find particular documents, but may instead help the user formulate a search request, which will then be serviced by some other means. Scatter/Gather may also be used to organize the results of word-based queries that retrieve too many documents.

2 Scatter/Gather Browsing
In the basic iteration of the proposed browsing method, the user is presented with short summaries of a small number of document groups.

Initially the system scatters the collection into a small number of document groups, or clusters, and presents short summaries of them to the user. Based on these summaries, the user selects one or more of the groups for further study. The selected groups are gathered together to form a subcollection. The system then applies clustering again to scatter the new subcollection into a small number of document groups, which are again presented to the user. With each successive iteration the groups become smaller, and therefore more detailed. Ultimately, when the groups become small enough, this process bottoms out by enumerating individual documents.

2.1 An Illustration
We now describe a Scatter/Gather session, where the text collection consists of about 5000 articles posted to the New York Times News Service during the month of August 1990. This session is summarized in figure 1. Here, to simplify the figure, we manually assigned single-word labels based on the full cluster descriptions. The full session is provided as Appendix A.

Suppose the user wants to find out what happened that month. Several issues prevent the application of conventional search techniques:

• The information need is too vague to be described as a single topic.
• Even if a topic were available, the words used to describe it may not be known to the user.
• The words used to describe a topic may not be those used to discuss the topic and may thus fail to appear in the articles of interest. For example, articles concerning international events need never use the word "international".
• Even if some words used in discussion of the topic were available, documents may fail to use precisely those words, e.g., synonyms may be used instead.

With Scatter/Gather, rather than being forced to provide terms, the user is presented with a set of clusters, an outline of the corpus. She need only select those clusters which seem potentially relevant to the topic of interest. In the example, the big stories of the month are immediately obvious from the initial scattering: Iraq invades Kuwait, and Germany considers reunification. This leads the user to focus on international issues: she selects the Kuwait, Germany and oil clusters. These three clusters are gathered together.

[Figure 1: Illustration of Scatter/Gather. The New York Times News Service corpus, August 1990, is scattered into eight clusters (Education, Domestic, Iraq, Arts, Sports, Oil, Germany, Legal). A gather step forms "International Stories", which is scattered into Deployment, Politics, Germany, Pakistan, Africa, Markets, Oil, Hostages. A second gather step forms "Smaller International Stories", which is scattered into Trinidad, W. Africa, S. Africa, Security, International, Lebanon, Pakistan, Japan.]

This reduced corpus is then reclustered on the fly to produce eight new clusters covering the reduced corpus. Since the reduced corpus contains a subset of the articles, these new clusters reveal a finer level of detail than the original eight. The articles on the Iraqi invasion and some of the Oil articles have now been separated into clusters discussing the U.S. military deployment, the effects of the invasion upon the oil market, and one which is about hostages in Kuwait.

The user feels her understanding of these large stories is adequate, but wishes to find out what happened in other corners of the world. She selects the Pakistan cluster, which also contains other foreign political stories, and a cluster containing articles about Africa. This reveals a number of specific international situations as well as a small collection of miscellaneous international stories. The user thus learns of a coup in Pakistan, and about hostages being taken in Trinidad, stories otherwise lost among the major stories of that month.

2.2 Requirements
Scatter/Gather depends on the existence of two facilities. First, since clustering and reclustering is an essential part of the basic iteration, we need an algorithm which can appropriately cluster a large number of documents within a time tolerable for user interaction (e.g., less than a minute). Second, given a group of documents, some method for automatically summarizing that group must be specified. This cluster description must be sufficiently revealing to give the user a sense of the topic defined by the group, yet short enough for many descriptions to be appreciated simultaneously.

We present two algorithms which meet the first requirement. Buckshot is a fast clustering algorithm suitable for the online reclustering essential to Scatter/Gather. Fractionation is another, more careful, clustering algorithm whose greater accuracy makes it suitable for the static offline partitioning of the entire corpus which is presented first to the user. We also define the cluster digest, an easy to generate, concise description of a cluster suitable for Scatter/Gather.

3 Document Clustering
Before presenting our document clustering algorithms, we review the terminology established in prior work, and discuss why existing algorithms fail to meet our needs. Throughout this paper, $n$ denotes the number of documents in the collection, and $k$ denotes the desired number of clusters.

In order to cluster documents one must first establish a pairwise measure of document similarity and then define a method for using that measure to partition the collection into clusters of similar documents. Numerous document similarity measures have been proposed, all of which treat each document as a
set of words, often with frequency information, and measure the degree of word overlap between documents [11]. The documents are typically represented by sparse vectors of length equal to the number of unique words (or types) in the corpus. Each component of the vector has a value reflecting the occurrence of the corresponding word in the document. We may use a binary scale, in which the value is one or zero, to represent the presence or absence of the word, or the value might be some function of the frequency of the word within the document. If a word does not occur in a document, its value is zero. A popular similarity measure, the cosine measure, computes the cosine of the angle between these two sparse vectors. If both document vectors are normalized to unit length, this is, of course, simply the inner product of the two vectors. Other measures include the Dice and Jaccard coefficients, which are normalized word overlap counts. Willett [15] has suggested that the choice of similarity measure has less qualitative impact on clustering results than the choice of clustering algorithm.

Two different types of document clusters can be constructed. One is a flat partition of the documents into a collection of subsets. The other is a hierarchical cluster, which can be defined recursively as either an individual document or a partition of the corpus into sets, each of which is then hierarchically clustered. A hierarchical clustering defines a tree, called a dendrogram, on the documents.

Numerous clustering algorithms have been applied to build hierarchical document clusters, including, most prominently, single-linkage hierarchical clustering [5, 2]. These algorithms proceed by iteratively considering all pairs of clusters built so far, and fusing the pair which exhibits the greatest similarity into a single document group (which then becomes a node of the dendrogram). They differ in the procedure used to compute similarity when one of the pair is the product of a previous fusion. Single-linkage clustering defines the similarity as the maximum similarity between any two individuals, one from each of the two groups. Alternative methods consider the minimum similarity (complete-linkage), the average similarity (group-average linkage), as well as other aggregate measures. Although single-linkage clustering is known to have an undesirable chaining behavior, typically forming elongated straggly clusters, it remains popular due to its simplicity and the availability of an optimal space and time algorithm for its computation [10].

These algorithms share certain common characteristics. They are agglomerative, in that they proceed by iteratively choosing two document groups to agglomerate into a single document group. They are greedy, in that the pair of document groups chosen for agglomeration is the pair which is considered best or most similar under some criterion. Lastly, they are global, in that all pairs of inter-group similarities are considered in the course of selecting an agglomeration. Global algorithms have running times which are intrinsically $\Omega(n^2)$,² because all pairs of similarities must be considered. This sharply limits their usefulness, even given algorithms that attain the theoretical quadratic lower bound on performance.

² Willett [14] discusses an inverted file approach which can ameliorate this quadratic behavior when a large number of small clusters are desired. Unfortunately, when clusters are large enough to contain a large proportion of the terms in the corpus, this approach yields less improvement.

Partitional strategies, those that strive for a flat decomposition of the collection into sets of documents rather than a hierarchy of nested partitions, have also been studied [8, 13]. Some of these algorithms are global in nature and thus have the same slow performance as the above mentioned greedy, global, agglomerative algorithms. Other partitional algorithms, by contrast, typically have rectangular running times, i.e., $O(kn)$. Generally, these algorithms proceed by choosing, in some manner, a number of seeds equal to the desired size (number of sets) of the final partition. Each document in the collection is then assigned to the closest seed. As a refinement, the procedure can be iterated, with, at each stage, a hopefully improved selection of cluster seeds. It is noteworthy that any partitional clustering algorithm can be transformed into a hierarchical clustering algorithm by recursively partitioning each of the clusters found in an application of the partitioning algorithm.

One application of a partitional clustering has been to improve the performance of near-neighbor search by including, with each document, some closely related documents that might otherwise have been missed. However, to be useful for near-neighbor search, the partition must be fairly fine, since it is desirable for each set to contain only a few documents. For example, Willett generates a partition whose size is related to the number of unique words in the document collection [13]. From this perspective, the potential computational benefits of a seed-based strategy are largely obviated by the large size (relative to the number of documents) of the required partition. For this reason partitional strategies have not been aggressively pursued by the information retrieval community.

We present two partitioning algorithms which use techniques drawn from the hierarchical algorithms, but which achieve rectangular time bounds. For our application, the number of desired clusters is small and thus the speedup over quadratic time algorithms is substantial.
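The linkage rules reviewed in this section can be sketched in a few lines. The following is a minimal illustration, not the implementation used in the paper: the names `linkage_similarity` and `agglomerate` are hypothetical, `sim` is any pairwise similarity function supplied by the caller, and "average" is taken here to mean the mean similarity over cross-group pairs. The loop is agglomerative, greedy and global in the sense described above, so it examines every pair of groups at every step and is at least quadratic.

```python
from itertools import combinations


def linkage_similarity(group_a, group_b, sim, method="single"):
    """Similarity between two groups under a chosen linkage rule."""
    pair_sims = [sim(a, b) for a in group_a for b in group_b]
    if method == "single":      # maximum similarity over cross-group pairs
        return max(pair_sims)
    if method == "complete":    # minimum similarity over cross-group pairs
        return min(pair_sims)
    if method == "average":     # mean similarity over cross-group pairs
        return sum(pair_sims) / len(pair_sims)
    raise ValueError(f"unknown linkage method: {method}")


def agglomerate(items, sim, k, method="single"):
    """Greedy, global agglomeration: fuse the most similar pair until k groups remain."""
    groups = [[x] for x in items]
    while len(groups) > k:
        # Global step: every pair of current groups is considered.
        i, j = max(
            combinations(range(len(groups)), 2),
            key=lambda ij: linkage_similarity(groups[ij[0]], groups[ij[1]], sim, method),
        )
        # Greedy step: fuse the best pair (j > i, so index i stays valid).
        merged = groups.pop(j)
        groups[i].extend(merged)
    return groups
```

For example, with numbers as stand-in "documents" and `sim = lambda a, b: -abs(a - b)`, calling `agglomerate([0, 1, 10, 11], sim, 2)` groups 0 with 1 and 10 with 11.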

4 Definitions
For each document $\alpha$ ³ in a collection (or corpus) $C$, let the countfile $c(\alpha)$ be the set of words, with their frequencies, that occur in that document. Let $V$ be the set of unique words occurring in $C$. Then $c(\alpha)$ can be represented as a vector of length $|V|$:

$$c(\alpha) = \{ f(w_i, \alpha) \}_{i=1}^{|V|}$$

where $w_i$ is the $i$th word in $V$ and $f(w_i, \alpha)$ is the frequency of $w_i$ in $\alpha$.

³ Throughout this paper, lower case Greek letters will be used to denote individual documents, upper case Greek letters will denote sets of documents (document groups), and upper case Roman letters will denote sets of document groups.

To measure the similarity between pairs of documents, $\alpha$ and $\beta$, let us employ the cosine between monotone element-wise functions of $c(\alpha)$ and $c(\beta)$. In particular, let

$$s(\alpha, \beta) = \frac{\langle g(c(\alpha)), g(c(\beta)) \rangle}{\| g(c(\alpha)) \| \, \| g(c(\beta)) \|}$$

where $g$ is a monotone damping function, $\langle \cdot, \cdot \rangle$ denotes inner product, and $\| \cdot \|$ denotes vector norm. It has been our experience that taking $g$ to be component-wise square-root produces better results than the traditional component-wise logarithm.

It is useful to consider similarity to be a function of document profiles $p(\alpha)$, where

$$p(\alpha) = \frac{g(c(\alpha))}{\| g(c(\alpha)) \|}$$

in which case

$$s(\alpha, \beta) = \langle p(\alpha), p(\beta) \rangle = \sum_{i=1}^{|V|} p(\alpha)_i \, p(\beta)_i .$$

Suppose $\Gamma$ is a set of documents, or a document group. A simple profile can be associated with $\Gamma$ by defining it to be the normalized sum of profiles of the contained individuals. Let

$$\hat{p}(\Gamma) = \sum_{\alpha \in \Gamma} p(\alpha)$$

be the unnormalized sum profile, and then

$$p(\Gamma) = \frac{\hat{p}(\Gamma)}{\| \hat{p}(\Gamma) \|} .$$

Similarly, the cosine measure can be extended to $\Gamma$ by employing this profile definition:

$$s(\Gamma, x) = \langle p(\Gamma), p(x) \rangle .$$

Sometimes, for our purposes, the normalized sum profile is not a good measure of the contents of a document group, because it takes into account documents which lie on the outskirts of the group. To solve this problem, we define the trimmed sum profile $p_m(\Gamma)$ for any cluster $\Gamma$ by considering only the $m$ most central documents of the cluster. For every $\alpha$ in $\Gamma$ consider the similarity $s(\alpha, \Gamma)$, and let $r_m(\Gamma)$ be the $m$ documents most central to $\Gamma$, namely those whose similarity $s(\alpha, \Gamma)$ is largest. Then define

$$\hat{p}_m(\Gamma) = \sum_{\alpha \in r_m(\Gamma)} p(\alpha)$$

and

$$p_m(\Gamma) = \frac{\hat{p}_m(\Gamma)}{\| \hat{p}_m(\Gamma) \|} .$$

This computation can be completed in time proportional to $|\Gamma|$.⁴ The trimming parameter $m$ may be defined adaptively as some percentage of $|\Gamma|$, or may be fixed.

⁴ A full sort of the similarities is not required.

4.1 Cluster Digest
Another description of a document group is in some sense dual to the trimmed sum profile. Rather than considering the central documents of a cluster, we can consider the central words, namely those which appear most frequently in the group as a whole. We thus define $t_w(\Gamma)$, the topical words of $\Gamma$, to be the $w$ highest weighted terms in $p(\Gamma)$ (or perhaps in $p_m(\Gamma)$). Taken together, the two sets $(r_m(\Gamma), t_w(\Gamma))$ form the $(m, w)$ cluster digest of $\Gamma$, a short description of the contents of the cluster. The cluster digest can easily be computed in time $O(|\Gamma| + |V|)$, and is in fact the summary used to describe a cluster to a user of Scatter/Gather.

5 Partitional Clustering
Seed-based partitional clustering algorithms have three phases:

1 Find $k$ centers.
2 Assign each document in the collection to a center.
3 Refine the partition so constructed.

The result is a set $P$ of $k$ disjoint document groups such that $\bigcup_{\Pi \in P} \Pi = C$.

The Buckshot and Fractionation algorithms are both designed to find the initial centers. They can be thought of as rough clustering algorithms; however, their output is only used to define centers. Both algorithms assume the existence of some algorithm which clusters well, but which may run slowly. Let us call this procedure the cluster subroutine. For our purposes, we use group average agglomerative clustering for this subroutine (see appendix B). Each algorithm uses this cluster subroutine locally over small sets, and builds on its results to find the $k$ centers.
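The definitions of Section 4 (document profile, cosine similarity, sum profile, and the $(m, w)$ cluster digest of Section 4.1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: documents are assumed to be non-empty lists of word tokens, profiles are stored as sparse dictionaries, the function names are hypothetical, and the topical words are taken from the untrimmed profile $p(\Gamma)$. `heapq.nlargest` is used so that no full sort of the similarities is needed.

```python
import heapq
import math
from collections import Counter


def normalize(vec):
    """Scale a sparse vector (dict word -> weight) to unit length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()}


def profile(words):
    """Document profile p(alpha): square-root damped counts, unit length."""
    counts = Counter(words)
    return normalize({w: math.sqrt(f) for w, f in counts.items()})


def similarity(p, q):
    """Cosine similarity of two unit-length sparse profiles (inner product)."""
    if len(p) > len(q):
        p, q = q, p
    return sum(v * q.get(w, 0.0) for w, v in p.items())


def sum_profile(profiles):
    """Normalized sum profile p(Gamma) of a group of document profiles."""
    total = Counter()
    for p in profiles:
        for w, v in p.items():
            total[w] += v
    return normalize(total)


def cluster_digest(profiles, m, w):
    """Return (indices of the m most central documents, w topical words)."""
    center = sum_profile(profiles)
    central_docs = heapq.nlargest(
        m, range(len(profiles)), key=lambda i: similarity(profiles[i], center)
    )
    topical_words = heapq.nlargest(w, center, key=center.get)
    return central_docs, topical_words
```

As a usage example, three toy documents `["iraq", "kuwait", "oil"]`, `["iraq", "oil", "oil"]` and `["germany", "reunification"]` give a (2, 2) digest whose central documents are the first two and whose topical words are "oil" and "iraq".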

Buckshot applies the cluster subroutine to a random sample to find centers. Fractionation uses successive application of the cluster subroutine over fixed sized groups to find centers. We believe Fractionation is the more accurate center finding procedure. However, Buckshot is significantly faster, and, hence, is more appropriate for the on-the-fly online reclustering required by iterations of Scatter/Gather. Fractionation can be used to establish the primary partitioning of the entire corpus, which is displayed in the first iteration of Scatter/Gather.

We implement Step 2 by assigning each document to the nearest center (in a sense to be defined later).

Our refinement algorithms also reflect a time-accuracy tradeoff. The simplest refinement procedure, iterated move-to-nearest, is fast but limited. A more comprehensive refinement is achieved through repeated application of procedures that attempt to Split, Join, and clarify elements of the partition $P$.

5.1 Finding Initial Centers

Buckshot
The idea of the Buckshot algorithm is quite simple. To achieve a rectangular time clustering algorithm, merely choose a small random sample of the documents (of size $\sqrt{kn}$), and apply the cluster subroutine. Return the centers of the clusters found. This algorithm clearly runs in time $O(kn)$.

Since random sampling is employed, the Buckshot algorithm is not deterministic. That is, repeated calls to this algorithm on the same corpus may produce different partitions, although in our experience repeated trials generally produce qualitatively similar partitions.

Fractionation
The Fractionation algorithm finds $k$ centers by initially breaking the corpus $C$ into $n/m$ buckets of a fixed size $m > k$. The cluster subroutine is then applied to each of these buckets separately to agglomerate individuals into document groups such that the reduction in number (from individuals to groups in each bucket) is roughly a factor of $\rho$. These groups are now treated as if they were individuals, and the entire process is repeated. The iteration terminates when only $k$ groups remain. Fractionation can be viewed as building a $1/\rho$ branching tree bottom up, where the leaves are individual documents, terminating when only $k$ roots remain.

Suppose the individuals in $C$ are enumerated, so that $C = \{\alpha_1, \alpha_2, \ldots, \alpha_n\}$. This ordering could reflect an extrinsic ordering on $C$, but a better procedure sorts $C$ based on a key which is the word index of the $j$th most common word in each individual. Typically $j$ is a small number, such as three, which favors medium frequency terms. This procedure thus encourages nearby individuals in the corpus ordering to have at least one word in common.

The initial bucketing creates a partition $B = \{\Theta_1, \Theta_2, \ldots, \Theta_{n/m}\}$ such that

$$\Theta_i = \{\alpha_{m(i-1)+1}, \alpha_{m(i-1)+2}, \ldots, \alpha_{mi}\} .$$

Each $\Theta_i$ is then separately clustered (using the cluster subroutine) into $\rho m$ groups, where $\rho$ is the desired reduction factor. Note that each of these computations occurs in $m^2$ time, and, hence, all $n/m$ occur in $nm$ time. Each application of agglomerative clustering produces an associated partition $R_i = \{\Phi_{i,1}, \Phi_{i,2}, \ldots, \Phi_{i,\rho m}\}$. The unions of the documents contained in these groups are then treated as individuals for the next iteration. That is, define

$$C' = \{\Phi_{i,j} : 1 \le i \le n/m, \; 1 \le j \le \rho m\} .$$

$C'$ inherits an enumeration order by taking the $\Phi_{i,j}$ in lexicographic order on $i$ and $j$. The process is then repeated with $C'$ replacing $C$. That is, the $\rho n$ components of $C'$ are broken into $\rho n / m$ buckets, which are further reduced to $\rho^2 n$ groups through separate agglomeration. The process terminates at iteration $j$ if $\rho^j n < k$. At this point one final application of agglomerative clustering can reduce the remaining groups to a partition $P$ of size $k$.

To determine the running time, observe that iteration $j$ operates on $\rho^j n$ items, and thus takes time $\rho^j n m$. The overall running time is thus $O(nm(1 + \rho + \rho^2 + \cdots)) = O(mn)$. Hence, if $m = O(k)$, this algorithm has rectangular running time.

5.2 Assigning Documents to Centers
Once $k$ centers have been found, and suitable profiles defined for those centers, each document in $C$ must be assigned to one of those centers based on some criterion. The simplest algorithm, Assign-to-Nearest, assigns each document to the nearest center.

Let $G$ be a partition of the collection into $k$ groups, and let $\Gamma_i$ be the $i$th group in $G$. Let

$$\Pi_i = \{\alpha \in C : i \text{ maximizes } s(\alpha, \Gamma_i)\} .$$

Ties can be broken by assigning $\alpha$ to the group with lowest index. The desired partition is then $P = \{\Pi_i\}$, $0 \le i < k$. $P$ can be efficiently computed by constructing an inverted map for the $k$ centers $p_m(\Gamma_i)$, and for each $\alpha \in C$ simultaneously computing the similarity to all the centers. In any case, the cost of this procedure is proportional to $kn$.

5.3 Refinement
Given an initial clustering, it is now desirable to refine it into a better one. As with our initial clustering algorithms, there is a tradeoff between speed and accuracy.
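The Buckshot center-finding step of Section 5.1 and the Assign-to-Nearest step of Section 5.2 can be sketched together as follows. This is a minimal sketch under several assumptions that are not taken from the paper: profiles are dense lists of floats of equal length, all function names are hypothetical, and `cluster_subroutine` is a simple greedy agglomeration on centroid similarity that stands in for the group average agglomerative clustering of appendix B. The sample size follows the $\sqrt{kn}$ rule, and ties in assignment go to the lowest index.

```python
import math
import random


def dot(u, v):
    """Inner product of two dense vectors."""
    return sum(a * b for a, b in zip(u, v))


def centroid(vectors):
    """Normalized sum profile of a non-empty group of vectors."""
    total = [sum(col) for col in zip(*vectors)]
    norm = math.sqrt(dot(total, total)) or 1.0
    return [x / norm for x in total]


def cluster_subroutine(vectors, k):
    """Stand-in for the slow, careful clusterer: fuse the two groups whose
    centroids are most similar until only k groups remain."""
    groups = [[v] for v in vectors]
    while len(groups) > k:
        cents = [centroid(g) for g in groups]
        i, j = max(
            ((a, b) for a in range(len(groups)) for b in range(a + 1, len(groups))),
            key=lambda ab: dot(cents[ab[0]], cents[ab[1]]),
        )
        merged = groups.pop(j)          # j > i, so index i stays valid
        groups[i].extend(merged)
    return groups


def buckshot_centers(vectors, k, rng):
    """Cluster a random sample of about sqrt(k * n) documents; return k centers."""
    n = len(vectors)
    sample_size = min(n, max(k, math.ceil(math.sqrt(k * n))))
    sample = rng.sample(vectors, sample_size)
    return [centroid(g) for g in cluster_subroutine(sample, k)]


def assign_to_nearest(vectors, centers):
    """Assign every document to its most similar center (ties: lowest index)."""
    groups = [[] for _ in centers]
    for v in vectors:
        sims = [dot(v, c) for c in centers]
        groups[sims.index(max(sims))].append(v)
    return groups
```

A call such as `assign_to_nearest(vectors, buckshot_centers(vectors, k, random.Random()))` yields one rough partition. Because the sample is random, repeated calls may return different partitions, which matches the non-determinism noted in Section 5.1.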

The rates and

simplest poorly Join

process just defined

is simply discussed. clusters into which

to iterate The two Split well

the

Assign-tosepaparts

Join The purpose of the groups elements documents words may by their Join refinement digests. operator P that they However, Therefore groups Since, is to merge usefully have of be by definition, never their lists

Nearest

process merges

algorithm separated similar.

clusters

are too

document distinguished any two typical

in a partition cluster of P are disjoint, in common. well overlap. two between

are not will

Iterated The also From using each

Assign-to-Nearest procedure first of our mentioned refinement above can

Assign-to-Nearest be seen a given document This as the

topical

the criterion A will

algorithms. cluster centers assign new

of distinguishability T(r, where merge time which A) =

17 and

set of clusters, sum profiles to the process nearest

we generate above, center

ItU(r)

n tu(A)l of w most A) the words number In large is thus topical some for of words each words the for 17. We
W.

the trimmed

and we then so as to form indefinitely, few steps, number

t~ (I) r and

is the list A if I(I, the to then

clusters. it makes is typically

can be iterated gains only in the first a small fixed

though and hence of times.5

> p, for

P, 0 < P < cluster in the

its greatest iterated

Determining proportional pus, and

topical

takes corof and

we must

compute

k2 intersections corpora, O(kn).

to decide number

clusters is typically

to merge. less than of Join time

Split

Split divides each document group $\Gamma$ in a partition $P$ into two new groups. This can be accomplished by applying Buckshot clustering (without refinement) with $C = \Gamma$ and $k = 2$. The resulting Buckshot partition $G$ provides the two new groups. Let $P = \{\Gamma_1, \Gamma_2, \ldots, \Gamma_k\}$ and let $G_i = \{\Gamma_{i,1}, \Gamma_{i,2}\}$ be a two element Buckshot partition of $\Gamma_i$. The new partition $P'$ is simply the union of the $G_i$'s:

$$P' = \bigcup_{i=1}^{k} G_i.$$

Each application of Buckshot requires time proportional to $|\Gamma_i|$. Hence, the overall computation can be performed in time proportional to $N$.

A modification of this procedure would only split groups that score poorly on some coherency criterion. One simple criterion is the cluster self similarity $s(\Gamma, \Gamma)$. This quantity is in fact proportional to the average similarity between documents in the cluster, as well as to the average similarity of a document to the cluster centroid. We thus define:

$$A(\Gamma, P) = s(\Gamma, \Gamma).$$

Let $r(\Gamma_i, P)$ be the rank of $A(\Gamma_i, P)$ in the set $\{A(\Gamma_1, P), A(\Gamma_2, P), \ldots, A(\Gamma_k, P)\}$. The procedure would then only split groups such that $r(\Gamma, P) < pk$ for some $p$, $0 < p \le 1$. This modification does not change the order of the algorithm, since the coherence criterion can be computed in time proportional to $N$.

6 Application to Scatter/Gather

Combinations of the various initial clustering and refinement procedures give several possible complete clustering algorithms. We have used two of these combinations in the course of implementing Scatter/Gather.

The initial partition used in the Scatter/Gather method is completely determined by the corpus under consideration. Hence, when the corpus is available in advance, one can compute the initial partition offline. We can therefore use a slower clustering algorithm to improve the accuracy of the initial partition. However, for corpora consisting of tens of thousands of documents, a quadratic time algorithm is likely to be too slow even for offline computation. We thus use the Fractionation algorithm to find initial centers, and then perform a great deal of refinement using the Split, Join, and Assign-to-Nearest operators. Note that the running time for each of the refinement procedures is $O(kN)$ and thus does not affect the overall running time.

In an interactive session, however, it is vital for the clustering algorithm to run as quickly as possible, even at the expense of some accuracy. We therefore use the Buckshot procedure for finding centers, and then follow it with the bare minimum of refinement. We have found that two iterations of the Assign-to-Nearest procedure yield a reasonably accurate clustering, and that further refinement produces some additional improvement, but with quickly diminishing returns (see footnote 5).

5 Excessive iteration may in fact worsen the partition rather than improving it, since "fuzzy" elongated clusters can pull documents away from other clusters and become even fuzzier.

By virtue of the center finding procedure, the Buckshot algorithm is not deterministic. However, in the contemplated application, Scatter/Gather, it is more important that the partition be computed at high speed than that the algorithm be deterministic. Indeed, the lack of determinism might be interpreted as a feature, since the user then has the option of discarding an unrevealing partition in favor of a fresh reclustering.
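The interactive configuration described above (Buckshot centers followed by two Assign-to-Nearest passes) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: documents are assumed to be unit-length term vectors compared by inner product, the sample size is taken to be the square root of kn, and a naive group average agglomeration stands in for the cluster subroutine. All function names (`buckshot`, `agglomerate`, `assign_to_nearest`, and the helpers) are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def centroid(members, docs):
    """Normalized sum of the member vectors (assumes a nonzero sum)."""
    total = [sum(col) for col in zip(*(docs[i] for i in members))]
    return unit(total)

def avg_sim(group, docs):
    """Average pairwise similarity inside a group, by direct summation."""
    pairs = [(a, b) for a in group for b in group if a != b]
    return sum(dot(docs[a], docs[b]) for a, b in pairs) / len(pairs)

def agglomerate(sample, docs, k):
    """Naive stand-in for the cluster subroutine: merge the best pair until k remain."""
    groups = [[i] for i in sample]
    while len(groups) > k:
        x, y = max(
            ((x, y) for x in range(len(groups)) for y in range(x + 1, len(groups))),
            key=lambda xy: avg_sim(groups[xy[0]] + groups[xy[1]], docs),
        )
        groups[x] = groups[x] + groups[y]
        del groups[y]
    return groups

def assign_to_nearest(docs, centers):
    """Assign every document to the center with the largest inner product."""
    groups = [[] for _ in centers]
    for i, d in enumerate(docs):
        j = max(range(len(centers)), key=lambda c: dot(d, centers[c]))
        groups[j].append(i)
    return groups

def buckshot(docs, k, rng, iterations=2):
    """Sample about sqrt(k*n) documents, cluster them into k centers, then refine."""
    n = len(docs)
    size = min(n, max(k, math.isqrt(k * n)))
    sample = rng.sample(range(n), size)      # the only source of non-determinism
    centers = [centroid(g, docs) for g in agglomerate(sample, docs, k)]
    groups = []
    for _ in range(iterations):
        groups = assign_to_nearest(docs, centers)
        centers = [centroid(g, docs) for g in groups if g]
    return [g for g in groups if g]
```

Passing a freshly seeded `random.Random` reproduces a partition exactly, while a new seed gives the "fresh reclustering" mentioned above.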


The overall complexity of both clustering procedures described in this section is clearly $O(kN)$. The constant factor for the Buckshot-based procedure is small enough to permit interactive use with large document collections. The Fractionation-based procedure has a somewhat larger constant factor, but one which is still acceptable for offline applications.

6.1 Naturally Clustered Data

It is worth examining the performance of our algorithms when the data set consists of well separated clusters of points. If the input data has $k$ natural clusters, i.e., the smallest intra-cluster document similarity is larger than the largest inter-cluster document similarity, then both of our algorithms will find this partition.

For Buckshot, if we have a corpus containing $k$ widely separated clusters of equal size, then a random sample of size $\sqrt{kn}$ will with high probability select documents from each of the $k$ clusters, so long as $n \gg k \ln k$. This will certainly be true in our case, in which $k = 20$ or so. To see this, compute the probability that, if we choose a sample of size $s$, we fail to get any individual from some cluster. For any individual cluster $i$, the probability that none of our $s$ individuals is a member of cluster $i$ is $(1 - 1/k)^s$. So the probability of failure for some cluster is at most $k$ times this, namely $k(1 - 1/k)^s$. If we now take $s = a k \ln k$ for some $a$, then the total probability of failure is at most

$$k(1 - 1/k)^{a k \ln k} < k^{1-a}.$$

Thus in our case, with $k = 20$, taking $a = 5$ means that with 400 samples we find all the clusters 999 times in 1000.

Given that we start with at least one element from each cluster, each cluster we have found in the sample will be a subset of one of the actual clusters. Thus the resulting set of centers will include a center within each actual cluster.

For Fractionation, we need merely note that if we have some pair of documents from the same cluster in a single bucket, then neither of them will ever be merged with a document from a different cluster in preference to the other. Therefore, no pair of documents from different clusters will be merged. Thus, when we finish, each cluster we have found will be a subset of some one of the actual clusters.

Scatter/Gather Session

In figures 2 through 5, we present the output of a Scatter/Gather session applied to the corpus described in section 2.1. This corpus consists of about 5000 articles distributed by the New York Times News Service during the month of August 1990. The full set of articles is roughly 30 megabytes of ASCII text. Some articles are repeated due to updates of news stories.

Here the goal is to learn about international political events during that month. To create the initial partition for this task, the Buckshot clustering algorithm is used (figure 2); Fractionation is recommended, time permitting. Each cluster is described with a two line display of its cluster digest. The first line contains the cluster number, the number of documents in the cluster, and titles of documents near the centroid. The second line contains words frequent in the cluster.

We select clusters 2 (Iraq's invasion of Kuwait), 5 (Markets, including oil) and 6 (Germany, and probably other international issues) as those which seem likely to contain articles of interest, recluster, and display a new cluster digest (figure 3).

Next, in figure 4, we iterate, this time selecting clusters 3 (Pakistan, and probably other international issues) and 4 (African issues). Specific incidents have been separated out. We find hostages in Trinidad, war in Liberia, police action in South Africa, and so on. We obtain more detail about the situation in Liberia by viewing the titles of the articles contained in that cluster (figure 5).

Conclusion

Scatter/Gather demonstrates that document clustering can be an effective information access tool in its own right. The table-of-contents metaphor gives an intuitive basis, and experience has shown that it is indeed easy to use. The Scatter/Gather method is particularly helpful in situations in which it is difficult or undesirable to specify a query formally. Claims of improved performance must await evaluation metrics appropriate to the vaguely defined information access goals in which Scatter/Gather excels.

To support Scatter/Gather, fast clustering algorithms are essential. Clustering can be done quickly by working in a local manner on small groups of documents rather than trying to deal with the entire corpus globally. For extremely large corpora, even the linear time clustering achieved by the Buckshot or Fractionation algorithms may be too slow. We are working to develop variations on Scatter/Gather which will scale to arbitrarily large corpora, under the assumption that linear time preprocessing will always be feasible.

Clearly, the accuracy of the Buckshot and Fractionation algorithms is affected by the quality of the clustering provided by the slow cluster subroutine. This provides some further motivation to find highly accurate clustering algorithms, whatever their running time may be.


Figure 2: Initial Scattering

[Session transcript for the command (time (setq first (outline (all-dots tdb)))). The 4970 items are scattered into eight clusters, numbered 0 through 7, of sizes 287, 1731, 749, 275, 481, 844, 310 and 293. Each cluster is shown as a two line digest of typical titles and frequent words. Real time: 131258 msec.]
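The two line cluster display used in these figures (cluster number, size and titles near the centroid on the first line, frequent words on the second) could be produced along the following lines. This is a minimal sketch under assumptions: each document is a hypothetical (title, term count) pair, raw term counts stand in for whatever term weighting the system actually uses, and the name `cluster_digest` and its parameters are invented for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count mappings."""
    num = sum(a[t] * b.get(t, 0) for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def cluster_digest(number, cluster, n_words=5, n_titles=2):
    """Build a two line digest for a cluster given as a list of (title, Counter)."""
    profile = Counter()
    for _, terms in cluster:
        profile.update(terms)                 # sum profile of the cluster
    words = [w for w, _ in profile.most_common(n_words)]
    # Titles of the documents closest to the cluster's sum profile.
    ranked = sorted(cluster, key=lambda d: cosine(d[1], profile), reverse=True)
    titles = "; ".join(title for title, _ in ranked[:n_titles])
    return f"{number} ({len(cluster)}) {titles}\n  {', '.join(words)}"
```

Ranking titles against the sum profile rather than a normalized centroid gives the same order, since cosine similarity ignores the scale of the profile.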

Figure 3: Second Scatter

[Session transcript for the command (time (setq second (outline first 2 5 6))). The 1903 items of clusters 2, 5 and 6 from figure 2 are scattered into eight clusters, numbered 0 through 7, of sizes 650, 66, 57, 117, 59, 242, 586 and 126. Real time: 54184 msec.]


Figure 4: Third Scatter

[Session transcript for the command (time (setq third (outline second 3 4))). The 176 items of clusters 3 and 4 from figure 3 are scattered into eight clusters, numbered 0 through 7, of sizes 5, 16, 28, 1, 51, 7, 55 and 13. Real time: 11140 msec.]

Figure 5: Titles of articles in topic 1 from Figure 4

[Session transcript for the command (print-titles (nth 1 third)), listing article numbers and headlines for the Liberia cluster, for example "REBEL LEADER SEIZES ABOUT A DOZEN FOREIGNERS", "WEST AFRICAN FORCE SENT TO LIBERIA AS TALKS REMAIN DEADLOCKED", "WAR THREATENS TO WIDEN AS NEIGHBORING COUNTRIES TAKE SIDES" and "OUSTER OF LIBERIAN PRESIDENT NOW SEEMS INEVITABLE".]


B Group Average Clustering

Here we present a quadratic time greedy global agglomerative clustering algorithm which has given good results in our implementation. This algorithm is similar to the algorithm presented in [3].

Let $\Gamma$ be a document group. The average similarity between any two documents in $\Gamma$ is defined to be

$$S(\Gamma) = \frac{1}{|\Gamma|(|\Gamma| - 1)} \sum_{\alpha \in \Gamma} \sum_{\beta \in \Gamma,\, \beta \neq \alpha} s(\alpha, \beta).$$

Let $G$ be a set of disjoint document groups. The basic iteration of group average agglomerative clustering finds the two different groups $\Gamma'$ and $\Delta'$ which maximize $S(\Gamma \cup \Delta)$ over all choices from $G$. A new, smaller, partition $G'$ is then constructed by merging $\Gamma'$ with $\Delta'$:

$$G' = (G - \{\Gamma', \Delta'\}) \cup \{\Gamma' \cup \Delta'\}.$$

Initially, $G$ is simply a set of singleton groups, one for each individual to be clustered. The iteration terminates when $|G'| = k$. Note that the output from this procedure is the final flat partition $G'$, rather than a nested hierarchy of partitions, although the latter could be computed by recording each pairwise join as one level in a dendrogram.

If we employ the cosine similarity measure, the inner maximization can be significantly accelerated. Recall that $\hat{p}(\Gamma)$ is the unnormalized sum profile associated with $\Gamma$. Then the average pairwise similarity, $S(\Gamma)$, is simply related to the inner product, $\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle$. That is, since

$$\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle = |\Gamma|(|\Gamma| - 1)\, S(\Gamma) + \sum_{\alpha \in \Gamma} \langle p(\alpha), p(\alpha) \rangle = |\Gamma|(|\Gamma| - 1)\, S(\Gamma) + |\Gamma|,$$

we have

$$S(\Gamma) = \frac{\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle - |\Gamma|}{|\Gamma|(|\Gamma| - 1)}.$$

Similarly, for the union of two disjoint groups, $\Lambda = \Gamma \cup \Delta$,

$$S(\Lambda) = \frac{\langle \hat{p}(\Lambda), \hat{p}(\Lambda) \rangle - (|\Gamma| + |\Delta|)}{(|\Gamma| + |\Delta|)\,((|\Gamma| + |\Delta|) - 1)},$$

where

$$\langle \hat{p}(\Lambda), \hat{p}(\Lambda) \rangle = \langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle + 2 \langle \hat{p}(\Gamma), \hat{p}(\Delta) \rangle + \langle \hat{p}(\Delta), \hat{p}(\Delta) \rangle.$$

Therefore, if for every $\Gamma \in G$, $S(\Gamma)$ and $\hat{p}(\Gamma)$ are known, the pairwise merge that will produce the least decrease in average similarity can be cheaply computed, and these quantities can be updated each time a merge is performed. Further, suppose for every $\Gamma \in G$ the $\Delta'$ were known such that

$$S(\Gamma \cup \Delta') = \max_{\Delta \neq \Gamma} S(\Gamma \cup \Delta),$$

then finding the best pair would simply involve scanning the $|G|$ candidates. Updating these quantities with each iteration is straightforward, since only those involving $\Gamma'$ and $\Delta'$ need be recomputed.

Using techniques such as these, it can be seen that the average time complexity for truncated group average agglomerative clustering is $O(n^2)$, where $n$ is equal to the number of individuals to be clustered.
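The sum profile identities above can be illustrated with a short sketch. It assumes documents are unit-length vectors compared by inner product, and the names `group_average` and `avg_sim_from_sum` are hypothetical. For brevity it rescans every pair on each iteration and does not cache each group's best partner, so it shows the merge criterion rather than the quadratic running time.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def avg_sim_from_sum(p_hat, size):
    """S = (<p_hat, p_hat> - size) / (size * (size - 1)) for unit-length documents."""
    return (dot(p_hat, p_hat) - size) / (size * (size - 1))

def group_average(docs, k):
    """Truncated group average agglomeration of unit vectors into k groups."""
    # Each group is stored as (member indices, unnormalized sum profile).
    groups = [([i], list(d)) for i, d in enumerate(docs)]
    while len(groups) > k:
        best, best_pair = -math.inf, None
        for x in range(len(groups)):
            for y in range(x + 1, len(groups)):
                (mx, px), (my, py) = groups[x], groups[y]
                size = len(mx) + len(my)
                # Inner product of the merged sum profile, built from the parts.
                pp = dot(px, px) + 2 * dot(px, py) + dot(py, py)
                score = (pp - size) / (size * (size - 1))
                if score > best:
                    best, best_pair = score, (x, y)
        x, y = best_pair
        (mx, px), (my, py) = groups[x], groups[y]
        groups[x] = (mx + my, add(px, py))    # merge the pair, add the profiles
        del groups[y]
    return [members for members, _ in groups]
```

Because only sum profiles and group sizes are consulted, the cost of scoring a candidate merge does not depend on how many documents the two groups contain.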


References
[1] Chris Buckley and Alan F. Lewit. Optimizations of inverted vector searches. In Proceedings of the Eighth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 97-110, 1985.

[2] W.B. Croft. Clustering large files of documents using the single-link method. Journal of the American Society for Information Science, 28:341-344, 1977.

[3] A. El-Hamdouchi and P. Willett. Hierarchical document clustering using Ward's method. In Proceedings of the Ninth International Conference on Research and Development in Information Retrieval, pages 149-156, 1986.

[4] A. Griffiths, H.C. Luckhurst, and P. Willett. Using inter-document similarity information in document retrieval systems. Journal of the American Society for Information Science, 37:3-11, 1986.

[5] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, N.J. 07632, 1988.

[6] N. Jardine and C.J. van Rijsbergen. The use of hierarchical clustering in information retrieval. Information Storage and Retrieval, 7:217-240, 1971.

[7] J.O. Pedersen, D.R. Cutting, and J.W. Tukey. Snippet search: a single phrase approach to text access. In Proceedings of the 1991 Joint Statistical Meetings. American Statistical Association, 1991. Also available as Xerox PARC technical report SSL-91-08.

[8] G. Salton. The SMART Retrieval System. Prentice-Hall, Englewood Cliffs, N.J., 1971.

[9] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

[10] R. Sibson. SLINK: an optimally efficient algorithm for the single link cluster method. Computer Journal, 16:30-34, 1973.

[11] C.J. van Rijsbergen. Information Retrieval. Butterworths, London, second edition, 1979.

[12] C.J. van Rijsbergen and W.B. Croft. Document clustering: An evaluation of some experiments with the Cranfield 1400 collection. Information Processing & Management, 11:171-182, 1975.

[13] P. Willett. Document clustering using an inverted file approach. Journal of Information Science, 2:223-231, 1980.

[14] P. Willett. A fast procedure for the calculation of similarity coefficients in automatic classification. Information Processing and Management, 17:53-60, 1981.

[15] P. Willett. Recent trends in hierarchical document clustering: A critical review. Information Processing & Management, 24(5):577-597, 1988.
