You are on page 1of 25

Clustering Web Search Results

Iwona Biaynicka-Birula

Overview

What is clustering?
Applying clustering to web search results
Clustering algorithms
Case studies
Related topics not covered

Clustering
Clustering in general
Document clustering in general
Other search and browsing aids
Classification
Visualization
Query expansion
Iwona Biaynicka-Birula - Clustering Web Search Results

What is clustering?

Clustering the act of grouping similar


object into sets
In the web search context:
organizing web pages (search results)
into groups, so that different groups
correspond to different user needs
search engine
i.e.: engine
car part
Engine Corp.
Iwona Biaynicka-Birula - Clustering Web Search Results

Clustering vs. Classification

Classification assigns objects to


predefined groups
Clustering infers groups based on
clustered objects

Iwona Biaynicka-Birula - Clustering Web Search Results

Why cluster web search results?

Flat ranked list not enough

Relationships between the results

Documents pertaining to different topics


cannot be compared
Cluster Hypothesis (van Rijsbergen 1979):
Closely related documents tend to be relevant
to the same requests.

Aids user-engine interaction

Browsing
Help user express his need

Iwona Biaynicka-Birula - Clustering Web Search Results

Why not just document clustering?

Web search results clustering is a


version of document clustering,
but
Billions of pages
Constantly changing
Data mainly unstructured and
heterogeneous
Additional information to consider
(i.e. links, click-through data, etc.)
Iwona Biaynicka-Birula - Clustering Web Search Results

Some requirements

Fast

Flexible

Immediate response to query


Web content changes constantly

User-oriented

Main goal is to aid the user in finding


sought information

Iwona Biaynicka-Birula - Clustering Web Search Results

Main issues

Online or offline clustering?


What to use as input

How to define similarity?

Entire documents
Snippets
Structure information (links)
Other data (i.e. click-through)
Use stop word lists, stemming, etc.
Content (i.e. vector-space model)
Link analysis
Usage statistics

How to group similar documents?


How to label the groups?
Iwona Biaynicka-Birula - Clustering Web Search Results

Clustering algorithms

Flat or hierarchical?
Overlapping?
Hard or soft?
Incremental?
Predefined cluster number?
Requiring explicit similarity
measure? Distance measure?

Iwona Biaynicka-Birula - Clustering Web Search Results

Clustering algorithms

Distance-based

Hierarchical
Agglomerative Hierarchical Clustering (AHC)
Flat
K-means (can be fuzzy)
Single-pass (incremental)

Other

Suffix Tree Clustering (Grouper)


Self-organizing (Kohonen) maps (neural
networks)
Latent Semantic Indexing (LSI) (reducing the
dimensionality of the vector-space)

Iwona Biaynicka-Birula - Clustering Web Search Results

Agglomerative hierarchical clustering

Iwona Biaynicka-Birula - Clustering Web Search Results

Clustering result: dendrogram

Iwona Biaynicka-Birula - Clustering Web Search Results

AHC variants

Various ways of calculating cluster


similarity

single-link
complete-link
(minimum)
(maximum)

Group-average
(average)
Iwona Biaynicka-Birula - Clustering Web Search Results

K-means clustering (k=3)

Iwona Biaynicka-Birula - Clustering Web Search Results

Single-pass

Iwona Biaynicka-Birula - Clustering Web Search Results

Selected systems

Scatter/Gather
Grouper
Carrot2
Vivisimo
Mapuccino
(Su et. al. 2001)
SHOC

Iwona Biaynicka-Birula - Clustering Web Search Results

Scatter/Gather

(Cutting et. al. 1992)


Designed for browsing
Based on two novel clustering
algorithms

Buckshot fast for online clustering


Fractionation accurate for offline
initial clustering of the entire set

Iwona Biaynicka-Birula - Clustering Web Search Results

Grouper

(Zamir and Etzioni 1997, 1999)


Online
Operates on query result snippets
Clusters together documents with
large common subphrases
Suffix Tree Clustering (STC)
STC induces labeling

Iwona Biaynicka-Birula - Clustering Web Search Results

Suffix Tree Clustering (STC)

Linear
Incremental
Overlapping
Can be extended to hierarchical

Iwona Biaynicka-Birula - Clustering Web Search Results

STC algorithm

Step 1: Cleaning

Step 2: Suffix tree construction

Stemming
Sentence boundary identification
Punctuation elimination
Produces base clusters (internal nodes)
Base clusters are scored based on size and
phrase score (which depends on length and
word quality)

Step 3: Merging base clusters

Highly overlapping clusters are merged

Iwona Biaynicka-Birula - Clustering Web Search Results

Carrot2

(Stefanowski and Weiss 2003)


http://www.cs.put.poznan.pl/dweiss/carr
ot/
Component framework
Allows substituting components for

Input (i.e. snippets from other search engines)


Filter
Stemming
Distance measure
Clustering
Output

Iwona Biaynicka-Birula - Clustering Web Search Results

Vivsimo

Commercial
http://www.vivisimo.com/
Online
Hierarchical
Conceptual

Iwona Biaynicka-Birula - Clustering Web Search Results

Other

Mapuccino (IBM)

(Su et. al. 2001)

(Maarek et. al. 2000)


http://www.alphaworks.ibm.com/tech/mapuccino
Relatively efficient AHC (O(n2))
Similarity based on vector-space model
Only usage statistics used as input
Recursive Density Based Clustering

SHOC

(Zhang and Dong 2004)


Grouper-like
Key phrase discovery

Iwona Biaynicka-Birula - Clustering Web Search Results

References

Douglass Cutting, David Karger, Jan Pedersen, and John W.


Tukey, Scatter/Gather: A Cluster-based Approach to Browsing
Large Document Collections, 1992.

Proceedings of the 15th Annual International ACM/SIGIR Conference, Copenhagen.

O. Zamir and O. Etzioni, Grouper: a dynamic clustering


interface to web search results, May 1999.

In Proceedings of the Eighth International World Wide Web Conference, Toronto, CanadaM. Steinbach, G.

Y.S. Maarek, R. Fagin, I.Z. Ben-Shaul, D. Pelleg, Ephemeral


document clustering for web applications, 2000.
Technical Report RJ 10186, IBM Research

Zhong Su, Qiang Yang, HongHiang Zhang, Xiaowei Xu and Yuhen


Hu, Correlation-based Document Clustering using Web
Logs, 2001.
J. Stefanowski, D. Weiss. Carrot2 and Language Properties in
Web Search Results Clustering, 2003.
In: Lecture Notes in Artificial Intelligence: Advances in Web Intelligence, Proceedings of the First
International Atlantic Web Intelligence Conference, Madrit, Spain, vol. 2663 (), pp. 240249

Dell Zhang, Yisheng Dong. Semantic, Hierarchical, Online


Clustering of Web Search Results, Apr 2004.
In Proceedings of the 6th Asia Pacific Web Conference (APWEB), Hangzhou, China

Iwona Biaynicka-Birula - Clustering Web Search Results

Thank you

Questions?
http://www.di.unipi.it/~iwona/Clust
ering.ppt

Iwona Biaynicka-Birula - Clustering Web Search Results

You might also like