Clustering Web Search Results: Iwona Białynicka-Birula

Clustering Web Search Results
Iwona Biaynicka-Birula
Overview
What is clustering?
Applying clustering to web search results
Clustering algorithms
Case studies
Related topics not covered
Clustering
Clustering in general
Document clustering in general
Other search and browsing aids
Classification
Visualization
Query expansion
Iwona Biaynicka-Birula - Clustering Web Search Results
What is clustering?
Clustering the act of grouping similar

object into sets
In the web search context:
organizing web pages (search results)
into groups, so that different groups
correspond to different user needs
search engine
i.e.: engine
car part
Engine Corp.
Clustering vs. Classification
Classification assigns objects to

predefined groups
Clustering infers groups based on
clustered objects
Why cluster web search results?
Flat ranked list not enough
Relationships between the results
Documents pertaining to different topics

cannot be compared
Cluster Hypothesis (van Rijsbergen 1979):
Closely related documents tend to be relevant
to the same requests.
Aids user-engine interaction
Browsing
Help user express his need
Why not just document clustering?
Web search results clustering is a

version of document clustering,
but
Billions of pages
Constantly changing
Data mainly unstructured and
heterogeneous
Additional information to consider
(i.e. links, click-through data, etc.)
Some requirements
Fast
Flexible
Immediate response to query

Web content changes constantly
User-oriented
Main goal is to aid the user in finding

sought information
Main issues
Online or offline clustering?

What to use as input
How to define similarity?
Entire documents
Snippets
Structure information (links)
Other data (i.e. click-through)
Use stop word lists, stemming, etc.
Content (i.e. vector-space model)
Link analysis
Usage statistics
How to group similar documents?

How to label the groups?
Flat or hierarchical?
Overlapping?
Hard or soft?
Incremental?
Predefined cluster number?
Requiring explicit similarity
measure? Distance measure?
Distance-based
Hierarchical
Agglomerative Hierarchical Clustering (AHC)
Flat
K-means (can be fuzzy)
Single-pass (incremental)
Other
Suffix Tree Clustering (Grouper)

Self-organizing (Kohonen) maps (neural
networks)
Latent Semantic Indexing (LSI) (reducing the
dimensionality of the vector-space)
Agglomerative hierarchical clustering
Clustering result: dendrogram
AHC variants
Various ways of calculating cluster

similarity
single-link
complete-link
(minimum)
(maximum)
Group-average
(average)
K-means clustering (k=3)
Single-pass
Selected systems
Scatter/Gather
Grouper
Carrot2
Vivisimo
Mapuccino
(Su et. al. 2001)
SHOC
Scatter/Gather
(Cutting et. al. 1992)

Designed for browsing
Based on two novel clustering
algorithms
Buckshot fast for online clustering

Fractionation accurate for offline
initial clustering of the entire set
Grouper
(Zamir and Etzioni 1997, 1999)

Online
Operates on query result snippets
Clusters together documents with
large common subphrases
Suffix Tree Clustering (STC)
STC induces labeling
Suffix Tree Clustering (STC)
Linear
Incremental
Overlapping
Can be extended to hierarchical
STC algorithm
Step 1: Cleaning
Step 2: Suffix tree construction
Stemming
Sentence boundary identification
Punctuation elimination
Produces base clusters (internal nodes)
Base clusters are scored based on size and
phrase score (which depends on length and
word quality)
Step 3: Merging base clusters
Highly overlapping clusters are merged
Carrot2
(Stefanowski and Weiss 2003)

http://www.cs.put.poznan.pl/dweiss/carr
ot/
Component framework
Allows substituting components for
Input (i.e. snippets from other search engines)

Filter
Stemming
Distance measure
Clustering
Output
Vivsimo
Commercial
http://www.vivisimo.com/
Online
Hierarchical
Conceptual
Other
Mapuccino (IBM)
(Su et. al. 2001)
(Maarek et. al. 2000)

http://www.alphaworks.ibm.com/tech/mapuccino
Relatively efficient AHC (O(n2))
Similarity based on vector-space model
Only usage statistics used as input
Recursive Density Based Clustering
SHOC
(Zhang and Dong 2004)

Grouper-like
Key phrase discovery
References
Douglass Cutting, David Karger, Jan Pedersen, and John W.

Tukey, Scatter/Gather: A Cluster-based Approach to Browsing
Large Document Collections, 1992.
Proceedings of the 15th Annual International ACM/SIGIR Conference, Copenhagen.
O. Zamir and O. Etzioni, Grouper: a dynamic clustering

interface to web search results, May 1999.
In Proceedings of the Eighth International World Wide Web Conference, Toronto, CanadaM. Steinbach, G.
Y.S. Maarek, R. Fagin, I.Z. Ben-Shaul, D. Pelleg, Ephemeral

document clustering for web applications, 2000.
Technical Report RJ 10186, IBM Research
Zhong Su, Qiang Yang, HongHiang Zhang, Xiaowei Xu and Yuhen

Hu, Correlation-based Document Clustering using Web
Logs, 2001.
J. Stefanowski, D. Weiss. Carrot2 and Language Properties in
Web Search Results Clustering, 2003.
In: Lecture Notes in Artificial Intelligence: Advances in Web Intelligence, Proceedings of the First
International Atlantic Web Intelligence Conference, Madrit, Spain, vol. 2663 (), pp. 240249
Dell Zhang, Yisheng Dong. Semantic, Hierarchical, Online

Clustering of Web Search Results, Apr 2004.
In Proceedings of the 6th Asia Pacific Web Conference (APWEB), Hangzhou, China
Thank you
Questions?
http://www.di.unipi.it/~iwona/Clust
ering.ppt

Clustering Web Search Results: Iwona Białynicka-Birula

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Clustering Web Search Results: Iwona Białynicka-Birula

Uploaded by

Copyright:

Available Formats

Clustering Web Search Results

Clustering the act of grouping similar

Clustering vs. Classification

Classification assigns objects to

Iwona Biaynicka-Birula - Clustering Web Search Results

Why cluster web search results?

Flat ranked list not enough

Relationships between the results

Documents pertaining to different topics

Aids user-engine interaction

Iwona Biaynicka-Birula - Clustering Web Search Results

Why not just document clustering?

Web search results clustering is a

Immediate response to query

Main goal is to aid the user in finding

Iwona Biaynicka-Birula - Clustering Web Search Results

Online or offline clustering?

How to define similarity?

How to group similar documents?

Iwona Biaynicka-Birula - Clustering Web Search Results

Suffix Tree Clustering (Grouper)

Iwona Biaynicka-Birula - Clustering Web Search Results

Agglomerative hierarchical clustering

Iwona Biaynicka-Birula - Clustering Web Search Results

Clustering result: dendrogram

Iwona Biaynicka-Birula - Clustering Web Search Results

Various ways of calculating cluster

K-means clustering (k=3)

Iwona Biaynicka-Birula - Clustering Web Search Results

Iwona Biaynicka-Birula - Clustering Web Search Results

Iwona Biaynicka-Birula - Clustering Web Search Results

(Cutting et. al. 1992)

Buckshot fast for online clustering

Iwona Biaynicka-Birula - Clustering Web Search Results

(Zamir and Etzioni 1997, 1999)

Iwona Biaynicka-Birula - Clustering Web Search Results

Suffix Tree Clustering (STC)

Iwona Biaynicka-Birula - Clustering Web Search Results

Step 2: Suffix tree construction

Step 3: Merging base clusters

Highly overlapping clusters are merged

Iwona Biaynicka-Birula - Clustering Web Search Results

(Stefanowski and Weiss 2003)

Input (i.e. snippets from other search engines)

Iwona Biaynicka-Birula - Clustering Web Search Results

Iwona Biaynicka-Birula - Clustering Web Search Results

(Su et. al. 2001)

(Maarek et. al. 2000)

(Zhang and Dong 2004)

Iwona Biaynicka-Birula - Clustering Web Search Results

Douglass Cutting, David Karger, Jan Pedersen, and John W.

Proceedings of the 15th Annual International ACM/SIGIR Conference, Copenhagen.

O. Zamir and O. Etzioni, Grouper: a dynamic clustering

Y.S. Maarek, R. Fagin, I.Z. Ben-Shaul, D. Pelleg, Ephemeral

Zhong Su, Qiang Yang, HongHiang Zhang, Xiaowei Xu and Yuhen

Dell Zhang, Yisheng Dong. Semantic, Hierarchical, Online

Iwona Biaynicka-Birula - Clustering Web Search Results

Iwona Biaynicka-Birula - Clustering Web Search Results

You might also like