Iwona Białynicka-Birula
Overview
What is clustering?
Applying clustering to web search results
Clustering algorithms
Case studies
Related topics not covered
Clustering
Clustering in general
Document clustering in general
Other search and browsing aids
Classification
Visualization
Query expansion
Iwona Białynicka-Birula - Clustering Web Search Results
What is clustering?
Browsing
Help the user express their need
Some requirements
Fast
Flexible
User-oriented
Main issues
Entire documents
Snippets
Structure information (links)
Other data (e.g. click-through)
Use stop word lists, stemming, etc.
Content (e.g. vector-space model)
Link analysis
Usage statistics
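A minimal sketch of the content-based option: snippets are turned into term-frequency vectors after stop word removal and compared with cosine similarity. The stop word list, the function names and the example snippets are assumptions made for illustration; stemming is left out to keep the sketch short.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "on", "for", "to", "is"}

def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def term_vector(text):
    """Represent a snippet as a sparse term-frequency vector."""
    return Counter(tokenize(text))

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical snippets for an ambiguous query.
snippets = [
    "Jaguar cars for sale in the UK",
    "Used jaguar cars and dealers",
    "The jaguar is a big cat of the Americas",
]
vectors = [term_vector(s) for s in snippets]
```

With these snippets the two car-related results come out more similar to each other than either is to the animal-related one, which is the signal a distance-based clustering algorithm would build on.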
Clustering algorithms
Flat or hierarchical?
Overlapping?
Hard or soft?
Incremental?
Predefined cluster number?
Requiring explicit similarity measure? Distance measure?
Clustering algorithms
Distance-based
Hierarchical
Agglomerative Hierarchical Clustering (AHC)
Flat
K-means (can be fuzzy)
Single-pass (incremental)
Other
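A minimal sketch of hard (non-fuzzy) K-means on numeric vectors, to illustrate the flat, distance-based family. Seeding with the first k points and the function names are simplifying assumptions; practical implementations usually seed randomly and work on document vectors.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iterations=100):
    """Hard k-means; the seeds are simply the first k points (a simplification)."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids
```

Note that the number of clusters k has to be given in advance, which is one of the classification questions listed earlier.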
AHC variants
single-link (minimum)
complete-link (maximum)
group-average (average)
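The three variants differ only in how the distance between two clusters is derived from the pairwise distances of their members. A naive sketch, assuming a caller-supplied distance function `dist`; the function names and the quadratic search for the closest pair are illustrative choices, not taken from any particular system.

```python
def linkage_distance(cluster_a, cluster_b, dist, method):
    """Distance between two clusters under the chosen linkage."""
    pair_distances = [dist(a, b) for a in cluster_a for b in cluster_b]
    if method == "single":      # minimum pairwise distance
        return min(pair_distances)
    if method == "complete":    # maximum pairwise distance
        return max(pair_distances)
    if method == "average":     # mean pairwise distance
        return sum(pair_distances) / len(pair_distances)
    raise ValueError("unknown linkage method: " + method)

def agglomerative(points, dist, method="single", num_clusters=1):
    """Start with singletons and repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]
    merges = []  # history of (left, right, distance); together they form the hierarchy
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage_distance(clusters[i], clusters[j], dist, method)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges
```

For example, `agglomerative([0.0, 1.0, 5.0, 6.0, 20.0], lambda a, b: abs(a - b), "single", 2)` stops with two clusters, and the recorded merges can be read as a dendrogram.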
Single-pass
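A minimal sketch of single-pass (incremental) clustering: each item is seen once and either joins the nearest existing cluster or starts a new one. The distance threshold, the centroid representation and the function names are assumptions made for illustration.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_pass(points, threshold):
    """Process points in arrival order, never revisiting earlier decisions."""
    clusters = []  # each cluster: {"centroid": tuple, "members": list of tuples}
    for p in points:
        best, best_distance = None, None
        for cluster in clusters:
            d = euclidean(p, cluster["centroid"])
            if best_distance is None or d < best_distance:
                best, best_distance = cluster, d
        if best is not None and best_distance <= threshold:
            # Close enough: join the nearest cluster and refresh its centroid.
            best["members"].append(p)
            n = len(best["members"])
            best["centroid"] = tuple(sum(x) / n for x in zip(*best["members"]))
        else:
            # Too far from everything seen so far: start a new cluster.
            clusters.append({"centroid": tuple(p), "members": [p]})
    return clusters
```

The result depends on the order in which items arrive, which is the usual price of processing results incrementally as they come in.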
Selected systems
Scatter/Gather
Grouper
Carrot2
Vivisimo
Mapuccino
(Su et al. 2001)
SHOC
Scatter/Gather
Grouper
Linear
Incremental
Overlapping
Can be extended to hierarchical
STC algorithm
Step 1: Cleaning
Stemming
Sentence boundary identification
Punctuation elimination
Produces base clusters (internal nodes)
Base clusters are scored based on size and phrase score (which depends on length and word quality)
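A simplified sketch of the cleaning and base-cluster steps. Instead of building a suffix tree it enumerates word n-grams within each sentence, which finds the same shared phrases on small inputs but is not linear-time. The stop word list, the phrase length limit and the exact shape of `phrase_score` are assumptions made for illustration, and stemming is omitted.

```python
import re
from collections import defaultdict

# Illustrative stop word list (an assumption); stop words add nothing to a phrase score.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "for", "to", "is"}

def clean(snippet):
    """Split into sentences, lowercase, and strip punctuation (no stemming here)."""
    sentences = re.split(r"[.!?;:]+", snippet.lower())
    return [re.findall(r"[a-z0-9]+", s) for s in sentences if s.strip()]

def base_clusters(snippets, max_phrase_len=4, min_docs=2):
    """Map each phrase shared by at least min_docs snippets to the set of snippet ids."""
    phrase_docs = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        for words in clean(snippet):
            for start in range(len(words)):
                stop = min(start + max_phrase_len, len(words))
                for end in range(start + 1, stop + 1):
                    phrase_docs[tuple(words[start:end])].add(doc_id)
    return {p: docs for p, docs in phrase_docs.items() if len(docs) >= min_docs}

def phrase_score(phrase):
    """Hypothetical phrase score: counts non-stop words, penalizes single words, caps at 6."""
    effective = sum(1 for w in phrase if w not in STOP_WORDS)
    if effective == 0:
        return 0.0
    if effective == 1:
        return 0.5
    return float(min(effective, 6))

def score(phrase, docs):
    """Base cluster score: cluster size times phrase score."""
    return len(docs) * phrase_score(phrase)

# Hypothetical snippets.
snippets = [
    "Cheap flights to Rome. Book now!",
    "Find cheap flights to Paris",
    "Rome travel guide",
]
clusters = base_clusters(snippets)
```

Here the phrase "cheap flights" yields a base cluster containing the first two snippets, and it outscores the single-word cluster for "rome" because longer phrases are rewarded.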
Carrot2
Vivisimo
Commercial
http://www.vivisimo.com/
Online
Hierarchical
Conceptual
Other
Mapuccino (IBM)
SHOC
References
In Proceedings of the Eighth International World Wide Web Conference, Toronto, Canada
M. Steinbach, G.
Thank you
Questions?
http://www.di.unipi.it/~iwona/Clustering.ppt