
(Mostly) Parallel and Distributed Information Retrieval

March 27, 2006

Some slides were adapted from those by James Allan, Jamie Callan,
Zhang Gang, Ken Hoganson, Weiyi Meng, and Clement Yu
Roadmap for Today

‣ Parallel Information Retrieval (PIR)

‣ Distributed Information Retrieval (DIR)

‣ Metasearch Engine Example

‣ Retrieval (a higher level view)

‣ Recap
Parallel Information Retrieval

‣ The goal: Greater simultaneous execution of parts of a retrieval algorithm.

‣ Two major ways of doing this:

- new retrieval strategies that are parallel-friendly

- adapt existing, well-studied IR algorithms


Parallel IR - New Retrieval Strategies

‣ neural networks

‣ genetic algorithms

‣ co-occurrence analysis and clustering


Parallel IR - Adaptation of Existing Algorithms

‣ singular value decomposition

‣ the construction and weighting of the term-document matrix

‣ pattern matching algorithms

‣ text signature algorithms

‣ inverted index file algorithms (see the sketch below)
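As a concrete instance of the last item, the sketch below document-partitions a toy collection, builds an inverted index per partition, and searches the partitions concurrently before combining their postings. The data, the thread-based parallelism, and the term-frequency scoring are illustrative assumptions, not part of the slides.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def build_index(docs):
    """Build a tiny inverted index (term -> {doc_id: term frequency}) for one partition."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search_partition(index, query_terms):
    """Score documents in one partition by summed term frequency (illustrative scoring)."""
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return scores

docs = {1: "parallel information retrieval", 2: "inverted index construction",
        3: "parallel inverted index file algorithms", 4: "text signature methods"}
partitions = [{d: t for d, t in docs.items() if d % 2 == 0},
              {d: t for d, t in docs.items() if d % 2 == 1}]      # document partitioning
indexes = [build_index(p) for p in partitions]                    # each shard built independently

with ThreadPoolExecutor() as pool:                                # shards searched concurrently
    partials = list(pool.map(search_partition, indexes, [["parallel", "index"]] * len(indexes)))
merged = {doc: score for part in partials for doc, score in part.items()}
print(sorted(merged.items(), key=lambda kv: -kv[1]))              # doc 3 ranks first
```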


Classic Model: Parallel Processing

An example process that takes 10 time units, run on a machine with 4 processors:

‣ A process can be divided into serial and parallel portions; the parallel portions are executed concurrently.

‣ S1 and S2 are serial (non-parallel) portions.

‣ All A parts can be executed concurrently, and all B parts can be executed concurrently, but every A part must complete before any B part starts.

Executed on a single processor: S1 A A A A B B B B S2 (serial time: 10 time units)

Executed in parallel on 4 processors: S1, then the four A parts together, then the four B parts together, then S2 (parallel time: 4 time units)
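The arithmetic behind this example can be made explicit. The minimal timing model below, assuming one time unit per part, reproduces the serial time of 10 and the parallel time of 4; the two serial units (S1 and S2) are what cap the speedup at 2.5 rather than 4, which is Amdahl's law in miniature.

```python
import math

# Hypothetical cost model for the slide's example: S1 and S2 take one time unit each,
# and there are four A parts and four B parts of one unit each. Every A part must
# finish before any B part starts, so A and B form two separate parallel phases.
def serial_time(n_a=4, n_b=4):
    return 1 + n_a + n_b + 1                      # S1, A parts, B parts, S2 in sequence

def parallel_time(processors, n_a=4, n_b=4):
    a_phase = math.ceil(n_a / processors)         # A parts run concurrently
    b_phase = math.ceil(n_b / processors)         # B parts start only after all A parts
    return 1 + a_phase + b_phase + 1              # S1 and S2 remain serial

t_serial, t_parallel = serial_time(), parallel_time(4)
print(t_serial, t_parallel, t_serial / t_parallel)   # 10, 4, and a speedup of 2.5
```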
Roadmap for Today

‣ Parallel Information Retrieval (PIR)

‣ Distributed Information Retrieval (DIR)

‣ Metasearch Engine Example

‣ Retrieval (a higher level view)

‣ Recap
Distributed Information Retrieval (DIR)

‣ IR usually assumes a single collection of documents.

‣ What is a collection?

- A single source, e.g., the Durham Herald Sun (what time period?)

- A single location, e.g., just the SILS Library?

- A set of libraries, e.g., all of the UNC-CH libraries?

‣ Distributed IR: searching when there is more than one collection

- Local environments, e.g., a large collection is partitioned

- Wide-area environments, e.g., corporate network, Internet

(Example collections: UNC-CH, Chicago, NY Times, Brown, Apple)


A General Model for Distributed Information Retrieval

(Figure: n collections, DB1 through DBn, each with its own search engine. A distributed search engine sits in front of them and holds a resource description of each DB. The user's query goes to the distributed search engine, which forwards it to the selected DBs, gathers the results returned by those DBs, and produces a single merged result list.)

Four steps:

1. Find out what each DB contains
2. Decide which DBs to search
3. Search the selected DBs
4. Merge the results returned by the DBs
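A minimal sketch of these four steps, in Python, appears below. The Engine interface, the df-based selection heuristic, and the naive merge are illustrative assumptions; they stand in for whatever resource descriptions, selection algorithm, and merging strategy a real system would use.

```python
# A hypothetical broker implementing the four steps above.

class Engine:
    def __init__(self, name, docs):               # docs: {doc_id: text}
        self.name, self.docs = name, docs

    def describe(self):                            # step 1: per-term document frequencies
        df = {}
        for text in self.docs.values():
            for term in set(text.lower().split()):
                df[term] = df.get(term, 0) + 1
        return df

    def search(self, query):                       # step 3: rank own docs by matched query terms
        q = set(query.lower().split())
        hits = [(len(q & set(text.lower().split())), doc) for doc, text in self.docs.items()]
        return sorted((h for h in hits if h[0] > 0), reverse=True)

def distributed_search(query, engines, k=2):
    descriptions = {e: e.describe() for e in engines}                         # 1. what each DB contains
    selected = sorted(engines, reverse=True,
                      key=lambda e: sum(descriptions[e].get(t, 0)
                                        for t in query.lower().split()))[:k]  # 2. which DBs to search
    results = {e.name: e.search(query) for e in selected}                     # 3. search the selected DBs
    return sorted(((score, name, doc) for name, hits in results.items()
                   for score, doc in hits), reverse=True)                     # 4. merge the rankings

engines = [Engine("news", {1: "stock market news", 2: "sports scores"}),
           Engine("science", {3: "information retrieval research", 4: "market models"})]
print(distributed_search("market research", engines))
```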
Primary Motivations for Distributed IR

‣ Partition large collections across processors

- To increase speed

- Because of political or administrative requirements

‣ Ever-increasing amounts of data

‣ Networks, with hundreds or thousands of collections

- Consider the number of collections indexed on the Web

‣ Heterogeneous environments, many IR systems

‣ Economic costs of searching everything at a site

‣ Economic costs of searching everything on a network


Issues

‣ Site Description

‣ Collection Selection

‣ Searching

‣ Result merging: blending a set of document rankings

‣ Metrics
Site Description

‣ Contents

‣ Search Engine

‣ Services
Collection Selection

‣ Deciding which collection(s) to search

‣ Ranking collections for a query (a simple scoring sketch follows below)

‣ Selecting the best subset from a ranked list
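One well-studied family of selection algorithms (CORI is the usual example) ranks collections with a tf·idf-like formula computed over collection statistics rather than individual documents. The sketch below is a simplified variant of that idea; the constants, the use of document counts instead of collection word counts, and the sample statistics are illustrative assumptions rather than the exact published formula.

```python
import math

# stats: {collection: {"df": {term: document frequency}, "docs": number of documents}}
def avg_docs(stats):
    return sum(s["docs"] for s in stats.values()) / len(stats)

def collection_score(query_terms, name, stats, b=0.4):
    """CORI-flavoured rank value of one collection for a query (simplified, illustrative)."""
    c = len(stats)
    score = 0.0
    for term in query_terms:
        df = stats[name]["df"].get(term, 0)
        cf = sum(1 for s in stats.values() if term in s["df"])      # collections containing the term
        t = df / (df + 50 + 150 * stats[name]["docs"] / avg_docs(stats))
        i = math.log((c + 0.5) / max(cf, 1)) / math.log(c + 1.0)    # an "icf"-style component
        score += b + (1 - b) * t * i
    return score

stats = {"news":    {"df": {"market": 120, "election": 40}, "docs": 10000},
         "science": {"df": {"market": 5, "retrieval": 60},  "docs": 2000}}
ranked = sorted(stats, key=lambda name: collection_score(["market"], name, stats), reverse=True)
print(ranked)    # "news" should outrank "science" for this query
```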


Result Merging

‣ Different underlying corpus statistics

‣ Different search engines with different output information

‣ Two types of environment

- cooperative

- uncooperative
Some Result Merging Possibilities

‣ Collection at a time

‣ Round robin

‣ Relevance ranked

(Round-robin and relevance-ranked merging are sketched below.)
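The sketch below assumes each collection returns a ranked list of (score, document) pairs: round-robin interleaving uses only the ranks, while the relevance-ranked merge first applies a min-max normalization to each collection's scores (one common heuristic when scores from different engines are not directly comparable; other normalizations exist).

```python
from itertools import zip_longest

def round_robin(result_lists):
    """Interleave ranked lists: take the next item from each collection in turn."""
    merged = []
    for tier in zip_longest(*result_lists):
        merged.extend(doc for doc in tier if doc is not None)
    return merged

def normalized_merge(scored_lists):
    """Merge by score after min-max normalizing each collection's scores to [0, 1]."""
    merged = []
    for results in scored_lists:
        scores = [s for s, _ in results]
        lo, hi = min(scores), max(scores)
        for s, doc in results:
            norm = (s - lo) / (hi - lo) if hi > lo else 1.0
            merged.append((norm, doc))
    return [doc for _, doc in sorted(merged, reverse=True)]

a = [(12.0, "a1"), (9.5, "a2"), (2.0, "a3")]     # collection A's ranked results
b = [(0.91, "b1"), (0.40, "b2")]                 # collection B uses a different score scale
print(round_robin([[d for _, d in a], [d for _, d in b]]))   # ['a1', 'b1', 'a2', 'b2', 'a3']
print(normalized_merge([a, b]))                  # score-based merge after normalization
```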
Metrics

‣ Generality

‣ Effectiveness

‣ Efficiency

‣ Consistency of results

‣ Amount of manual effort


Roadmap for Today

‣ Parallel Information Retrieval (PIR)

‣ Distributed Information Retrieval (DIR)

‣ Metasearch Engine Example

‣ Retrieval (a higher level view)

‣ Recap
The Problem

How am I going to find the 5 best pages on "Enumerative Combinatorics"?

(Figure: the user faces n independent search engines, engine 1 through engine n, each searching its own text source, source 1 through source n.)
Metasearch Engine Solution

(Figure: the user submits a query through a user interface; a query dispatcher forwards it to search engines 1 through n, each searching its own text source, and a result merger combines their answers into a single result for the user.)


Some Observations

‣ Most sources are not useful for a given query.

‣ Sending a query to a useless source would:

- incur unnecessary network traffic

- waste local resources for evaluating the query

- increase the cost of merging the results

‣ Retrieving too many documents from a source is inefficient.

A More Efficient Metasearch Engine

(Figure: the same architecture as before, with one addition: a database selector sits in front of the query dispatcher, so the query is forwarded only to the sources the selector judges useful; the result merger then combines the answers from those sources. A minimal sketch of such a selector follows below.)
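The sketch assumes each source is summarized by per-term document frequencies; the slides do not prescribe any particular resource description, so the statistics, threshold, and cap here are all illustrative. The selector addresses the observations above by dropping sources with no matching terms and limiting the fan-out.

```python
def select_sources(query_terms, source_stats, max_sources=3, min_score=1):
    """Keep only sources likely to be useful: score each source by the summed document
    frequency of the query terms, drop sources below a threshold, and cap the fan-out."""
    scored = []
    for source, df in source_stats.items():
        score = sum(df.get(t, 0) for t in query_terms)
        if score >= min_score:                 # skip sources with no matching terms
            scored.append((score, source))
    return [s for _, s in sorted(scored, reverse=True)[:max_sources]]

stats = {"engine1": {"combinatorics": 40}, "engine2": {"sports": 900},
         "engine3": {"combinatorics": 3, "enumerative": 2}}
print(select_sources(["enumerative", "combinatorics"], stats))  # ['engine1', 'engine3']
```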


Roadmap for Today

‣ Parallel Information Retrieval (PIR)

‣ Distributed Information Retrieval (DIR)

‣ Metasearch Engine Example

‣ Retrieval (a higher level view)

‣ Recap
Retrieval
(A Higher Level View)

‣ Some of the various flavors

- data retrieval (DR)

- information retrieval (IR)

- text retrieval

- document retrieval

- fact retrieval

‣ IR is often confused with the others, most often with DR.

What is Different About IR?

Comparing IR to databases:

- Data: structured (databases) vs. unstructured (IR)

- Fields: clear semantics, e.g., SSN, age (databases) vs. no fields other than the text (IR)

- Queries: defined, e.g., relational algebra, SQL (databases) vs. free text ("natural language"), Boolean (IR)

- Recoverability: critical, e.g., concurrency control, recovery, atomic operations (databases) vs. downplayed, though still an issue (IR)

- Matching: exact, results are "correct" (databases) vs. imprecise, need to measure effectiveness (IR)

Information Retrieval Contrasted With Data Retrieval

‣ Retrieving items via SQL is an example of DR.

‣ Retrieving items via, say, Google, is an example of IR.

‣ SQL is an example of a data retrieval language.

- Its main use is for querying “relational” databases.

- The term “relational database” is actually a misnomer w.r.t. SQL.

• Why?

• In practice, does it really matter?

‣ What are some of the other differences between DR and IR?

‣ What specific “IR” retrieval model would describe SQL predicates? (See the illustration below.)
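To make the contrast concrete, the sketch below (hypothetical data and scoring, not from the slides) answers a similar information need both ways. The DR-style predicate behaves like a Boolean filter, which is essentially the answer to the last question above: records either satisfy it or they do not, while the IR-style query produces a graded ranking of partially matching documents.

```python
books = [
    {"id": 1, "year": 1999, "text": "modern information retrieval"},
    {"id": 2, "year": 2005, "text": "enumerative combinatorics volume one"},
    {"id": 3, "year": 1997, "text": "combinatorics of permutations"},
]

# DR style: an exact, Boolean predicate (the kind of condition an SQL WHERE clause expresses).
dr_answer = [b["id"] for b in books if b["year"] > 2000 and "combinatorics" in b["text"]]

# IR style: rank every document by how well it matches an imprecise keyword query.
query = {"enumerative", "combinatorics"}
ir_answer = sorted(books, key=lambda b: len(query & set(b["text"].split())), reverse=True)

print(dr_answer)                        # [2]: records either satisfy the predicate or they do not
print([b["id"] for b in ir_answer])     # [2, 3, 1]: a graded, relevance-style ordering
```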


Baeza-Yates and Ribeiro-Neto’s Classification of IR Models
(1999)

‣ User task: retrieval (ad hoc or filtering) or browsing

‣ Classic models: Boolean, vector, probabilistic

‣ Set-theoretic models: fuzzy, extended Boolean

‣ Algebraic models: generalized vector, latent semantic indexing, neural networks

‣ Probabilistic models: inference network, belief network

‣ Structured models: non-overlapping lists, proximal nodes

‣ Browsing: flat, structure guided, hypertext
Kuropka’s Classification of IR Models (2005)

Models are grouped along two dimensions: the mathematical basis of the model, and the properties of the model with respect to term dependencies (independent terms, immanent term dependencies, or transcendent term dependencies).

‣ Set-theoretic: Standard Boolean, Fuzzy Set, Extended Boolean

‣ Algebraic: Vector Space, Generalized Vector Space, Topic-based Vector Space, Balanced Topic-based Vector Space, Latent Semantic Indexing, Spreading Activation Neural Network, Backpropagation Neural Network

‣ Probabilistic: Binary Independence Retrieval, Language Models, Retrieval by Logical Imaging, Inference Network, Belief Network
Roadmap for Today

‣ Parallel Information Retrieval (PIR)

‣ Distributed Information Retrieval (DIR)

‣ Metasearch Engine Example

‣ Retrieval (a higher level view)

‣ Recap
Recap: Parallel Information Retrieval

‣ Focused on algorithms and processes within a computer

‣ Won’t work for all algorithms

‣ The benefit(s) must outweigh the cost(s)

‣ Opportunities

- parallelize some existing well-known algorithms

- develop parallelizable algorithms, when possible, for new retrieval strategies


Recap: Distributed Information Retrieval

‣ DIR systems provide:

- scalability in data volume

- scalability in performance: high throughput and low response times

- resilience to failures

‣ Challenges for DIR systems are:

- preserving response quality

- balancing workload among various query servers

- easy generation and maintenance of indexes


Recap: Retrieval

‣ IR and DR are fundamentally different!

‣ Ranked vs. Unranked

‣ Unstructured vs. Structured (this has become blurred)

‣ IR is based on relevance, DR is not

‣ In IR, the information need is assumed to be imprecise

‣ In DR, it is assumed to be a precise and complete specification


References

‣ Distributed Information Retrieval
- http://159.226.40.18/~zhanggang/research/slides/Distributed%20Information%20Retrieval.ppt
- http://www.ir.iit.edu/~nazli/CS495-Slides/DistributedIR_Web_05.pdf
- http://ciir.cs.umass.edu/irchallenges/presentations/metasearch-jamie.ppt

‣ Methodologies for Distributed Information Retrieval (1998)
- http://www.cs.rmit.edu.au/~jz/fulltext/icdcs98.pdf

‣ Dr. Ken Hoganson’s Lecture on Parallel Computing (2002)
- http://science.kennesaw.edu/~khoganso/CS8421/Intro-Parallel.ppt

‣ VLDB’99 Tutorial: Metasearch Engines: Solutions and Challenges (1999)
- http://www.cs.binghamton.edu/~meng/slides.d/vldbtut.ppt

‣ Building Efficient and Effective Metasearch Engines (March 2002)
- http://citeseer.ist.psu.edu/376145.html

‣ The Website you seek cannot be located but endless others exist (2001)
- http://citeseer.ist.psu.edu/433781.html

‣ Search Engine Watch
- http://www.searchenginewatch.com

Optional: If Time Permits and There Is Enough Interest

‣ DIR: The State of the Art

‣ Open Problems
Distributed IR: The State of the Art

‣ Representing collections by terms and frequencies is effective.

‣ Controlled vocabularies and schemas are not necessary.

‣ Collections and documents can be ranked with one algorithm (using different statistics).
- e.g., GlOSS, inference networks

‣ Rankings from different collections can be merged efficiently:
- with precisely normalized scores (Infoseek's method), or
- without precisely normalized document scores,
- with only minimal effort, and
- with only minimal communication between client and server.

‣ Large-scale distributed retrieval can be accomplished now.


Distributed IR: The State of the Art (continued)

‣ Most error occurs in ranking collections, not merging.

‣ Not clear that inverse collection frequency (icf) helps
- but maybe we just don't have enough collections yet

‣ State of the art is about 100 collections
- UMass has developed a 921-collection testbed
- CMU is pushing this to thousands of collections
- the major problem is …, not data

‣ Significant improvements are possible when fewer collections are searched, i.e., don't search the Federal Register in TREC.
- Counter-intuitive at first blush (better results by ignoring data)

‣ Many open problems

‣ Language modeling approaches recently developed
Open Problems

‣ Multiple representations
- stemming, stopwords, query processing, indexing
- cheating & spamming

‣ Multiple retrieval algorithms
- varying accuracy in rankings

‣ Thousands (millions?) of collections

‣ Effectiveness with 2-3 word queries

‣ How to integrate
- relevance feedback
- query expansion
- browsing
