Information Retrieval

Admin

Project 3 due tomorrow at 9pm

Extra OH today and tomorrow (see Piazza)

Midterm Wed. 2/24 7pm


Review Session: Monday 22nd, 7-8:30 PM, COOL G906 (Cooley Lab - White
auditorium in basement)

Q1: Consider the following short documents.


1) vim is the only real editor
2) why do people use emacs?
3) vim emacs evil mode?
4) nano is the best editor

We want to make these documents searchable

Q1: Consider the following short documents.


1) vim is the only real editor
2) why do people use emacs?
3) vim emacs evil mode?
4) nano is the best editor

First, remove stopwords: {is, the}

Q1: Consider the following short documents.


1) vim only real editor
2) why do people use emacs?
3) vim emacs evil mode?
4) nano best editor

Now, build term frequency vectors for each document

Q1: Consider the following short documents.


1) vim only real editor
2) why do people use emacs?
3) vim emacs evil mode?
4) nano best editor
First, build the global dictionary:
dictionary = {vim, only, real, editor, why, do, people, use, emacs, evil, mode, nano, best}
Then, build the tf vectors for each document

Q1: Consider the following short documents.


1) vim only real editor = <1,1,1,1,0,0,0,0,0,0,0,0,0>
2) why do people use emacs? = <0,0,0,0,1,1,1,1,1,0,0,0,0>
3) vim emacs evil mode? = <1,0,0,0,0,0,0,0,1,1,1,0,0>
4) nano best editor = <0,0,0,1,0,0,0,0,0,0,0,1,1>
First, build the global dictionary:
dictionary = {vim, only, real, editor, why, do, people, use, emacs, evil, mode, nano, best}
Then, build the tf vectors for each document
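
The steps so far (remove stopwords, build the dictionary, build the tf vectors) can be sketched in a few lines of Python. This is only an illustration: the `tokenize` helper and its regex are assumptions, not part of the slides, and the dictionary is kept in first-seen order so the vectors line up with the ones shown above.

```python
import re

STOPWORDS = {"is", "the"}

docs = [
    "vim is the only real editor",
    "why do people use emacs?",
    "vim emacs evil mode?",
    "nano is the best editor",
]

def tokenize(text):
    """Lowercase, keep runs of letters (drops '?'), and remove stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

tokenized = [tokenize(d) for d in docs]

# Global dictionary in first-seen order, matching the ordering on the slide.
dictionary = []
for tokens in tokenized:
    for term in tokens:
        if term not in dictionary:
            dictionary.append(term)

# One tf vector per document: how often each dictionary term occurs in it.
tf_vectors = [[tokens.count(term) for term in dictionary] for tokens in tokenized]

for number, vector in enumerate(tf_vectors, start=1):
    print(number, vector)
```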

Q1: Consider the following short documents.


1) vim only real editor
2) why do people use emacs?
3) vim emacs evil mode?
4) nano best editor
Draw the inverted index for this document collection

Q1: Consider the following short documents.


1) vim only real editor
2) why do people use emacs?
3) vim emacs evil mode?
4) nano best editor
Draw the inverted index for this document
collection

vim: (1, 3)
only: (1)
real: (1)
editor: (1, 4)
why: (2)
do: (2)
people: (2)
use: (2)
emacs: (2, 3)
evil: (3)
mode: (3)
nano: (4)
best: (4)

Q1: Consider the following short documents.


1) vim only real editor
2) why do people use emacs?
3) vim emacs evil mode?
4) nano best editor
Draw the inverted index for this document
collection -> Keep this sorted!

best: (4)
do: (2)
editor: (1, 4)
emacs: (2, 3)
evil: (3)
mode: (3)
nano: (4)
only: (1)
people: (2)
real: (1)
use: (2)
vim: (1, 3)
why: (2)
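
A minimal sketch of building this sorted index, assuming the documents are already tokenized with stopwords removed. The function name `build_inverted_index` is made up for illustration.

```python
from collections import defaultdict

# Documents after stopword removal, keyed by document number.
docs = {
    1: ["vim", "only", "real", "editor"],
    2: ["why", "do", "people", "use", "emacs"],
    3: ["vim", "emacs", "evil", "mode"],
    4: ["nano", "best", "editor"],
}

def build_inverted_index(docs):
    """Map each term to the sorted list of documents that contain it."""
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in tokens:
            postings[term].add(doc_id)
    # Sort the terms and each postings list so lookups and merges stay cheap.
    return {term: sorted(postings[term]) for term in sorted(postings)}

index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(f"{term}: {doc_ids}")
```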

Q1: Consider the following short documents.


1) vim only real editor
2) why do people use emacs?
3) vim emacs evil mode?
4) nano best editor
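Consider the query "vim editor":
vim: (1, 3)
editor: (1, 4)
Using the inverted index lets us answer this query quickly: we read only these two
postings lists instead of scanning every document. Document 1 contains both terms;
documents 1, 3, and 4 contain at least one of them.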

best: (4)
do: (2)
editor: (1, 4)
emacs: (2, 3)
evil: (3)
mode: (3)
nano: (4)
only: (1)
people: (2)
real: (1)
use: (2)
vim: (1, 3)
why: (2)
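
One reason to keep postings sorted is that two lists can be intersected in a single linear pass. The sketch below assumes the query means "both terms" (AND) and also shows the "either term" (OR) reading; the slides do not say which one is intended.

```python
def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

# Only the two postings lists needed for the query "vim editor".
index = {"vim": [1, 3], "editor": [1, 4]}

both = intersect(index["vim"], index["editor"])             # AND query
either = sorted(set(index["vim"]) | set(index["editor"]))   # OR query
print(both, either)
```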

Q1: Consider the following short documents.


1) vim only real editor
2) why do people use emacs?
3) vim emacs evil mode?
4) nano best editor
What is the IDF of vim? emacs? nano?

best: (4)
do: (2)
editor: (1, 4)
emacs: (2, 3)
evil: (3)
mode: (3)
nano: (4)
only: (1)
people: (2)
real: (1)
use: (2)
vim: (1, 3)
why: (2)

Q1: Consider the following short documents.


1) vim only real editor
2) why do people use emacs?
3) vim emacs evil mode?
4) nano best editor
What is the IDF of vim? emacs? nano?
IDF(term) = log(N / df), with N = 4 documents and df = number of documents containing the term
vim: log(4/2) = log(2)
emacs: log(4/2) = log(2)
nano: log(4/1) = log(4)

best: (4)
do: (2)
editor: (1, 4)
emacs: (2, 3)
evil: (3)
mode: (3)
nano: (4)
only: (1)
people: (2)
real: (1)
use: (2)
vim: (1, 3)
why: (2)
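
The IDF values can be read straight off the inverted index, since the document frequency of a term is the length of its postings list. A small sketch, assuming the natural log (the slides leave the base unspecified):

```python
import math

N = 4  # number of documents in the collection

index = {
    "best": [4], "do": [2], "editor": [1, 4], "emacs": [2, 3],
    "evil": [3], "mode": [3], "nano": [4], "only": [1],
    "people": [2], "real": [1], "use": [2], "vim": [1, 3], "why": [2],
}

def idf(term):
    """log(N / df), where df is the length of the term's postings list."""
    return math.log(N / len(index[term]))

for term in ("vim", "emacs", "nano"):
    print(term, idf(term))
```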

Q1: Consider the following short documents.


1) vim only real editor
2) why do people use emacs?
3) vim emacs evil mode?
4) nano best editor
What is the tf-idf vector of document 1?

Q1: Consider the following short documents.


1) vim only real editor
2) why do people use emacs?
3) vim emacs evil mode?
4) nano best editor
What is the tf-idf vector of document 1?
tf: <1,1,1,1,0,0,0,0,0,0,0,0,0>
idf: vim=log(4/2), only=log(4/1), real=log(4/1), editor=log(4/2)
tf-idf: <log(2), log(4), log(4), log(2), 0,0,0,0,0,0,0,0,0>
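
As a check on the arithmetic, the tf-idf vector is just the tf vector multiplied entry by entry with the idf of each dictionary term. A sketch, assuming natural log and the document frequencies read off the inverted index:

```python
import math

N = 4
dictionary = ["vim", "only", "real", "editor", "why", "do", "people",
              "use", "emacs", "evil", "mode", "nano", "best"]

# Document frequency of each term, taken from the inverted index.
df = {"vim": 2, "only": 1, "real": 1, "editor": 2, "why": 1, "do": 1,
      "people": 1, "use": 1, "emacs": 2, "evil": 1, "mode": 1,
      "nano": 1, "best": 1}

tf_doc1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Entry-wise product of term frequency and idf = log(N / df).
tfidf_doc1 = [tf * math.log(N / df[term]) for tf, term in zip(tf_doc1, dictionary)]
print(tfidf_doc1)
```

The result has 13 entries, one per dictionary term; only the first four are nonzero.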

Q2: Kendall's Tau


A startup search engine is trying to compare its query results to Google's.
Google's results: ABCXYZ
Startup's results: ABZYXC

They claim their results are comparable. What do you think?

Q2: Kendall's Tau


A startup search engine is trying to compare its query results to Google's.
Google's results: ABCXYZ
Startup's results: ABZYXC
Number of matching (concordant) pairs, C: 9
Number of non-matching (discordant) pairs, D: 6
(9-6)/((1/2)(6)(5)) = 3/15 = 0.2
OR: (C-D)/(C+D) = (9-6)/(9+6) = 3/15 = 0.2

Q2: Kendall's Tau


A startup search engine is trying to compare its query results to Google's.
Google's results: ABCXYZ
What if the startup was able to change their results to: ABCZYX

Q2: Kendall's Tau


A startup search engine is trying to compare its query results to Google's.
Google's results: ABCXYZ
What if the startup was able to change their results to: ABCZYX
Number of matching (concordant) pairs, C: 12
Number of non-matching (discordant) pairs, D: 3
(12-3)/((1/2)(6)(5)) = 9/15 = 0.6
OR: (C-D)/(C+D) = (12-3)/(12+3) = 9/15 = 0.6
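
Counting pairs by hand is error-prone, so here is a small sketch that does it for both startup rankings. It assumes both lists rank the same items with no ties; the function name `kendall_tau` is illustrative.

```python
from itertools import combinations

def kendall_tau(reference, other):
    """Return (concordant, discordant, tau) for two rankings of the same items."""
    position = {item: rank for rank, item in enumerate(other)}
    concordant = discordant = 0
    # combinations() yields each pair (a, b) with a ranked before b in reference.
    for a, b in combinations(reference, 2):
        if position[a] < position[b]:
            concordant += 1   # the other ranking agrees on this pair
        else:
            discordant += 1   # the other ranking flips this pair
    tau = (concordant - discordant) / (concordant + discordant)
    return concordant, discordant, tau

print(kendall_tau("ABCXYZ", "ABZYXC"))  # 9 concordant, 6 discordant, tau = 0.2
print(kendall_tau("ABCXYZ", "ABCZYX"))  # 12 concordant, 3 discordant, tau = 0.6
```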

Q3: Precision and Recall Question


Consider a query for which there are 100 relevant URLs in the universe. A search
engine computes a ranking that contains a relevant URL in positions 1, 3, 4, 5, 6,
and 8. It has irrelevant URLs in 2, 7, 9, and 10.
What is the precision and recall of this search engine if it emits only its top-10
results?

Q3: Precision and Recall Question


What are precision and recall again?
https://en.wikipedia.org/wiki/Precision_and_recall

Q3: Precision and Recall Question


Consider a query for which there are 100 relevant URLs in the universe. A search
engine computes a ranking that contains a relevant URL in positions 1, 3, 4, 5, 6,
and 8. It has irrelevant URLs in 2, 7, 9, and 10.
What is the precision and recall of this search engine if it emits only its top-10
results?
Precision = 6/10 = .6, Recall = 6/100 = .06

Q3: Precision and Recall Question


Consider a query for which there are 100 relevant URLs in the universe. A search
engine computes a ranking that contains a relevant URL in positions 1, 3, 4, 5, 6,
and 8. It has irrelevant URLs in 2, 7, 9, and 10.
What is the precision and recall of this search engine if it emits only its top-5
results?

Q3: Precision and Recall Question


Consider a query for which there are 100 relevant URLs in the universe. A search
engine computes a ranking that contains a relevant URL in positions 1, 3, 4, 5, 6,
and 8. It has irrelevant URLs in 2, 7, 9, and 10.
What is the precision and recall of this search engine if it emits only its top-5
results?
Precision = 4/5 = .8, Recall = 4/100 = .04
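
Both answers follow from the same two ratios: relevant results returned over results returned (precision), and relevant results returned over all relevant results (recall). A sketch, with `precision_recall_at_k` as a made-up helper name:

```python
def precision_recall_at_k(relevant_positions, k, total_relevant):
    """Precision and recall when only the top-k results are emitted."""
    hits = sum(1 for position in relevant_positions if position <= k)
    return hits / k, hits / total_relevant

relevant_positions = [1, 3, 4, 5, 6, 8]  # ranks holding a relevant URL
total_relevant = 100                     # relevant URLs in the universe

print(precision_recall_at_k(relevant_positions, 10, total_relevant))
print(precision_recall_at_k(relevant_positions, 5, total_relevant))
```

Cutting the list from 10 to 5 raises precision and lowers recall, which is the usual trade-off between the two.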

Q4: HITS
What does HITS stand for?

How does the algorithm work?

Q4: HITS
What does HITS stand for?
Hyperlink-Induced Topic Search
How does the algorithm work?
It starts with the user's query to create the root set (the pages that match the query). It then
builds the base set by adding pages that link to, or are linked from, the root set. Once you have
this focused subgraph, run the algorithm to compute hub and auth scores.

Q4: HITS
What are hubs and authorities?

Q4: HITS
What are hubs and authorities?
Hubs are central repositories - they have links to good authorities
Authorities are the sources of information - they are linked to by good hubs

Q4: HITS
How does HITS differ from PageRank?

Q4: HITS
How does HITS differ from PageRank?
HITS is based on the user's query; PageRank is computed independently of any query.
Each node maintains two scores (hub and auth) instead of a single rank.
Each round requires an explicit normalization step.
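
A rough sketch of the hub/auth iteration on an already-built base set. The four-page graph is hypothetical, and normalizing by the Euclidean norm is one common choice, assumed here because the slides only say that normalization is required.

```python
import math

def hits(graph, rounds=20):
    """graph maps each page to the pages it links to. Returns (hub, auth)."""
    nodes = sorted(set(graph) | {v for targets in graph.values() for v in targets})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(rounds):
        # Authority score: sum of hub scores of the pages linking to this page.
        auth = {n: sum(hub[u] for u in nodes if n in graph.get(u, ())) for n in nodes}
        # Hub score: sum of authority scores of the pages this page links to.
        hub = {n: sum(auth[v] for v in graph.get(n, ())) for n in nodes}
        # Explicit normalization step so the scores do not grow without bound.
        auth_norm = math.sqrt(sum(x * x for x in auth.values()))
        hub_norm = math.sqrt(sum(x * x for x in hub.values()))
        auth = {n: x / auth_norm for n, x in auth.items()}
        hub = {n: x / hub_norm for n, x in hub.items()}
    return hub, auth

# Hypothetical base set: h1 and h2 are link pages, a1 and a2 are content pages.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(graph)
print(hub)
print(auth)
```

In this toy graph, a1 ends up with the highest auth score (both link pages point to it) and h1 with the highest hub score (it points to both content pages).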

Q5: PageRank
What does the .85 value for d represent?

If we assume most internet users are mobile, should we raise or lower the value of
d?

Q5: PageRank
What does the .85 value for d represent?
d is the damping factor: the probability that a user follows a link from the current page. So, 85% of
the time they follow by clicking links, and 15% of the time they navigate to a new
page.
If we assume most internet users are mobile, should we raise or lower the value of
d?
Probably raise. Mobile users are more likely to follow links, and less likely to
navigate to new pages (because navigating to new pages requires typing in a
URL)

Q5: PageRank
What are some issues with PageRank as a metric of page quality?

Q5: PageRank
What are some issues with PageRank as a metric of page quality?
- Link farms, spam bots, etc. can skew rankings
- Links may not be meant as an endorsement, e.g., social media shares
- AJAX and JavaScript can make traditional surfing difficult
- Content can be behind a login, e.g., the Facebook feed is not searchable

Q5: PageRank: An example

[Figure: link graph over nodes A, B, C, D, and E]

Assume d=.85

Initial values:
A = .2
B = .2
C = .2
D = .2
E = .2

Q5: PageRank: An example

[Figure: link graph over nodes A, B, C, D, and E]

Assume d=.85

After one iteration:
A = .15/5 + .85*(.2/2) = .115
B = .15/5 + .85*(.2/2 + .2) = .285
C = .15/5 + .85*(.2/2) = .115
D = .15/5 + .85*(.2/2 + .2/2) = .2
E = .15/5 + .85*(.2 + .2/2) = .285

Q5: PageRank: An example

[Figure: link graph over nodes A, B, C, D, and E]

Assume d=.85

Previous value, then the value after another iteration:
A = .115, then (.15/5) + (.85)(.285/2) = .151
B = .285, then (.15/5) + (.85)(.115/2 + .2) = .249
C = .115, then (.15/5) + (.85)(.115/2) = .079
D = .2, then (.15/5) + (.85)(.115/2 + .285/2) = .2
E = .285, then (.15/5) + (.85)(.285 + .115/2) = .321
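
The two rounds of arithmetic can be reproduced with a short update function. Note the assumption: the edge list below is a guess, chosen only because it is consistent with the numbers on the slides (for example, A receiving half of E's score). The graph in the original figure may differ in edges that the arithmetic cannot distinguish.

```python
def pagerank_step(ranks, links, d=0.85):
    """One update: PR(p) = (1 - d)/N + d * sum(PR(q) / outdegree(q)) over q linking to p."""
    n = len(ranks)
    updated = {}
    for page in ranks:
        incoming = sum(ranks[src] / len(targets)
                       for src, targets in links.items() if page in targets)
        updated[page] = (1 - d) / n + d * incoming
    return updated

# ASSUMED edges (page -> pages it links to), consistent with the slide values.
links = {
    "A": ["B", "C"],
    "B": ["E"],
    "C": ["D", "E"],
    "D": ["B"],
    "E": ["A", "D"],
}

ranks = {page: 0.2 for page in links}       # every page starts at 1/5
first = pagerank_step(ranks, links)
second = pagerank_step(first, links)

for page in links:
    print(page, round(first[page], 3), round(second[page], 3))
```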
