Discussion 7

Information Retrieval
Admin
Project 3 due tomorrow at 9pm
Extra OH today and tomorrow (see Piazza)
Midterm Wed. 2/24 7pm

Review Session: Monday 22nd, 7-8:30 PM, COOL G906 (Cooley Lab - White
auditorium in basement)
Q1: Consider the following short documents.

1) vim is the only real editor
2) why do people use emacs?
3) vim emacs evil mode?
4) nano is the best editor
We want to make these documents searchable

First, remove stopwords: {is, the}

1) vim only real editor
2) why do people use emacs?
3) vim emacs evil mode?
4) nano best editor
Now, build term frequency vectors for each document

First, build the global dictionary:
dictionary = {vim, only, real, editor, why, do, people, use, emacs, evil, mode, nano, best}
Then, build the tf vectors for each document

1) vim only real editor = <1,1,1,1,0,0,0,0,0,0,0,0,0>
2) why do people use emacs? = <0,0,0,0,1,1,1,1,1,0,0,0,0>
3) vim emacs evil mode? = <1,0,0,0,0,0,0,0,1,1,1,0,0>
4) nano best editor = <0,0,0,1,0,0,0,0,0,0,0,1,1>
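The steps above (stopword removal, global dictionary, tf vectors) can be sketched in a few lines of Python. This is only an illustration: the helper names `tokenize` and `tf_vector` are made up here, and stripping the trailing "?" is an assumption about how the slides tokenize.

```python
# Documents from Q1 and the stopword set given on the slide.
docs = {
    1: "vim is the only real editor",
    2: "why do people use emacs?",
    3: "vim emacs evil mode?",
    4: "nano is the best editor",
}
STOPWORDS = {"is", "the"}

def tokenize(text):
    """Lowercase, drop '?', split on whitespace, remove stopwords."""
    words = text.lower().replace("?", "").split()
    return [w for w in words if w not in STOPWORDS]

# Global dictionary in first-seen order, which matches the slide's ordering.
dictionary = []
for doc_id in sorted(docs):
    for term in tokenize(docs[doc_id]):
        if term not in dictionary:
            dictionary.append(term)

def tf_vector(text):
    """Raw term counts, one slot per dictionary term."""
    tokens = tokenize(text)
    return [tokens.count(term) for term in dictionary]

tf = {doc_id: tf_vector(text) for doc_id, text in docs.items()}
for doc_id in sorted(tf):
    print(doc_id, tf[doc_id])
```

A list is used for the dictionary (rather than a set) so that each term keeps a fixed position in every vector.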

Draw the inverted index for this document
collection
vim: (1, 3)
only: (1)
real: (1)
editor: (1, 4)
why: (2)
do: (2)
people: (2)
use: (2)
emacs: (2, 3)
evil: (3)
mode: (3)
nano: (4)
best: (4)

Keep the index sorted by term (and each posting list by document number):
best: (4)
do: (2)
editor: (1, 4)
emacs: (2, 3)
evil: (3)
mode: (3)
nano: (4)
only: (1)
people: (2)
real: (1)
use: (2)
vim: (1, 3)
why: (2)
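A minimal sketch of building this sorted inverted index in Python. The function name `build_inverted_index` is hypothetical, and the input is assumed to be the stopword-filtered documents with punctuation already removed.

```python
from collections import defaultdict

# Stopword-filtered documents from Q1 (punctuation already stripped).
docs = {
    1: "vim only real editor",
    2: "why do people use emacs",
    3: "vim emacs evil mode",
    4: "nano best editor",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            postings[term].add(doc_id)
    # Sort terms alphabetically and each posting list by document id.
    return {term: sorted(postings[term]) for term in sorted(postings)}

index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(f"{term}: ({', '.join(map(str, doc_ids))})")
```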

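Consider the query: vim editor
vim: (1, 3)
editor: (1, 4)
Using the inverted index lets us answer this query quickly: we only read the posting lists of the two query terms instead of scanning every document. Intersecting the lists gives document 1 (contains both terms); their union gives documents 1, 3, and 4 (contain at least one term).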
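One way to see why sorted posting lists make this fast: two sorted lists can be intersected in a single linear pass. The sketch below uses the two posting lists from the query; the function name `intersect` is just an illustrative choice.

```python
def intersect(p1, p2):
    """Merge-style intersection of two sorted posting lists."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

vim = [1, 3]
editor = [1, 4]

both = intersect(vim, editor)            # documents containing both terms
either = sorted(set(vim) | set(editor))  # documents containing at least one
print(both, either)
```

Whether the query means AND or OR is not stated on the slide, so both results are shown.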

What is the IDF of vim? emacs? nano?
vim: log(4/2) = log(2)
emacs: log(4/2) = log(2)
nano: log(4/1) = log(4)
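The same IDF values can be checked numerically. The slide leaves the log base unspecified, so base 2 is assumed here purely for readability (it gives 1, 1, and 2); any fixed base preserves the ordering.

```python
import math

N = 4  # number of documents in the collection

# Document frequencies read off the inverted index.
df = {"vim": 2, "emacs": 2, "nano": 1}

def idf(term):
    """IDF = log(N / df); base 2 is an assumption, not given on the slide."""
    return math.log2(N / df[term])

for term in df:
    print(term, idf(term))
```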

What is the tf-idf vector of document 1?
tf: <1,1,1,1,0,0,0,0,0,0,0,0,0>
idf: vim=log(4/2), only=log(4/1), real=log(4/1), editor=log(4/2)
tf-idf: <log(2), log(4), log(4), log(2), 0,0,0,0,0,0,0,0,0>
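As a quick check, the tf-idf vector is the element-wise product of the tf vector and the per-term IDF. The sketch below hard-codes the dictionary order and document frequencies from Q1 and again assumes log base 2, which the slide does not specify.

```python
import math

dictionary = ["vim", "only", "real", "editor", "why", "do", "people",
              "use", "emacs", "evil", "mode", "nano", "best"]
tf_doc1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Number of documents containing each term (from the inverted index).
df = {"vim": 2, "only": 1, "real": 1, "editor": 2, "why": 1, "do": 1,
      "people": 1, "use": 1, "emacs": 2, "evil": 1, "mode": 1,
      "nano": 1, "best": 1}
N = 4

# tf-idf = tf * log(N / df), one entry per dictionary term.
tfidf_doc1 = [tf * math.log2(N / df[term])
              for tf, term in zip(tf_doc1, dictionary)]
print(tfidf_doc1)
```

The result has 13 entries, one per dictionary term, with non-zero weights only for vim, only, real, and editor.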
Q2: Kendall's Tau

A startup search engine is trying to compare its query results to Google's.
Google's results: ABCXYZ
Startup's results: ABZYXC
They claim their results are comparable. What do you think?
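Number of matching (concordant) pairs, C: 9
Number of non-matching (discordant) pairs, D: 6
tau = (9-6)/((1/2)(6)(5)) = 3/15 = 0.2
OR: (C-D)/(C+D) = (9-6)/(9+6) = 3/15 = 0.2
Tau ranges from -1 to 1, so 0.2 is only weak agreement with Google's ranking.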
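The pair counting can be automated with a short sketch. The function name `kendall_tau` is hypothetical, and it assumes both rankings contain exactly the same items with no ties.

```python
from itertools import combinations

def kendall_tau(reference, candidate):
    """Return (concordant, discordant, tau) for two rankings of the same items."""
    pos = {item: rank for rank, item in enumerate(candidate)}
    concordant = discordant = 0
    # combinations() yields pairs in reference order, so a is ranked above b.
    for a, b in combinations(reference, 2):
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1
    tau = (concordant - discordant) / (concordant + discordant)
    return concordant, discordant, tau

c, d, tau = kendall_tau("ABCXYZ", "ABZYXC")
print(c, d, tau)
```

With 6 items there are (1/2)(6)(5) = 15 pairs in total, so C + D is always 15 and both forms of the formula agree.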
What if the startup was able to change their results to: ABCZYX
Number of matching (concordant) pairs, C: 12
Number of non-matching (discordant) pairs, D: 3
tau = (12-3)/((1/2)(6)(5)) = 9/15 = 0.6
OR: (C-D)/(C+D) = (12-3)/(12+3) = 9/15 = 0.6
Q3: Precision and Recall Question

Consider a query for which there are 100 relevant URLs in the universe. A search
engine computes a ranking that contains a relevant URL in positions 1, 3, 4, 5, 6,
and 8. It has irrelevant URLs in 2, 7, 9, and 10.
What are the precision and recall of this search engine if it emits only its top-10
results?

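What are precision and recall again?
Precision = (relevant results returned) / (total results returned)
Recall = (relevant results returned) / (total relevant results that exist)
https://en.wikipedia.org/wiki/Precision_and_recall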

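Top-10 results (6 relevant URLs returned):
Precision = 6/10 = .6, Recall = 6/100 = .06

For a top-5 cutoff (relevant URLs in positions 1, 3, 4, and 5):
Precision = 4/5 = .8, Recall = 4/100 = .04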
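A small sketch that reproduces both pairs of numbers from the ranking in the question. The function name `precision_recall_at_k` is made up for illustration; the cutoff of 5 is inferred from the .8 and .04 figures.

```python
def precision_recall_at_k(relevant_positions, total_relevant, k):
    """Precision and recall when only the top-k results are emitted."""
    hits = sum(1 for p in relevant_positions if p <= k)
    return hits / k, hits / total_relevant

# Ranks (1-based) that hold a relevant URL; 100 relevant URLs exist overall.
relevant_positions = [1, 3, 4, 5, 6, 8]

p10, r10 = precision_recall_at_k(relevant_positions, 100, 10)
p5, r5 = precision_recall_at_k(relevant_positions, 100, 5)
print(p10, r10)
print(p5, r5)
```

Shrinking the cutoff raises precision here but lowers recall, which is the usual trade-off between the two measures.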
Q4: HITS
What does HITS stand for?
Hyperlink-Induced Topic Search
How does the algorithm work?
It starts with the user's query to create the root set (the pages that match the
query). It then builds the base set by adding pages that link to, or are linked
from, the root set. Once you have this focused subgraph, run the iterative
algorithm to compute hub and authority scores.
What are hubs and authorities?
Hubs are central repositories - they have links to good authorities
Authorities are the sources of information - they are linked to by good hubs
How does HITS differ from PageRank?
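HITS is based on the user's query: it runs on a query-specific subgraph, while PageRank is computed over the whole link graph.
Each node maintains two scores (hub and auth) instead of one.
Each round requires an explicit normalization step.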
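To make the two scores and the normalization step concrete, here is a minimal sketch of the hub/authority update. The tiny graph (`h1`, `h2`, `a1`, `a2`) is entirely hypothetical and stands in for a base set; the fixed iteration count and the L2 normalization are also assumptions rather than anything specified in these slides.

```python
import math

def hits(links, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[dst] for dst in links.get(p, [])) for p in pages}
        # Explicit normalization so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values()))
        h_norm = math.sqrt(sum(v * v for v in hub.values()))
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical base set: two hub-like pages pointing at two sources.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(links)
print(sorted(auth.items()))
print(sorted(hub.items()))
```

In this toy graph `a1` ends up with the highest authority score because both hubs link to it, and `h1` is the better hub because it links to both authorities.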
Q5: PageRank
What does the .85 value for d represent?
d is the damping factor: the fraction of the time that a user moves on by clicking
a link on the current page. So, 85% of the time they follow links, and 15% of the
time they navigate to a new page directly.
If we assume most internet users are mobile, should we raise or lower the value of
d?
Probably raise. Mobile users are more likely to follow links, and less likely to
navigate to new pages (because navigating to new pages requires typing in a
URL)
What are some issues with PageRank as a metric of page quality?
- Link farms, spam bots, etc. can skew rankings
- Links may not be meant as an endorsement, e.g. social media shares
- Ajax and JavaScript can make traditional surfing difficult
- Content can be behind a login, e.g. the Facebook feed is not searchable
Q5: PageRank: An example

[Figure: link graph over pages A, B, C, D, E]
Assume d=.85

Initial scores:
A = .2
B = .2
C = .2
D = .2
E = .2

After iteration 1:
A = .15/5 + .85*(.2/2) = .115
B = .15/5 + .85*(.2/2 + .2) = .285
C = .15/5 + .85*(.2/2) = .115
D = .15/5 + .85*(.2/2 + .2/2) = .2
E = .15/5 + .85*(.2 + .2/2) = .285

After iteration 2:
A = .15/5 + .85*(.285/2) = .151
B = .15/5 + .85*(.115/2 + .2) = .249
C = .15/5 + .85*(.115/2) = .079
D = .15/5 + .85*(.115/2 + .285/2) = .2
E = .15/5 + .85*(.285 + .115/2) = .321
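The two rounds of arithmetic can be replayed with a short sketch. The slide's graph diagram did not survive in the text, so the edge list below is an assumption: it is one graph whose in-links and out-degrees are consistent with every term in the calculations above (the roles of A and C as sources may be swapped in the real figure). The function name `pagerank_step` is also hypothetical.

```python
D = 0.85  # damping factor from the slide

# ASSUMED edge list (not given in the text); chosen to match the arithmetic.
links = {
    "A": ["B", "C"],
    "B": ["E"],
    "C": ["D", "E"],
    "D": ["B"],
    "E": ["A", "D"],
}

def pagerank_step(scores, links, d=D):
    """One synchronous update: (1-d)/N plus d times the shares from in-links."""
    n = len(scores)
    new = {page: (1 - d) / n for page in scores}
    for src, targets in links.items():
        share = scores[src] / len(targets)  # score is split across out-links
        for dst in targets:
            new[dst] += d * share
    return new

scores = {page: 0.2 for page in links}  # uniform start, 1/5 each
for i in range(1, 3):
    scores = pagerank_step(scores, links)
    print(i, {p: round(s, 3) for p, s in scores.items()})
```

Every page here has at least one out-link, so the scores keep summing to 1 after each round without any extra handling for dangling pages.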
