Information Retrieval
Jaime Arguello
INLS 509: Information Retrieval
jarguell@email.unc.edu
Outline
Thursday, February 21, 13
Motivating Example
speech-to-text conversion for mobile search
jaime arguello
information retrieval
at the university of
north carolina at
chapel hill
Let's see what the internets say!
This example raises some questions:
!?
For a valid probability distribution, two conditions must be
satisfied: every probability is between 0 and 1, and the
probabilities of all possible outcomes sum to 1.
A bag of colored balls defines a probability distribution over
different outcomes:
I reach into the bag and pull out a ball at random.
P(RED) = 0.5
P(BLUE) = 0.25
P(ORANGE) = 0.25
If each outcome is independent of previous outcomes, then
the probability of a sequence of outcomes is calculated by
multiplying the individual probabilities.
P(RED) = 0.5
P(BLUE) = 0.25
P(ORANGE) = 0.25
For example, drawing a 0.25-probability ball, then a
0.50-probability ball, then 0.25, then 0.50:

P(sequence) = 0.25 × 0.50 × 0.25 × 0.50 = 0.015625
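The multiplication rule above can be sketched in a few lines of Python; the colors and probabilities are the ones from the bag example, and the specific four-ball sequence is an illustrative choice.

```python
# Independent draws: the probability of a sequence is the product of
# the individual outcome probabilities.
p = {"RED": 0.5, "BLUE": 0.25, "ORANGE": 0.25}

def sequence_probability(outcomes, p):
    """Multiply the per-outcome probabilities of an independent sequence."""
    prob = 1.0
    for outcome in outcomes:
        prob *= p[outcome]
    return prob

print(sequence_probability(["BLUE", "RED", "ORANGE", "RED"], p))
# 0.25 * 0.50 * 0.25 * 0.50 = 0.015625
```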
jaime arguello
information retrieval
at the university of
north carolina at
chapel hill
!?
Sample text (20 term occurrences):
university ×2, of ×4, north ×2, carolina ×1, at ×4,
chapel ×3, hill ×4

P(university) = 2/20
P(of) = 4/20
P(north) = 2/20
P(carolina) = 1/20
P(at) = 4/20
P(chapel) = 3/20
P(hill) = 4/20
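This counting-based (maximum-likelihood) estimation is a one-liner once the terms are counted; a minimal sketch using the 20-token sample above:

```python
from collections import Counter

# Maximum-likelihood language model estimation:
# P(term) = count(term) / total number of tokens.
text = ("university university of of of of north north carolina "
        "at at at at chapel chapel chapel hill hill hill hill")
tokens = text.split()
counts = Counter(tokens)
N = len(tokens)  # 20 tokens

lm = {term: tf / N for term, tf in counts.items()}
print(lm["of"])        # 4/20 = 0.2
print(lm["carolina"])  # 1/20 = 0.05
```

The probabilities sum to 1 by construction, which is what makes this a valid distribution.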
Given a sample of text, we can estimate a language
model. And, given a language model, we can assign a
probability to any piece of text.
IMDB Corpus
language model estimation (top 20 terms)
P(term) = tf / 36989629 (total term occurrences in the collection)

term    tf        P(term)     term        tf       P(term)
the     1586358   0.0429      year        250151   0.0068
?       854437    0.0231      he          242508   0.0066
and     822091    0.0222      movie       241551   0.0065
to      804137    0.0217      her         240448   0.0065
of      657059    0.0178      artist      236286   0.0064
in      472059    0.0128      character   234754   0.0063
is      395968    0.0107      cast        234202   0.0063
?       390282    0.0106      plot        234189   0.0063
his     328877    0.0089      for         207319   0.0056
with    253153    0.0068      that        197723   0.0053
Outline
Language Models
A language model is a probability distribution defined
over a vocabulary of terms.
P(RED) = 0.5
P(BLUE) = 0.25
P(ORANGE) = 0.25
Topic Models
We can think of a topic as being defined by a language
model: a high probability of seeing certain words and a
low probability of seeing others.

movies:   P(RED) = 0.50  P(BLUE) = 0.25  P(ORANGE) = 0.25
politics: P(RED) = 0.05  P(BLUE) = 0.00  P(ORANGE) = 0.95
sports:   P(RED) = 0.90  P(BLUE) = 0.10  P(ORANGE) = 0.00
music:    P(RED) = 0.00  P(BLUE) = 0.50  P(ORANGE) = 0.50
nature:   P(RED) = 0.10  P(BLUE) = 0.80  P(ORANGE) = 0.10
...
Topic Models
??? vs. ???
[Bar chart: probability (0.15-0.60) of the terms actress,
movie, party, political, and state under two unnamed
language models]
Topic Models
movies vs. politics
[Bar chart: probability of the terms actress, movie,
party, political, and state under the movies and politics
language models]
Topical Relevance
Many factors affect whether a document satisfies a
particular user's information need.
Topical relevance: the document is on the same topic as
the query.
User relevance: everything else!
Remember, our goal right now is to predict topical
relevance.
[Figure: the topic language models from before (movies,
politics, sports, music, nature, ...) and Document D232,
shown as a sequence of colored balls]
What is this document about?
The language model of document D is estimated by maximum
likelihood: P(term|D) = tft,D / ND.
term          tft,D  ND   P(term|D)
?             22     420  0.05238
rocky         19     420  0.04524
to            18     420  0.04286
the           17     420  0.04048
is            11     420  0.02619
and           10     420  0.02381
in            10     420  0.02381
for           7      420  0.01667
his           7      420  0.01667
he            6      420  0.01429
creed         5      420  0.01190
philadelphia  5      420  0.01190
has           4      420  0.00952
pet           4      420  0.00952
boxing        4      420  0.00952
up            4      420  0.00952
an            4      420  0.00952
boxer         4      420  0.00952
?             3      420  0.00714
balboa        3      420  0.00714
score(Q, D) = P(Q|D) = ∏_{i=1..n} P(q_i|D)
Let θ_D denote the language model associated with
document D.
You can think of θ_D as a black box: given a word, it
outputs a probability, e.g. rocky → 0.04524.
Query-Likelihood Model
back to our analogy

D1: P(RED) = 0.50  P(BLUE) = 0.25  P(ORANGE) = 0.25
D2: P(RED) = 0.25  P(BLUE) = 0.25  P(ORANGE) = 0.50
D3: P(RED) = 0.90  P(BLUE) = 0.10  P(ORANGE) = 0.00
D5: P(RED) = 0.50  P(BLUE) = 0.50  P(ORANGE) = 0.00
D6: P(RED) = 0.10  P(BLUE) = 0.80  P(ORANGE) = 0.10

Query = [a sequence of colored balls]
Which would be the top-ranked document, and what would
its score be?
P(Q|D1) = 0.001
P(Q|D2) = 0.001
P(Q|D3) = 0.0234
P(Q|D4) = 0.621
P(Q|D5) = 0.00345
 :
P(Q|DM) = 0.3453
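The ranking step can be sketched as follows. The document models are the ones from the analogy above, but the query sequence is an invented example, since the slide shows the query only as colored-ball icons.

```python
# Query-likelihood ranking in the bag-of-balls analogy: score each
# "document" (bag) by the probability that it generated the query.
docs = {
    "D1": {"RED": 0.50, "BLUE": 0.25, "ORANGE": 0.25},
    "D2": {"RED": 0.25, "BLUE": 0.25, "ORANGE": 0.50},
    "D3": {"RED": 0.90, "BLUE": 0.10, "ORANGE": 0.00},
    "D5": {"RED": 0.50, "BLUE": 0.50, "ORANGE": 0.00},
    "D6": {"RED": 0.10, "BLUE": 0.80, "ORANGE": 0.10},
}

def score(query, model):
    """P(Q|D): product of per-outcome probabilities."""
    prob = 1.0
    for q in query:
        prob *= model.get(q, 0.0)
    return prob

query = ["RED", "RED", "ORANGE"]  # hypothetical query sequence
ranking = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
print(ranking)
```

Note that any document assigning probability 0 to one query outcome (here D3 and D5, with P(ORANGE) = 0) scores exactly zero, which foreshadows the smoothing discussion below.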
score(Q, D) = P(Q|D) = ∏_{i=1..n} P(q_i|D)
score(Q, D) = P(Q|D) = ∏_{i=1..n} P(q_i|D)
There are (at least) two issues with this scoring function
What are they?
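One of the issues can be seen directly in floating-point arithmetic: multiplying many small probabilities underflows to zero. A short sketch, using made-up values for the per-term probability and the number of factors:

```python
import math

# Multiplying many small probabilities underflows to 0.0 in floating
# point. Summing log-probabilities avoids this, and preserves ranking
# because log is monotonic. (The other issue, zero probabilities for
# unseen terms, is addressed by smoothing.)
p = 1e-5   # a small per-term probability
n = 200    # number of factors in the product

product = 1.0
for _ in range(n):
    product *= p
print(product)    # 0.0 -- underflow

log_score = n * math.log(p)
print(log_score)  # about -2302.6, perfectly representable
```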
Outline
How can we avoid giving an unseen outcome a zero
probability? What else can we do?
P(RED) = (10/20)
P(BLUE) = (5/20)
P(ORANGE) = (5/20)
P(YELLOW) = (0/20)
P(GREEN) = (0/20)
Add-One Smoothing
We could add one ball of each color to the bag
This gives a small probability to unobserved outcomes
P(RED) = (11/25)
P(BLUE) = (6/25)
P(ORANGE) = (6/25)
P(YELLOW) = (1/25)
P(GREEN) = (1/25)
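A sketch of the add-one computation, using the counts from the bag example (10 red, 5 blue, 5 orange, and two never-observed colors):

```python
# Add-one (Laplace) smoothing: add one ball of each color, including
# colors never observed, then renormalize.
counts = {"RED": 10, "BLUE": 5, "ORANGE": 5, "YELLOW": 0, "GREEN": 0}
V = len(counts)           # 5 possible colors
N = sum(counts.values())  # 20 observed balls

smoothed = {c: (tf + 1) / (N + V) for c, tf in counts.items()}
print(smoothed["RED"])     # 11/25 = 0.44
print(smoothed["YELLOW"])  # 1/25 = 0.04
```

Dividing by N + V (rather than N) keeps the smoothed probabilities summing to 1.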
A document is a sample of the terms in the author's mind.
High-frequency words (e.g., rocky, apollo, boxing) are
important.
Low-frequency words (e.g., shot, befriended, checks) are
somewhat arbitrary: the author chose these, but could have
easily chosen others.
So, we want to allocate some probability to terms unobserved
in each document. Conceptually!
In practice, a more effective approach is to smooth the
probabilities of document D using the entire collection.
Let θ_C denote the language model associated with the
entire collection.
Using linear interpolation, the probability assigned to a
query term mixes the two models:
λ P(q_i|D) + (1 − λ) P(q_i|C)
The scoring function still multiplies query-term
probabilities. However, the probabilities are obtained
using both the document and the collection language models:

score(Q, D) = ∏_{i=1..n} ( λ P(q_i|D) + (1 − λ) P(q_i|C) )
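The smoothed scoring function above can be sketched as follows. The term counts in the usage example are illustrative; only the document length 420 and the collection size 36,989,629 are taken from the earlier slides.

```python
# Query likelihood with linear-interpolation (Jelinek-Mercer) smoothing:
# score(Q, D) = prod_i ( lam * P(q_i|D) + (1 - lam) * P(q_i|C) ).
def score(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    prob = 1.0
    for q in query:
        p_d = doc_tf.get(q, 0) / doc_len    # maximum-likelihood P(q|D)
        p_c = coll_tf.get(q, 0) / coll_len  # collection model P(q|C)
        prob *= lam * p_d + (1 - lam) * p_c
    return prob

doc_tf = {"rocky": 19, "boxing": 4}                       # hypothetical document
coll_tf = {"rocky": 1000, "boxing": 2000, "apollo": 500}  # hypothetical collection
s = score(["rocky", "apollo"], doc_tf, 420, coll_tf, 36989629)
print(s > 0)  # True: "apollo" never occurs in D, yet the score is non-zero
```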
Query: apple ipad

term occurrences:
        D1 (ND1=50)   D2 (ND2=50)
apple   2/50 = 0.04   3/50 = 0.06
ipad    3/50 = 0.06   2/50 = 0.04

Which document should rank higher: the one that favors
apple, or the one that favors ipad?
In the collection as a whole, apple occurs 200 times and
ipad occurs 100 times out of 1,000,000 term occurrences.
Therefore:

P(apple|C) = 200 / 1000000 = 0.0002
P(ipad|C)  = 100 / 1000000 = 0.0001
score(Q, D) = ∏_{i=1..n} ( λ P(q_i|D) + (1 − λ) P(q_i|C) ),  with λ = 0.50

              D1 (ND1=50)   D2 (ND2=50)
P(apple|D)    0.04          0.06
P(apple|C)    0.0002        0.0002
score(apple)  0.0201        0.0301
P(ipad|D)     0.06          0.04
P(ipad|C)     0.0001        0.0001
score(ipad)   0.03005       0.02005
total score   0.000604005   0.000603505
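The table's arithmetic can be reproduced directly; all numbers below come from the worked example above.

```python
# Reproducing the apple/ipad example with lambda = 0.5 and two
# documents of length 50.
lam = 0.5
p_c = {"apple": 200 / 1_000_000, "ipad": 100 / 1_000_000}
p_d = {
    "D1": {"apple": 2 / 50, "ipad": 3 / 50},
    "D2": {"apple": 3 / 50, "ipad": 2 / 50},
}

scores = {}
for doc, probs in p_d.items():
    s = 1.0
    for term in ["apple", "ipad"]:
        s *= lam * probs[term] + (1 - lam) * p_c[term]
    scores[doc] = s

print(scores)  # D1 ~ 0.000604005, D2 ~ 0.000603505
```

D1 wins: matching the rarer term (ipad) more often outweighs matching the more common term (apple), which is exactly the IDF-like effect discussed next.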
Smoothing avoids zero query-term probabilities.
It also introduces an IDF-like scoring of documents.
Mathematical proof!?
The IDF-like Effect of Jelinek-Mercer Smoothing
Query Likelihood Retrieval Model with linear interpolation smoothing

P(q|d) = ∏_{qi ∈ q} ( λ P_MLE(qi|d) + (1 − λ) P_MLE(qi|C) )          [mixture model]

       = ∏_{qi ∈ q} ( λ P_MLE(qi|d) / ((1 − λ) P_MLE(qi|C)) + 1 )
         × ∏_{qi ∈ q} (1 − λ) P_MLE(qi|C)                            [multiply by 1, recombine]

 rank
  =    ∏_{qi ∈ q} ( λ P_MLE(qi|d) / ((1 − λ) P_MLE(qi|C)) + 1 )      [drop the document-independent constant]

The numerator rewards high term frequency in d [tf]; dividing by
P_MLE(qi|C) rewards terms that are rare in the collection [idf].

© 2008, 2009, Jamie Callan
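The rank-equivalence claimed in the derivation can be checked numerically. The sketch below reuses the apple/ipad numbers from the earlier example and compares the full mixture score with the constant-dropped form.

```python
import math

# Dividing out the document-independent factor prod_i (1-lam)*P(q_i|C)
# changes the scores but not the ranking.
lam = 0.5
p_c = {"apple": 0.0002, "ipad": 0.0001}
docs = {"D1": {"apple": 0.04, "ipad": 0.06},
        "D2": {"apple": 0.06, "ipad": 0.04}}
query = ["apple", "ipad"]

def full(p_d):
    return math.prod(lam * p_d[q] + (1 - lam) * p_c[q] for q in query)

def rank_equivalent(p_d):
    return math.prod(lam * p_d[q] / ((1 - lam) * p_c[q]) + 1 for q in query)

a = sorted(docs, key=lambda d: full(docs[d]), reverse=True)
b = sorted(docs, key=lambda d: rank_equivalent(docs[d]), reverse=True)
print(a == b)  # True: both forms produce the same ordering
```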