Email:yangjw@pku.edu.cn
1997: 3.2
1999: 8 (15TB)
2002: 20 (Google)
2004: 43 (Google)
2006: 100 (Google)
for books,
2TB/year for newspapers,
1TB/year for periodicals,
19TB/year for office docs.
68%
51%
Information Retrieval
information search
IR, TREC: SDR (Speech Document Retrieval) Track; Video Track
(1950s)
(1960s)
(1970s, 1980s)
Web (1990s)
TR(Text Retrieval) ,
TRIR
[Figure: search engine architecture with the components User, The Web, Spider, Inverted Index, Search Engine, Library, etc.]
: 80%
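[Figure: Venn diagram of the {relevant} and {retrieved} document sets]
Precision:
$$\text{precision} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{retrieved}\}|}$$
Recall:
$$\text{recall} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{relevant}\}|}$$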
$$\text{Recall} = \frac{|\text{RelRetrieved}|}{|\text{Rel in Collection}|}$$
$$\text{Precision} = \frac{|\text{RelRetrieved}|}{|\text{Retrieved}|}$$
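[Figure: the retrieved set {B, D, F, W, Y} overlapping the relevant set {A, ..., J}; W and Y lie outside the relevant set]
{relevant} = {A, B, C, D, E, F, G, H, I, J}, size 10
{retrieved} = {B, D, F, W, Y}, size 5
{relevant} ∩ {retrieved} = {B, D, F}, size 3
precision = 3/5 = 0.6
recall = 3/10 = 0.3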
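The worked example can be checked with a few lines of Python. This is a minimal sketch; the function name `precision_recall` is an illustrative choice, not something the slides define.

```python
# Document sets taken from the worked example.
relevant = set("ABCDEFGHIJ")
retrieved = {"B", "D", "F", "W", "Y"}


def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two sets of document ids."""
    hits = relevant & retrieved  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


precision, recall = precision_recall(relevant, retrieved)
print(precision, recall)
```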
[Figure: precision-recall plot; horizontal axis: recall, vertical axis: precision]
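F-measure
$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$$
P: precision, R: recall. With $\alpha > 0.5$, precision is weighted more heavily than recall.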
P@N: precision over the top N results
MRR: mean reciprocal rank
S@10: success within the top 10 results
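A small sketch of the F-measure as a weighted harmonic mean of precision and recall. The weight name `alpha` and its default of 0.5 (which gives the balanced F1) are assumptions, since the symbol did not survive in the slide text.

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall.

    alpha is an assumed name for the weight on precision;
    alpha = 0.5 yields the balanced F1 = 2PR / (P + R).
    """
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


# Precision 0.6 and recall 0.3, as in the set example earlier in the deck.
f1 = f_measure(0.6, 0.3)
f_precision_heavy = f_measure(0.6, 0.3, alpha=0.9)
print(f1, f_precision_heavy)
```

Raising `alpha` above 0.5 pulls the score toward the precision value, which is why the second number is larger here.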
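IR model
An IR model is a quadruple < D, Q, F, R(qi, dj) >, where:
D: the set of representations of the documents in the collection
Q: the set of representations of the user queries
F: a framework for modeling documents, queries, and their relationships
R(qi, dj): a ranking function that assigns a score to the query qi and the document dj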
(Hypertext)
Web
Chapter
Section
Paragraph
Term weights: {0, 1}
and, or, not
DNF
$$q = k_a \wedge (k_b \vee \neg k_c)$$
p-norm
VSM
Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
[Figure: D1 and D2 drawn as vectors in the term space spanned by T1, T2, T3]
Is D1 or D2 more similar to Q?
How to measure the degree of similarity? Distance? Angle? Projection?
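One common answer to the question above is the angle between vectors, measured by the cosine. The sketch below compares D1 and D2 against a query; note that the query vector Q = 0T1 + 0T2 + 2T3 is a hypothetical choice made for illustration, because the slide text does not preserve Q.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


D1 = [2, 3, 5]  # D1 = 2T1 + 3T2 + 5T3
D2 = [3, 7, 1]  # D2 = 3T1 + 7T2 + T3
Q = [0, 0, 2]   # hypothetical query, not taken from the slides

sim_d1 = cosine(D1, Q)
sim_d2 = cosine(D2, Q)
print(f"cos(D1, Q) = {sim_d1:.3f}, cos(D2, Q) = {sim_d2:.3f}")
```

Under this assumed query, D1 ranks above D2 because most of its weight sits on T3.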
(GVSM)
$$\text{sim}(q, d) = \sum_i \max_j \, \text{sim}(t_{q_i}, t_{d_j})$$
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
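$$P(d_j) = P(k_1) \cdots P(k_t)$$
$k_i$: index term
$$\text{sim}(d_j, q) \sim \frac{\left(\prod_{g_i(d_j)=1} P(k_i \mid R)\right) \left(\prod_{g_i(d_j)=0} P(\bar{k}_i \mid R)\right)}{\left(\prod_{g_i(d_j)=1} P(k_i \mid \bar{R})\right) \left(\prod_{g_i(d_j)=0} P(\bar{k}_i \mid \bar{R})\right)}$$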
(BM25)
$$w = \log\frac{N - df + 0.5}{df + 0.5} \cdot \frac{(k_1 + 1)\, tf}{K + tf} \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf}$$
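A minimal sketch of a BM25 term weight in its usual Okapi form, assuming that is the variant the slide shows. The length normalization K = k1((1 - b) + b dl/avdl), the parameter defaults, and all argument names are assumptions, not values from the slides.

```python
import math


def bm25_term_weight(N, df, tf, qtf, dl, avdl, k1=1.2, b=0.75, k3=8.0):
    """Weight of one query term for one document.

    N: number of documents, df: document frequency of the term,
    tf / qtf: term frequency in the document / in the query,
    dl / avdl: document length / average document length.
    k1, b, k3 are tuning constants (assumed defaults).
    """
    idf = math.log((N - df + 0.5) / (df + 0.5))
    K = k1 * ((1 - b) + b * dl / avdl)  # document length normalization
    doc_part = (k1 + 1) * tf / (K + tf)
    query_part = (k3 + 1) * qtf / (k3 + qtf)
    return idf * doc_part * query_part


# Hypothetical collection statistics, for illustration only.
w = bm25_term_weight(N=1000, df=10, tf=3, qtf=1, dl=100, avdl=100)
w_more_tf = bm25_term_weight(N=1000, df=10, tf=6, qtf=1, dl=100, avdl=100)
print(w, w_more_tf)
```

The document part saturates as tf grows, so doubling tf raises the weight by less than a factor of two.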
(Unigram)
N (N-gram)
$$P(W_1 W_2 \ldots W_n) = P(W_1)\, P(W_2 \mid W_1) \cdots P(W_n \mid W_{n-1} W_{n-2} \ldots W_1)$$
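The chain rule becomes tractable once each word is conditioned only on a short history. The sketch below uses a bigram model (history of one word) with maximum-likelihood estimates; the toy corpus and the `<s>` start marker are assumptions made for illustration.

```python
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and the contexts (first words of bigrams) they start from."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams


def sentence_prob(sentence, contexts, bigrams):
    """P(W1..Wn) approximated as the product of P(Wi | Wi-1)."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0  # unseen history: the unsmoothed estimate is zero
        prob *= bigrams[(prev, cur)] / contexts[prev]
    return prob


corpus = ["information retrieval", "information search", "text retrieval"]
contexts, bigrams = train_bigram(corpus)
p_seen = sentence_prob("information retrieval", contexts, bigrams)
p_unseen = sentence_prob("text search", contexts, bigrams)
print(p_seen, p_unseen)
```

The zero for the unseen word pair shows why practical language models add smoothing.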
IR
Term
(,)
Q' = F[Q, Dret]
F =
such as:
$$W(t, Q') = W(t, Q) + W(t, D_{rel}) - W(t, D_{irr})$$
t ∈ Q
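A sketch of this kind of query update in the style of Rocchio feedback. The coefficients `alpha`, `beta`, `gamma`, the averaging over each document set, and the dropping of non-positive weights are assumptions layered on top of the formula above; setting all three coefficients to 1 with one document per set reproduces the formula as written.

```python
def rocchio_update(query, relevant_docs, irrelevant_docs,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Return updated term weights for the query.

    query and each document are dicts mapping term -> weight.
    alpha, beta, gamma are assumed mixing coefficients.
    """
    def centroid(docs, term):
        if not docs:
            return 0.0
        return sum(d.get(term, 0.0) for d in docs) / len(docs)

    terms = set(query)
    for doc in relevant_docs + irrelevant_docs:
        terms |= set(doc)

    updated = {}
    for term in terms:
        weight = (alpha * query.get(term, 0.0)
                  + beta * centroid(relevant_docs, term)
                  - gamma * centroid(irrelevant_docs, term))
        if weight > 0:  # keep only terms with positive weight
            updated[term] = weight
    return updated


# Hypothetical judged documents for the ambiguous query "jaguar".
query = {"jaguar": 1.0}
relevant_docs = [{"jaguar": 1.0, "car": 2.0}]
irrelevant_docs = [{"jaguar": 1.0, "cat": 2.0}]
updated = rocchio_update(query, relevant_docs, irrelevant_docs)
print(updated)
```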
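[Figure: relevance feedback loop. The Query goes to the Retrieval Engine, which searches the Document collection and returns ranked Results (d1 3.5, d2 2.4, ..., dk 0.5). Judgments on the top 10 results (d1 +, d2 +, d3 +, ..., dk ...) are passed to Feedback, which produces the Updated query.]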
(LSA)
(LSA)
singular value decomposition (SVD)
KK
Topic model
(LSA)
LSI
,
, LSI
10%--30%.
, LSI
/
(LSA)
Query Terms:
- Insulation
- Joint
(LSA)
term-by-document matrix
U: concept-by-term matrix
V: concept-by-document matrix
S: elements assign weights to concepts
Term-document (t/d) frequency matrix:
t/d    d1   d2   d3   d4   d5   d6
t1    322   85   35   69   15  320
t2    361   90   76   57   13  370
t3     25   33  160   48  221   26
t4     30  140   70  201   16   35
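1. Build the frequency matrix.
2. Decompose the frequency matrix by SVD into three matrices U, S, V. U and V are orthogonal (U^T U = I), and S is a K×K diagonal matrix of singular values.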
3.
dSVD
4.
SVD
A
XY
A2-nnBC
X
A2-nnBD
SVD
XYX1Y1
XY
A2-nnBC
X1
A2-nnBC
(LSA)
$$A = U \Sigma V^T$$
U: orthogonal m×r matrix whose columns are the left singular vectors of A (U^T U = I; the rows of U correspond to terms)
Σ: r×r diagonal matrix whose diagonal holds the singular values of A in descending order (σ1 ≥ σ2 ≥ ... ≥ σr)
V: orthogonal n×r matrix whose columns are the right singular vectors of A (V^T V = I; the rows of V correspond to documents)
(LSA)
$$A_k = U_k \Sigma_k V_k^T$$
where
U_k is the m×k matrix whose columns are the first k left singular vectors of A
Σ_k is the k×k diagonal matrix whose diagonal is formed by the k leading singular values of A
V_k is the n×k matrix whose columns are the first k right singular vectors of A
Rows of U_k = terms
Rows of V_k = documents
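The rank-k approximation can be sketched with NumPy's `numpy.linalg.svd`. The matrix below is the 4×6 term frequency table shown earlier in the deck; the choice k = 2 and the helper name `truncated_svd` are illustrative assumptions.

```python
import numpy as np


def truncated_svd(A, k):
    """Return (U_k, s_k, V_k, A_k, s) for the rank-k approximation of A."""
    # full_matrices=False gives the compact SVD: U is m x r, Vt is r x n.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]      # first k left singular vectors (rows = terms)
    s_k = s[:k]         # k leading singular values, already in descending order
    V_k = Vt[:k, :].T   # first k right singular vectors (rows = documents)
    A_k = U_k @ np.diag(s_k) @ V_k.T
    return U_k, s_k, V_k, A_k, s


# Term frequency matrix: rows t1..t4, columns d1..d6.
A = np.array([
    [322, 85, 35, 69, 15, 320],
    [361, 90, 76, 57, 13, 370],
    [25, 33, 160, 48, 221, 26],
    [30, 140, 70, 201, 16, 35],
], dtype=float)

U_k, s_k, V_k, A_k, s = truncated_svd(A, k=2)
print("singular values:", np.round(s, 2))
print("approximation error:", np.linalg.norm(A - A_k))
```

The Frobenius norm of the error equals the root of the sum of squares of the discarded singular values, which is what makes truncation a controlled approximation.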
SVD Example
D (10 x 2):
   2.9002   3.6790
   4.0860   5.2366
   1.9954   3.3687
   3.5069   1.6748
   4.4620   2.7684
  -2.9444  -4.6447
  -4.1132  -4.7043
  -3.6208  -5.0181
  -3.0558  -4.1821
  -6.1204  -2.4790
U (10 x 2):
  -0.2750  -0.1242
  -0.3896  -0.1846
  -0.2247  -0.2369
  -0.2150   0.3514
  -0.3005   0.3318
   0.3177   0.2906
   0.3682   0.0833
   0.3613   0.2319
   0.3027   0.1861
   0.3563  -0.6935
S (2 x 2):
  16.9491   0
   0        3.8491
V^T (2 x 2):
  -0.6960  -0.7181
   0.7181  -0.6960
SVD for IR
[Figure: SVD of the Terms × Documents matrix; through U, each term (for example car and auto) is represented as a k-dim vector]
SVDd-
k
B+R
$$\tilde{d} = d^T U_k$$
$$(\tilde{d}_1, \tilde{d}_2) = \sum_{i,j} (d_1)_i \, (U_k U_k^T)_{ij} \, (d_2)_j$$
pLSA
Probability
pLSA
EM (expectation-maximization):
E: estimate
M: re-estimate parameters
pLSA
document
pLSA
document i
(1)
Topic model
Topic model
$$p(w \mid d) = \sum_z p(w \mid z)\, p(z \mid d)$$
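A tiny numeric sketch of this mixture: the probability of a word in a document is a topic-weighted sum of per-topic word probabilities. Every number, word, and topic name below is a hypothetical value chosen for illustration.

```python
# p(w | z): one word distribution per topic (hypothetical values).
p_w_given_z = {
    "z1": {"car": 0.5, "auto": 0.4, "tree": 0.1},
    "z2": {"car": 0.1, "auto": 0.1, "tree": 0.8},
}
# p(z | d): topic mixture of a single document d (hypothetical values).
p_z_given_d = {"z1": 0.7, "z2": 0.3}


def word_prob(word):
    """p(w | d) = sum over topics z of p(w | z) * p(z | d)."""
    return sum(p_w_given_z[z].get(word, 0.0) * p_z_given_d[z]
               for z in p_z_given_d)


p_w_given_d = {w: word_prob(w) for w in ("car", "auto", "tree")}
print(p_w_given_d)
```

Because each p(w | z) and p(z | d) sums to one, the resulting p(w | d) is again a proper distribution over words.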
LDA
pLSA
doc~topic
LDA
pLSA
LDAdoc~topic
model
doc~topic
LDA
LDA assumes the following generative process
for each document w in a corpus D :
1. Choose N ~ Poisson(ξ) (N: the number of words in the document).
2. Choose θ ~ Dirichlet(α) (θ: a k-dimensional vector, k: the number of topics).
3.
(a)
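The generative process can be simulated with the standard library alone. In this sketch the per-word step (draw a topic from θ, then draw a word from that topic's word distribution) follows the usual formulation of LDA and is an assumption here, since the slide text breaks off at step 3; the vocabulary, `alpha`, `beta`, and `xi` values are hypothetical.

```python
import math
import random


def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for small lam."""
    threshold = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def sample_dirichlet(rng, alpha):
    """Normalize independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(rng, alpha, beta, vocab, xi):
    n_words = sample_poisson(rng, xi)      # step 1: document length N
    theta = sample_dirichlet(rng, alpha)   # step 2: topic mixture theta
    topics, words = [], []
    for _ in range(n_words):               # step 3 (assumed): one topic and word per position
        z = rng.choices(range(len(alpha)), weights=theta)[0]
        w = rng.choices(vocab, weights=beta[z])[0]
        topics.append(z)
        words.append(w)
    return theta, topics, words


vocab = ["human", "computer", "graph", "trees"]
alpha = [0.5, 0.5]                # hypothetical Dirichlet parameter, k = 2 topics
beta = [[0.5, 0.5, 0.0, 0.0],     # hypothetical p(word | topic 0)
        [0.0, 0.0, 0.5, 0.5]]     # hypothetical p(word | topic 1)
rng = random.Random(0)
theta, topics, words = generate_document(rng, alpha, beta, vocab, xi=8)
print(theta, words)
```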
LDA
,
EM
()
Gibbs Sampling.
9 documents:
c1 Human machine interface for Lab ABC computer application
c2 A survey of user opinion of computer system response time
c3 The EPS user interface management system
c4 System and human system engineering testing of EPS
c5 Relations of user-perceived response time to error measurement
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: Widths of trees and well-quasi-ordering
m4 Graph minors: A survey
12 terms × 9 documents:
            c1  c2  c3  c4  c5  m1  m2  m3  m4
human        1   0   0   1   0   0   0   0   0
interface    1   0   1   0   0   0   0   0   0
computer     1   1   0   0   0   0   0   0   0
user         0   1   1   0   1   0   0   0   0
system       0   1   1   2   0   0   0   0   0
response     0   1   0   0   1   0   0   0   0
time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
survey       0   1   0   0   0   0   0   0   1
trees        0   0   0   0   0   1   1   1   0
graph        0   0   0   0   0   0   1   1   1
minors       0   0   0   0   0   0   0   1   1
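$$\text{Sim}(x, y) = x \cdot y = \sum_{k=1}^{t} x_k y_k$$
$$M = A^T A$$
M(i, j) is the inner product of documents i and j (columns i and j of A).
Normalizing by the vector lengths gives the cosine:
$$\text{Sim}(x, y) = \frac{x \cdot y}{|x|\,|y|} = \frac{\sum_{k=1}^{t} x_k y_k}{\sqrt{\sum_{k=1}^{t} x_k^2}\, \sqrt{\sum_{k=1}^{t} y_k^2}} = \frac{M(i, j)}{\sqrt{M(i, i)\, M(j, j)}}$$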
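$$\text{Sim}(x, y) = \frac{x \cdot y}{|x|\,|y|} = \frac{\sum_{k=1}^{t} x_k y_k}{\sqrt{\sum_{k=1}^{t} x_k^2}\, \sqrt{\sum_{k=1}^{t} y_k^2}}$$
Cosine similarity between documents in the original term space (blank cells in the slide are 0, the diagonal is 1):
      c1    c2    c3    c4    c5    m1    m2    m3    m4
c1    1     0.24  0.29  0.24  0     0     0     0     0
c2    0.24  1     0.41  0.33  0.71  0     0     0     0.24
c3    0.29  0.41  1     0.61  0.29  0     0     0     0
c4    0.24  0.33  0.61  1     0     0     0     0     0
c5    0     0.71  0.29  0     1     0     0     0     0
m1    0     0     0     0     0     1     0.71  0.58  0
m2    0     0     0     0     0     0.71  1     0.82  0.41
m3    0     0     0     0     0     0.58  0.82  1     0.67
m4    0     0.24  0     0     0     0     0.41  0.67  1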
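The document-by-document cosine values can be recomputed directly from the 12×9 term-document matrix of the example. This is a plain-Python sketch; the helper names are illustrative.

```python
import math

docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
# Rows are terms (human, interface, computer, user, system, response,
# time, EPS, survey, trees, graph, minors); columns are documents.
A = [
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
]


def column(matrix, j):
    return [row[j] for row in matrix]


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


n = len(docs)
sim = [[cosine(column(A, i), column(A, j)) for j in range(n)] for i in range(n)]
for name, row in zip(docs, sim):
    print(name, " ".join(f"{value:5.2f}" for value in row))
```

The c documents and the m documents form two blocks that share almost no similarity, with the word "survey" as the only link (c2 and m4).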
Query: Human-Computer Interaction
q^T = <1 0 1 0 0 0 0 0 0 0 0 0>
$$\text{Sim}(x, y) = \frac{x \cdot y}{|x|\,|y|} = \frac{\sum_{k=1}^{t} x_k y_k}{\sqrt{\sum_{k=1}^{t} x_k^2}\, \sqrt{\sum_{k=1}^{t} y_k^2}}$$
            c1  c2  c3  c4  c5  m1  m2  m3  m4
human        1   0   0   1   0   0   0   0   0
interface    1   0   1   0   0   0   0   0   0
computer     1   1   0   0   0   0   0   0   0
user         0   1   1   0   1   0   0   0   0
system       0   1   1   2   0   0   0   0   0
response     0   1   0   0   1   0   0   0   0
time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
survey       0   1   0   0   0   0   0   0   1
trees        0   0   0   0   0   1   1   1   0
graph        0   0   0   0   0   0   1   1   1
minors       0   0   0   0   0   0   0   1   1

        c1    c2    c3  c4    c5  m1  m2  m3  m4
Query   0.82  0.28  0   0.35  0   0   0   0   0
Query: Human-Computer Interaction
c1 Human machine interface for Lab ABC computer application
c2 A survey of user opinion of computer system response time
c3 The EPS user interface management system
c4 System and human system engineering testing of EPS
c5 Relations of user-perceived response time to error measurement
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: Widths of trees and well-quasi-ordering
m4 Graph minors: A survey

        c1    c2    c3  c4    c5  m1  m2  m3  m4
Query   0.82  0.28  0   0.35  0   0   0   0   0
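The query scores can be recomputed from the term counts of the nine titles. This is a sketch using sparse dictionaries; the query is taken as the two indexed terms "human" and "computer". Computed directly, c2 and c4 both come out at about 0.29, so the 0.28 and 0.35 listed above may reflect truncation or a different count for c4.

```python
import math

# Term counts per document, read off the term-document matrix of the example.
docs = {
    "c1": {"human": 1, "interface": 1, "computer": 1},
    "c2": {"computer": 1, "user": 1, "system": 1,
           "response": 1, "time": 1, "survey": 1},
    "c3": {"interface": 1, "user": 1, "system": 1, "EPS": 1},
    "c4": {"human": 1, "system": 2, "EPS": 1},
    "c5": {"user": 1, "response": 1, "time": 1},
    "m1": {"trees": 1},
    "m2": {"trees": 1, "graph": 1},
    "m3": {"trees": 1, "graph": 1, "minors": 1},
    "m4": {"survey": 1, "graph": 1, "minors": 1},
}
query = {"human": 1, "computer": 1}


def cosine(x, y):
    """Cosine similarity of two sparse term -> weight dicts."""
    dot = sum(weight * y.get(term, 0) for term, weight in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


scores = {name: cosine(query, vector) for name, vector in docs.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Documents c3 and c5 score zero even though they are about human-computer interaction, because they share no literal term with the query; this is the gap that LSA is meant to close.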
12 × 9 term-document matrix (t = 12, d = 9)
A =
  1  0  0  1  0  0  0  0  0
  1  0  1  0  0  0  0  0  0
  1  1  0  0  0  0  0  0  0
  0  1  1  0  1  0  0  0  0
  0  1  1  2  0  0  0  0  0
  0  1  0  0  1  0  0  0  0
  0  1  0  0  1  0  0  0  0
  0  0  1  1  0  0  0  0  0
  0  1  0  0  0  0  0  0  1
  0  0  0  0  0  1  1  1  0
  0  0  0  0  0  0  1  1  1
  0  0  0  0  0  0  0  1  1
SVD
$$A = U S V^T$$
U =
  0.22  -0.11   0.29  -0.41  -0.11  -0.34   0.52  -0.06  -0.41
  0.20  -0.07   0.14  -0.55   0.28   0.50  -0.07  -0.01  -0.11
  0.24   0.04  -0.16  -0.59  -0.11  -0.25  -0.30   0.06   0.49
  0.40   0.06  -0.34   0.10   0.33   0.38   0.00   0.00   0.01
  0.64  -0.17   0.36   0.33  -0.16  -0.21  -0.17   0.03   0.27
  0.27   0.11  -0.43   0.07   0.08  -0.17   0.28  -0.02  -0.05
  0.27   0.11  -0.43   0.07   0.08  -0.17   0.28  -0.02  -0.05
  0.30  -0.14   0.33   0.19   0.11   0.27   0.03  -0.02  -0.17
  0.21   0.27  -0.18  -0.03  -0.54   0.08  -0.47  -0.04  -0.58
  0.01   0.49   0.23   0.03   0.59  -0.39  -0.29   0.25  -0.23
  0.04   0.62   0.22   0.00  -0.07   0.11   0.16  -0.68   0.23
  0.03   0.45   0.14  -0.01  -0.30   0.28   0.34   0.68   0.18
SVD
$$A = U S V^T$$
S = diag(3.34, 2.54, 2.35, 1.64, 1.50, 1.31, 0.85, 0.56, 0.36)
(S is r × r, with r = 9 and r ≤ min(t, d))
SVD
$$A = U S V^T$$
V (d × r) =
  0.20  -0.06   0.11  -0.95   0.05  -0.08   0.18  -0.01  -0.06
  0.61   0.17  -0.50  -0.03  -0.21  -0.26  -0.43   0.05   0.24
  0.46  -0.13   0.21   0.04   0.38   0.72  -0.24   0.01   0.02
  0.54  -0.23   0.57   0.27  -0.21  -0.37   0.26  -0.02  -0.08
  0.28   0.11  -0.51   0.15   0.33   0.03   0.67  -0.06  -0.26
  0.00   0.19   0.10   0.02   0.39  -0.30  -0.34   0.45  -0.62
  0.01   0.44   0.19   0.02   0.35  -0.21  -0.15  -0.76   0.02
  0.02   0.62   0.25   0.01   0.15   0.00   0.25   0.45   0.52
  0.08   0.53   0.08  -0.03  -0.60   0.36   0.04  -0.07  -0.45
SVD
K = 2: U: t×r → t×k
$$A_k = U_k S_k V_k^T$$
U_k =
  0.22  -0.11
  0.20  -0.07
  0.24   0.04
  0.40   0.06
  0.64  -0.17
  0.27   0.11
  0.27   0.11
  0.30  -0.14
  0.21   0.27
  0.01   0.49
  0.04   0.62
  0.03   0.45
S_k = diag(3.34, 2.54)
V_k^T =
    c1     c2     c3     c4     c5     m1     m2     m3     m4
   0.20   0.61   0.46   0.54   0.28   0.00   0.02   0.02   0.08
  -0.06   0.17  -0.13  -0.23   0.11   0.19   0.44   0.62   0.53
k = 2
$$M_k(i, j) = (A_k^T A_k)(i, j) = \sum_{x=1}^{k} V_{ix} V_{jx} S_{xx}^2$$
       c1     c2     c3     c4     c5     m1     m2     m3     m4
c1    0.47   1.3    1.08   1.29   0.58  -0.03  -0.13  -0.2   -0.03
c2    1.3    4.34   2.99   3.42   2.03   0.34   0.62   0.82   1.13
c3    1.08   2.99   2.47   2.96   1.34  -0.06  -0.27  -0.42  -0.03
c4    1.29   3.42   2.96   3.59   1.52  -0.16  -0.53  -0.8   -0.3
c5    0.58   2.03   1.34   1.52   0.95   0.2    0.37   0.5    0.63
m1   -0.03   0.34  -0.06  -0.16   0.2    0.24   0.54   0.76   0.67
m2   -0.13   0.62  -0.27  -0.53   0.37   0.54   1.25   1.76   1.52
m3   -0.2    0.82  -0.42  -0.8    0.5    0.76   1.76   2.48   2.14
m4   -0.03   1.13  -0.03  -0.3    0.63   0.67   1.52   2.14   1.88
$$M'(i, j) = \frac{M(i, j)}{\sqrt{M(i, i)\, M(j, j)}}$$
       c1     c2     c3     c4     c5     m1     m2     m3     m4
c1    1.00   0.91   0.99   0.99   0.87  -0.09  -0.16  -0.18  -0.03
c2    0.91   1.00   0.91   0.87   0.99   0.34   0.27   0.25   0.39
c3    0.99   0.91   1.00   0.99   0.88  -0.07  -0.15  -0.17  -0.02
c4    0.99   0.87   0.99   1.00   0.82  -0.17  -0.25  -0.27  -0.12
c5    0.87   0.99   0.88   0.82   1.00   0.41   0.34   0.33   0.47
m1   -0.09   0.34  -0.07  -0.17   0.41   1.00   0.99   0.99   0.99
m2   -0.16   0.27  -0.15  -0.25   0.34   0.99   1.00   0.99   0.99
m3   -0.18   0.25  -0.17  -0.27   0.33   0.99   0.99   1.00   0.99
m4   -0.03   0.39  -0.02  -0.12   0.47   0.99   0.99   0.99   1.00
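The rank-2 similarity matrices can be rebuilt from the two-column V_k and the two leading singular values. The sketch below uses the rounded two-digit values from the example, so its results can differ from the listed ones in the second decimal; it assumes the sum runs over the squared singular values, which is what reproduces entries such as M_k(c1, c1) = 0.47.

```python
import math

docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
S_k = [3.34, 2.54]  # two leading singular values
# Rows of V_k, one per document (rounded values from the example).
V_k = [
    (0.20, -0.06), (0.61, 0.17), (0.46, -0.13), (0.54, -0.23), (0.28, 0.11),
    (0.00, 0.19), (0.02, 0.44), (0.02, 0.62), (0.08, 0.53),
]


def m_k(i, j):
    """(A_k^T A_k)(i, j) = sum over x of V_ix * V_jx * S_xx^2."""
    return sum(V_k[i][x] * V_k[j][x] * S_k[x] ** 2 for x in range(len(S_k)))


n = len(docs)
M = [[m_k(i, j) for j in range(n)] for i in range(n)]
# Normalize to cosines: M'(i, j) = M(i, j) / sqrt(M(i, i) * M(j, j)).
M_norm = [[M[i][j] / math.sqrt(M[i][i] * M[j][j]) for j in range(n)]
          for i in range(n)]

for name, row in zip(docs, M_norm):
    print(name, " ".join(f"{value:5.2f}" for value in row))
```

In the reduced space every c document is close to every other c document, including pairs such as c1 and c5 that share no term in the original matrix.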
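SVD
$$A = U S V^T$$
Three ways to map the query q and the documents into the k-dimensional space (here A_k denotes the document representations in that space):
(1) $$q_k = U_k^T q, \quad q_k^T = q^T U_k$$
    $$A_k = U_k^T A = U_k^T U S V^T = S_k V_k^T$$
(2) $$q_k^T = q^T U_k S_k^{-1}, \quad A_k = S_k^{-1} U_k^T A = V_k^T$$
(3) $$q_k^T = q^T U_k S_k^{-1/2}, \quad A_k = S_k^{-1/2} U_k^T A = S_k^{1/2} V_k^T$$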
SVD
Query: Human-Computer Interaction
$$q_k^T = q^T U_k = (q \cdot u_1, q \cdot u_2, \ldots, q \cdot u_k)$$
q_k^T = q^T ×
  0.22  -0.11
  0.20  -0.07
  0.24   0.04
  0.40   0.06
  0.64  -0.17
  0.27   0.11
  0.27   0.11
  0.30  -0.14
  0.21   0.27
  0.01   0.49
  0.04   0.62
  0.03   0.45
SVD
q_k^T = <0.46 -0.07>
$$\text{sim}(Q, D_i) = \frac{q_k \cdot (S_k V_k^T)^{(i)}}{|q_k|\, |(S_k V_k^T)^{(i)}|}$$
where (S_k V_k^T)^{(i)} is the i-th column of S_k V_k^T.
S_k = diag(3.34, 2.54)
V_k^T =
    c1     c2     c3     c4     c5     m1     m2     m3     m4
   0.20   0.61   0.46   0.54   0.28   0.00   0.02   0.02   0.08
  -0.06   0.17  -0.13  -0.23   0.11   0.19   0.44   0.62   0.53
sim(Q, D_i):
    c1     c2     c3     c4     c5     m1     m2     m3     m4
   0.99   0.93   0.99   0.98   0.90  -0.01  -0.09  -0.11   0.04
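A sketch of this first scheme: project the query with U_k, represent each document by the matching column of S_k V_k^T, and compare by cosine. It uses the rounded two-digit factors from the example, so small differences from the listed scores are expected; m1 in particular comes out near -0.15 with these rounded inputs.

```python
import math

docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
# Rows of U_k in term order: human, interface, computer, user, system,
# response, time, EPS, survey, trees, graph, minors.
U_k = [
    (0.22, -0.11), (0.20, -0.07), (0.24, 0.04), (0.40, 0.06),
    (0.64, -0.17), (0.27, 0.11), (0.27, 0.11), (0.30, -0.14),
    (0.21, 0.27), (0.01, 0.49), (0.04, 0.62), (0.03, 0.45),
]
S_k = [3.34, 2.54]
# Rows of V_k, one per document.
V_k = [
    (0.20, -0.06), (0.61, 0.17), (0.46, -0.13), (0.54, -0.23), (0.28, 0.11),
    (0.00, 0.19), (0.02, 0.44), (0.02, 0.62), (0.08, 0.53),
]
q = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # query terms: human, computer


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# q_k^T = q^T U_k
q_k = [sum(q[t] * U_k[t][x] for t in range(len(q))) for x in range(2)]
# Document i is column i of S_k V_k^T, i.e. row i of V_k scaled by S_k.
sims = {name: cosine(q_k, [S_k[x] * v[x] for x in range(2)])
        for name, v in zip(docs, V_k)}

print("q_k =", [round(value, 2) for value in q_k])
for name, score in sims.items():
    print(f"{name}: {score:.2f}")
```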
Query: Human-Computer Interaction
Similarity of each document to the query in the original term space, q^T = (1 0 1 0 0 0 0 0 0 0 0 0), and in the k = 2 space, q_k^T = (0.46 -0.07):
                      q^T     q_k^T
c1                    0.82    0.99
c2                    0.28    0.93
c3                    0.00    0.99
c4                    0.35    0.98
c5                    0.00    0.90
m1 (trees)            0.00   -0.01
m2 (graph, trees)     0.00   -0.09
m3                    0.00   -0.11
m4                    0.00    0.04
SVD2
Query: Human-Computer Interaction
q^T = <1 0 1 0 0 0 0 0 0 0 0 0>
$$q_k^T = q^T U_k S_k^{-1} = \left(\frac{q \cdot u_1}{s_{11}}, \frac{q \cdot u_2}{s_{22}}, \ldots, \frac{q \cdot u_k}{s_{kk}}\right)$$
U_k =
  0.22  -0.11
  0.20  -0.07
  0.24   0.04
  0.40   0.06
  0.64  -0.17
  0.27   0.11
  0.27   0.11
  0.30  -0.14
  0.21   0.27
  0.01   0.49
  0.04   0.62
  0.03   0.45
S_k = diag(3.34, 2.54)
SVD2
q_k^T = <0.1377 -0.0276>
$$\text{sim}(Q, D_i) = \frac{q_k \cdot V_k^{(i)}}{|q_k|\, |V_k^{(i)}|}$$
where V_k^{(i)} is the i-th row of V_k.
V_k^T =
    c1     c2     c3     c4     c5     m1     m2     m3     m4
   0.20   0.61   0.46   0.54   0.28   0.00   0.02   0.02   0.08
  -0.06   0.17  -0.13  -0.23   0.11   0.19   0.44   0.62   0.53
sim(Q, D_i):
    c1     c2     c3     c4     c5     m1     m2     m3     m4
   0.99   0.89   0.99   0.96   0.84  -0.20  -0.15  -0.16  -0.05
Similarity of each document to the query under the two projections:
                      q_k^T = q^T U_k      q_k^T = q^T U_k S^{-1}
                      (0.46 -0.07)         (0.1377 -0.0276)
c1                     0.99                 0.99
c2                     0.93                 0.89
c3                     0.99                 0.99
c4                     0.98                 0.96
c5                     0.90                 0.84
m1 (trees)            -0.01                -0.20
m2 (graph, trees)     -0.09                -0.15
m3                    -0.11                -0.16
m4                     0.04                -0.05
SVD
(LSA)
LDA (Latent Dirichlet Allocation)