
(2013)

Email: yangjw@pku.edu.cn

1997:

3.2
1999: 8(15TB)
2002: 20(Google)
2004: 43(Google)
2006:100(Google)

A recent study shows:

1 TB/year for books,
2 TB/year for newspapers,
1 TB/year for periodicals,
19 TB/year for office documents.

68%

51%2
4


Information Retrieval

information storage and


retrieval

information search
5


IRTRECSDR (Speech
Document Retrieval) TrackVideo Track

(:,,,)


(,)
(1950s)
(,1960s)
1970s,1980s)
Web(1990s)



TR(Text Retrieval) ,

TRIR




15

User

The Web

Spider

Inverted
Index

Search
Engine

Library, etc.
16


,
: 80%

17

18

Precision:

$\text{precision} = \frac{|\{relevant\} \cap \{retrieved\}|}{|\{retrieved\}|}$

Recall:

$\text{recall} = \frac{|\{relevant\} \cap \{retrieved\}|}{|\{relevant\}|}$

$\text{Recall} = \frac{|\text{RelRetrieved}|}{|\text{Rel in Collection}|}$

$\text{Precision} = \frac{|\text{RelRetrieved}|}{|\text{Retrieved}|}$


Example:

{relevant} = {A, B, C, D, E, F, G, H, I, J}, |{relevant}| = 10
{retrieved} = {B, D, F, W, Y}, |{retrieved}| = 5
{relevant} ∩ {retrieved} = {B, D, F}, size 3
precision = 3/5 = 60%
recall = 3/10 = 30%
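The worked example above can be checked with a few lines of Python. This is a minimal sketch using set operations; the function name `precision_recall` is an illustrative choice, not something from the slides.

```python
# Relevant and retrieved sets from the slide example.
relevant = set("ABCDEFGHIJ")
retrieved = {"B", "D", "F", "W", "Y"}

def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two sets of document ids."""
    hits = relevant & retrieved  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

precision, recall = precision_recall(relevant, retrieved)
print(precision, recall)
```

With the sets above, three of the five retrieved documents are relevant, and three of the ten relevant documents were found.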

22

It is difficult to determine which of these two hypothetical results is better.

[Figure: two hypothetical results plotted with precision on the vertical axis and recall on the horizontal axis]

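F-measure

$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$

$\alpha > 0.5$: precision is more important
$\alpha < 0.5$: recall is more important
Usually $\alpha = 0.5$, which gives $F = \frac{2PR}{P + R}$

Example: precision = 60%, recall = 30%, so F = 2(0.6)(0.3) / (0.6 + 0.3) = 40%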
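The weighted harmonic mean is easy to get wrong by hand, so here is a small sketch of it. The parameter name `alpha` stands for the weight on precision; the function name is hypothetical.

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall.

    alpha > 0.5 weights precision more, alpha < 0.5 weights recall more.
    """
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

f = f_measure(0.6, 0.3)  # balanced F for the running example
print(round(f, 2))
```

For alpha = 0.5 this reduces to 2PR / (P + R).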


24


P@N: precision over the top N results
MRR: mean reciprocal rank
S@10: success within the top 10 results

25

26


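IR model

An IR model is a quadruple < D, Q, F, R(qi, dj) >, where:

D: a set of representations of the documents
Q: a set of representations of the user queries
F: a framework for modeling documents, queries, and their relationships
R(qi, dj): a ranking function that assigns a score to query qi and document dj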


(Hypertext)

Web
33



Chapter
Section
Paragraph

34

35

Index term weights: 0 or 1
Operators: and, or, not

DNF (disjunctive normal form)

$q = k_a \wedge (k_b \vee \neg k_c)$

p-norm
36

VSM

Example:

D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3

[Figure: D1, D2 and Q drawn as vectors in the three-dimensional term space with axes T1, T2, T3]

Is D1 or D2 more similar to Q?
How to measure the degree of similarity? Distance? Angle? Projection?
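One common answer to the questions above is the angle between the vectors, measured by cosine similarity. The sketch below applies it to D1, D2 and Q from the example; choosing cosine here is an assumption for illustration, since the slide leaves the measure open.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q = [0, 0, 2]

sim_d1 = cosine(D1, Q)
sim_d2 = cosine(D2, Q)
print(round(sim_d1, 3), round(sim_d2, 3))
```

Under this measure D1 is closer to Q, because most of its weight lies on T3, the only term in the query.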

37

(GVSM)

Generalized Vector Space Model

C = [D1, D2, ..., Dm]

vec(ti) = [Tf(ti, D1), Tf(ti, D2), ..., Tf(ti, Dm)]

sim(ti, tj) = cos(vec(ti), vec(tj))

$sim(q, d) = \sum_i \left[ \max_j \, sim(t_{q_i}, t_{d_j}) \right]$

38


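$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$

$P(d_j) = P(k_1) \cdots P(k_t)$

$sim(d_j, q) \sim \frac{\left(\prod_{g_i(d_j)=1} P(k_i \mid R)\right) \left(\prod_{g_i(d_j)=0} P(\bar{k}_i \mid R)\right)}{\left(\prod_{g_i(d_j)=1} P(k_i \mid \bar{R})\right) \left(\prod_{g_i(d_j)=0} P(\bar{k}_i \mid \bar{R})\right)}$

$sim(d_j, q) \sim \sum_{i=1}^{t} w_{iq} \, w_{ij} \, w_i$

(BM25)

$w = \log \frac{N - df + 0.5}{df + 0.5} \cdot \frac{(k_1 + 1)\, tf}{K + tf} \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf}$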
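The BM25 term weight can be sketched directly from the formula. The slide does not define K; the code assumes the usual Okapi length normalisation K = k1((1 - b) + b · dl / avdl), and the defaults k1 = 1.2, b = 0.75, k3 = 7 are illustrative values, not taken from the slides.

```python
import math

def bm25_term_weight(N, df, tf, qtf, dl, avdl, k1=1.2, b=0.75, k3=7.0):
    """BM25 weight of one term for one document and one query.

    N: number of documents, df: document frequency of the term,
    tf: term frequency in the document, qtf: term frequency in the query,
    dl / avdl: document length and average document length (assumed inputs).
    """
    K = k1 * ((1 - b) + b * dl / avdl)  # assumed definition of K
    idf = math.log((N - df + 0.5) / (df + 0.5))
    doc_part = (k1 + 1) * tf / (K + tf)
    query_part = (k3 + 1) * qtf / (k3 + qtf)
    return idf * doc_part * query_part

w = bm25_term_weight(N=1000, df=10, tf=1, qtf=1, dl=100, avdl=100)
print(round(w, 3))
```

For a document of average length with tf = 1 and qtf = 1, both saturation factors equal 1 and the weight reduces to the idf-like logarithm.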

39

(Unigram)

$P(W_1 W_2 \ldots W_n) = P(W_1) P(W_2) \cdots P(W_n)$

(N-gram)

$P(W_1 W_2 \ldots W_n) = P(W_1) P(W_2 \mid W_1) \cdots P(W_n \mid W_{n-1} W_{n-2} \ldots W_1)$

40

41

IR
Term
(,)


42

Q: "how to avoid heart disease"

D: "Factors in minimizing stroke and cardiac arrest: Recommended dietary and exercise regimens"

43

Q' = F[Q, Dret]
where F is, for example:
W(t, Q') = W(t, Q) + W(t, Drel) - W(t, Dirr), for each term t
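A minimal sketch of this reweighting follows. It assumes queries and documents are dictionaries of term weights, takes W(t, Drel) and W(t, Dirr) to be averages over the judged documents, and drops terms whose weight becomes non-positive; all three choices are assumptions for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

def average_weights(docs):
    """Average weight of each term over a list of {term: weight} documents."""
    totals = defaultdict(float)
    for doc in docs:
        for term, weight in doc.items():
            totals[term] += weight
    return {t: w / len(docs) for t, w in totals.items()} if docs else {}

def feedback_update(query, relevant, irrelevant):
    """W(t, Q') = W(t, Q) + W(t, Drel) - W(t, Dirr), keeping positive weights."""
    rel = average_weights(relevant)
    irr = average_weights(irrelevant)
    updated = {}
    for term in set(query) | set(rel) | set(irr):
        weight = query.get(term, 0.0) + rel.get(term, 0.0) - irr.get(term, 0.0)
        if weight > 0:
            updated[term] = weight
    return updated

query = {"heart": 1.0, "disease": 1.0}
relevant = [{"heart": 1.0, "cardiac": 1.0}]
irrelevant = [{"disease": 0.5, "stock": 1.0}]
new_query = feedback_update(query, relevant, irrelevant)
print(new_query)
```

The updated query gains "cardiac" from the relevant document, which is how feedback can bridge the vocabulary gap between a query and relevant documents.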
44


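[Figure: relevance feedback loop. The query goes to the retrieval engine, which searches the document collection and returns ranked results (d1 3.5, d2 2.4, ..., dk 0.5). Judgments on the results (d1 +, d2 +, d3 +, ..., dk ...), taken over the top 10, are fed back to produce an updated query.]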


(LSA)

Latent Semantic Analysis (LSA)


Latent Semantic Indexing (LSI)

47

48

49

GOOGLE USES LSI

Increasing its weight in ranking pages.

A ~ sign before the search term stands for semantic search:
~phone: the first link appearing is the page for Nokia, although the page does not contain the word "phone"
~humor: retrieved pages contain its synonyms: comedy, jokes, funny

Google AdSense sandbox: check which advertisements Google would put on your page.

(LSA)

Latent Semantic Analysis (LSA)


Latent Semantic Indexing (LSI)

singular value decomposition (SVD)
KK
51

Topic model

52

(LSA)

LSI
,
, LSI
10%--30%.
, LSI
/

53

(LSA)

Introduced in 1990; improved in 1995

S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman: Indexing by latent semantic analysis, J. American Society for Information Science, 41, 1990, pp. 391-407
M. W. Berry, S. T. Dumais, G. W. O'Brien: Using linear algebra for intelligent information retrieval, SIAM Review, 37, 1995, pp. 573-595
Based on spectral analysis of the term-document matrix

(LSA)

Weighted Frequency Matrix

Query Terms:
- Insulation
- Joint
55

(LSA)

term-by-document matrix

U: concept-by-term matrix
V: concept-by-document matrix
S: elements assign weights to concepts

56

Term-document (td) frequency matrix

       d1    d2    d3    d4    d5    d6
t1    322    85    35    69    15   320
t2    361    90    76    57    13   370
t3     25    33   160    48   221    26
t4     30   140    70   201    16    35


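1. Create a term frequency matrix (frequency matrix).
2. Compute the SVD of the frequency matrix, splitting it into three matrices U, S and V, where U and V are orthogonal ($U^T U = I$) and S is a K×K diagonal matrix of singular values.
3. For each document d, replace its original vector with a new one that excludes the terms eliminated during SVD.
4. Store all the resulting vectors and build an index over them.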

SVD

A
XY
A2-nnBC
X

A2-nnBD
59

SVD

XYX1Y1
XY

A2-nnBC
X1

A2-nnBC

60

SVD

61

SVD

62

SVD

63

(LSA)

For every m×n matrix A there is a singular value decomposition (SVD)

$A = U \Sigma V^T$

U: orthogonal m×r matrix whose columns are the left singular vectors of A ($U^T U = I$; rows of U correspond to terms)
Σ: r×r diagonal matrix whose diagonal holds the singular values of A in descending order ($\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$)
V: orthogonal n×r matrix whose columns are the right singular vectors of A ($V^T V = I$; rows of V correspond to documents)

(LSA)

t is the number of rows of X
d is the number of columns of X
r is the rank of X (r ≤ min(t, d))

65

(LSA)

For LSI a truncated SVD is used (k < r):

$A_k = U_k \Sigma_k V_k^T$

where
Uk is the m×k matrix whose columns are the first k left singular vectors of A
Σk is the k×k diagonal matrix whose diagonal is formed by the k leading singular values of A
Vk is the n×k matrix whose columns are the first k right singular vectors of A
Rows of Uk = terms
Rows of Vk = documents
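The truncation can be sketched with NumPy's `numpy.linalg.svd`. The small matrix `A` below is a made-up example, not data from the slides, and `truncated_svd` is a hypothetical helper name.

```python
import numpy as np

def truncated_svd(A, k):
    """Return (U_k, S_k, V_k^T) built from the k leading singular triplets of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Hypothetical 4-term x 3-document matrix, for illustration only.
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])

Uk, Sk, Vtk = truncated_svd(A, k=2)
Ak = Uk @ Sk @ Vtk  # rank-k approximation with the same shape as A
print(Ak.round(2))
```

The rows of `Uk` are the k-dimensional term representations and the columns of `Vtk` (rows of V_k) are the k-dimensional document representations.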
66

(LSA)

67

(LSA)

68

SVD Example

D (10 × 2)              U (10 × 2)
 2.9002    3.6790       -0.2750   -0.1242
 4.0860    5.2366       -0.3896   -0.1846
 1.9954    3.3687       -0.2247   -0.2369
 3.5069    1.6748       -0.2150    0.3514
 4.4620    2.7684       -0.3005    0.3318
-2.9444   -4.6447        0.3177    0.2906
-4.1132   -4.7043        0.3682    0.0833
-3.6208   -5.0181        0.3613    0.2319
-3.0558   -4.1821        0.3027    0.1861
-6.1204   -2.4790        0.3563   -0.6935

S (2 × 2)
16.9491    0
 0         3.8491

V^T (2 × 2)
-0.6960   -0.7181
 0.7181   -0.6960

SVD Example

70

LSI: SVD for IR

The entries of $(A^T A)_{n \times n}$ may be interpreted as the pairwise document similarities in vector space.
$A^T A = (U S V^T)^T (U S V^T) = V S^T U^T U S V^T = V S^2 V^T$

Similarly, the entries of $(A A^T)_{m \times m}$ give the pairwise term similarities.
$A A^T = U S^2 U^T$

The representations are vectors in an r-dimensional subspace.
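Both identities can be checked numerically. The sketch below uses a small made-up term-document matrix (an assumption for illustration) and NumPy's SVD.

```python
import numpy as np

# Hypothetical 4-term x 3-document matrix, for illustration only.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
S2 = np.diag(s ** 2)

doc_sims = Vt.T @ S2 @ Vt   # V S^2 V^T, pairwise document inner products
term_sims = U @ S2 @ U.T    # U S^2 U^T, pairwise term inner products

print(np.allclose(doc_sims, A.T @ A), np.allclose(term_sims, A @ A.T))
```

Because U has orthonormal columns, the factor U^T U drops out, which is why document similarities depend only on V and S.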
71

SVD for IR

72

Latent semantic indexing


SVD-
Term

Document

Documents

Terms

car

SVD

U
auto

k-dim vector
73

Latent semantic indexing

SVDd-

k
B+R
74

Latent Semantic Kernels

LSI uses a reduction to the first k columns of U.

$d \mapsto \phi(d) U_k$

$\tilde{k}(d_1, d_2) = \phi(d_1) U_k U_k^T \phi(d_2)^T = \sum_{i,j} \phi(d_1)_i \, (U_k U_k^T)_{ij} \, \phi(d_2)_j$

pLSA
Probabilistic Latent Semantic Analysis

76

pLSA

EM (expectation-maximization)
E step: estimate the expected values
M step: re-estimate the parameters

77

pLSA

document
pLSA

document i

(1) the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting, and
(2) it is not clear how to assign probability to a document outside of the training set.

Topic model

79

Topic model

$p(w \mid d) = \sum_{z} p(w \mid z) \, p(z \mid d)$
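In matrix form this mixture is a single product of the document-topic and topic-word tables. The probabilities below are made up for illustration (two topics, four words, two documents).

```python
import numpy as np

# Hypothetical p(w | z): one row per topic, one column per word.
p_w_given_z = np.array([[0.5, 0.3, 0.2, 0.0],
                        [0.0, 0.1, 0.3, 0.6]])

# Hypothetical p(z | d): one row per document, one column per topic.
p_z_given_d = np.array([[0.9, 0.1],
                        [0.2, 0.8]])

# p(w | d) = sum over z of p(w | z) p(z | d), computed for all d and w at once.
p_w_given_d = p_z_given_d @ p_w_given_z
print(p_w_given_d)
```

Each row of the result is again a probability distribution over words, since it is a convex combination of the topic distributions.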

80

LDA
pLSA
doc~topic

LDA

pLSA

LDAdoc~topic
model

doc~topic
81

LDA (Latent Dirichlet Allocation)

82

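LDA
LDA assumes the following generative process for each document w in a corpus D:

1. Choose N ~ Poisson(ξ). (N: the number of words in the document.)
2. Choose θ ~ Dirichlet(α). (θ: a k-dimensional vector of topic proportions, where k is the number of topics.)
3. For each of the N words wn:
(a) Choose a topic zn ~ Multinomial(θ).
(b) Choose a word wn from p(wn | zn, β), a multinomial probability conditioned on the topic zn.
(β is a k×V matrix, where V is the vocabulary size and β_ij is the probability of word j under topic i.)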
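The generative process above can be sketched with NumPy's random generator. The values of `alpha`, `beta` and `xi` are toy assumptions (two topics, a four-word vocabulary), and `generate_document` is a hypothetical helper; this samples from the model only and does not perform inference.

```python
import numpy as np

def generate_document(alpha, beta, xi, rng):
    """Sample one document following the LDA generative process."""
    n_words = rng.poisson(xi)                  # 1. N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)               # 2. theta ~ Dirichlet(alpha)
    # 3(a). one topic per word position, drawn from Multinomial(theta)
    topics = rng.choice(len(alpha), size=n_words, p=theta)
    # 3(b). one word per position, drawn from the chosen topic's row of beta
    words = np.array([rng.choice(beta.shape[1], p=beta[z]) for z in topics],
                     dtype=int)
    return theta, topics, words

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5])                   # assumed Dirichlet parameter
beta = np.array([[0.7, 0.2, 0.1, 0.0],         # assumed topic-word matrix (k x V)
                 [0.0, 0.1, 0.2, 0.7]])
theta, topics, words = generate_document(alpha, beta, xi=20, rng=rng)
print(theta, words)
```

Each call produces a new topic mixture θ, so documents differ in their topic proportions while sharing the same topic-word matrix β.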

83

LDA

Parameter estimation: EM, or Gibbs sampling.

84

85


9 documents:

c1 Human machine interface for Lab ABC computer application
c2 A survey of user opinion of computer system response time
c3 The EPS user interface management system
c4 System and human system engineering testing of EPS
c5 Relations of user-perceived response time to error measurement
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: Widths of trees and well-quasi-ordering
m4 Graph minors: A survey

86


12 terms:

            c1  c2  c3  c4  c5  m1  m2  m3  m4
human        1   0   0   1   0   0   0   0   0
interface    1   0   1   0   0   0   0   0   0
computer     1   1   0   0   0   0   0   0   0
user         0   1   1   0   1   0   0   0   0
system       0   1   1   2   0   0   0   0   0
response     0   1   0   0   1   0   0   0   0
time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
survey       0   1   0   0   0   0   0   0   1
trees        0   0   0   0   0   1   1   1   0
graph        0   0   0   0   0   0   1   1   1
minors       0   0   0   0   0   0   0   1   1

$Sim(x, y) = x \cdot y = \sum_{k=1}^{t} x_k y_k$

$M = A^T A$

M = ?

$Sim(x, y) = x \cdot y = \sum_{k=1}^{t} x_k y_k$

$M = A^T A$

$Sim(x, y) = \frac{x \cdot y}{|x||y|} = \frac{\sum_{k=1}^{t} x_k y_k}{\sqrt{\sum_{k=1}^{t} x_k^2} \, \sqrt{\sum_{k=1}^{t} y_k^2}} = \frac{M(i,j)}{\sqrt{M(i,i) \, M(j,j)}}$

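$Sim(x, y) = \frac{x \cdot y}{|x||y|} = \frac{\sum_{k=1}^{t} x_k y_k}{\sqrt{\sum_{k=1}^{t} x_k^2} \, \sqrt{\sum_{k=1}^{t} y_k^2}}$

        c1     c2     c3     c4     c5     m1     m2     m3     m4
c1      1      0.24   0.29   0.24   0      0      0      0      0
c2      0.24   1      0.41   0.33   0.71   0      0      0      0.24
c3      0.29   0.41   1      0.61   0.29   0      0      0      0
c4      0.24   0.33   0.61   1      0      0      0      0      0
c5      0      0.71   0.29   0      1      0      0      0      0
m1      0      0      0      0      0      1      0.71   0.58   0
m2      0      0      0      0      0      0.71   1      0.82   0.41
m3      0      0      0      0      0      0.58   0.82   1      0.67
m4      0      0.24   0      0      0      0      0.41   0.67   1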
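The document-document similarity matrix follows mechanically from M = A^T A. This sketch rebuilds it with NumPy from the 12 × 9 term-document counts of the example (rows in the order human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors).

```python
import numpy as np

# Term-document matrix of the example: 12 terms (rows) x 9 documents (columns).
A = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

M = A.T @ A                        # M(i, j) = inner product of documents i and j
norms = np.sqrt(np.diag(M))        # document vector lengths
cos = M / np.outer(norms, norms)   # M(i, j) / sqrt(M(i, i) M(j, j))
print(cos.round(2))
```

Documents from the c group and the m group share almost no terms, so their raw cosine similarities are mostly zero.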


Query: Human-Computer Interaction
q^T = <1 0 1 0 0 0 0 0 0 0 0 0>

$Sim(x, y) = \frac{x \cdot y}{|x||y|} = \frac{\sum_{k=1}^{t} x_k y_k}{\sqrt{\sum_{k=1}^{t} x_k^2} \, \sqrt{\sum_{k=1}^{t} y_k^2}}$

            c1  c2  c3  c4  c5  m1  m2  m3  m4
human        1   0   0   1   0   0   0   0   0
interface    1   0   1   0   0   0   0   0   0
computer     1   1   0   0   0   0   0   0   0
user         0   1   1   0   1   0   0   0   0
system       0   1   1   2   0   0   0   0   0
response     0   1   0   0   1   0   0   0   0
time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
survey       0   1   0   0   0   0   0   0   1
trees        0   0   0   0   0   1   1   1   0
graph        0   0   0   0   0   0   1   1   1
minors       0   0   0   0   0   0   0   1   1

Query similarity:

         c1     c2     c3   c4     c5   m1   m2   m3   m4
Query    0.82   0.28   0    0.35   0    0    0    0    0


Query: Human-Computer Interaction
c1 Human machine interface for Lab ABC computer application
c2 A survey of user opinion of computer system response time
c3 The EPS user interface management system
c4 System and human system engineering testing of EPS
c5 Relations of user-perceived response time to error measurement
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: Widths of trees and well-quasi-ordering
m4 Graph minors: A survey

Query similarity:

         c1     c2     c3   c4     c5   m1   m2   m3   m4
Query    0.82   0.28   0    0.35   0    0    0    0    0
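Matching the query against the raw term vectors can be sketched as follows. The matrix is the 12 × 9 term-document matrix of the example; the exact figures for individual documents depend on the term counts and weighting used, so they may differ slightly from those printed on the slide.

```python
import numpy as np

# Term-document matrix of the example: 12 terms (rows) x 9 documents (columns).
A = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

# Query "Human-Computer Interaction": only "human" and "computer" are index terms.
q = np.array([1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)

# Cosine similarity between the query and every document column.
sims = (q @ A) / (np.linalg.norm(q) * np.linalg.norm(A, axis=0))
for name, s in zip(["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"], sims):
    print(name, round(float(s), 2))
```

Documents such as c3 and c5 score zero even though they are about the same topic, because they share no literal term with the query. This is the limitation that LSI addresses.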


A is a 12 × 9 matrix (t = 12, d = 9)

A =

1 0 0 1 0 0 0 0 0
1 0 1 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0
0 1 1 0 1 0 0 0 0
0 1 1 2 0 0 0 0 0
0 1 0 0 1 0 0 0 0
0 1 0 0 1 0 0 0 0
0 0 1 1 0 0 0 0 0
0 1 0 0 0 0 0 0 1
0 0 0 0 0 1 1 1 0
0 0 0 0 0 0 1 1 1
0 0 0 0 0 0 0 1 1

SVD
$A = U S V^T$

U = (t × r, t = 12, r = 9, r ≤ min(t, d))

 0.22  -0.11   0.29  -0.41  -0.11  -0.34   0.52  -0.06  -0.41
 0.20  -0.07   0.14  -0.55   0.28   0.50  -0.07  -0.01  -0.11
 0.24   0.04  -0.16  -0.59  -0.11  -0.25  -0.30   0.06   0.49
 0.40   0.06  -0.34   0.10   0.33   0.38   0.00   0.00   0.01
 0.64  -0.17   0.36   0.33  -0.16  -0.21  -0.17   0.03   0.27
 0.27   0.11  -0.43   0.07   0.08  -0.17   0.28  -0.02  -0.05
 0.27   0.11  -0.43   0.07   0.08  -0.17   0.28  -0.02  -0.05
 0.30  -0.14   0.33   0.19   0.11   0.27   0.03  -0.02  -0.17
 0.21   0.27  -0.18  -0.03  -0.54   0.08  -0.47  -0.04  -0.58
 0.01   0.49   0.23   0.03   0.59  -0.39  -0.29   0.25  -0.23
 0.04   0.62   0.22   0.00  -0.07   0.11   0.16  -0.68   0.23
 0.03   0.45   0.14  -0.01  -0.30   0.28   0.34   0.68   0.18

SVD
$A = U S V^T$

S = (r × r, r = 9, r ≤ min(t, d))

S = diag(3.34, 2.54, 2.35, 1.64, 1.50, 1.31, 0.85, 0.56, 0.36)

SVD
$A = U S V^T$

V = (d × r)

 0.20  -0.06   0.11  -0.95   0.05  -0.08   0.18  -0.01  -0.06
 0.61   0.17  -0.50  -0.03  -0.21  -0.26  -0.43   0.05   0.24
 0.46  -0.13   0.21   0.04   0.38   0.72  -0.24   0.01   0.02
 0.54  -0.23   0.57   0.27  -0.21  -0.37   0.26  -0.02  -0.08
 0.28   0.11  -0.51   0.15   0.33   0.03   0.67  -0.06  -0.26
 0.00   0.19   0.10   0.02   0.39  -0.30  -0.34   0.45  -0.62
 0.01   0.44   0.19   0.02   0.35  -0.21  -0.15  -0.76   0.02
 0.02   0.62   0.25   0.01   0.15   0.00   0.25   0.45   0.52
 0.08   0.53   0.08  -0.03  -0.60   0.36   0.04  -0.07  -0.45

SVD

97

SVD
K = 2: U: t×r → t×k; S: r×r → k×k; V^T: r×d → k×d; A → ?

$A_k = U_k S_k V_k^T$ (k = 2)

U_k (t × k):

 0.22  -0.11
 0.20  -0.07
 0.24   0.04
 0.40   0.06
 0.64  -0.17
 0.27   0.11
 0.27   0.11
 0.30  -0.14
 0.21   0.27
 0.01   0.49
 0.04   0.62
 0.03   0.45

S_k = diag(3.34, 2.54)

V_k^T (k × d):

 0.20   0.61   0.46   0.54   0.28   0.00   0.02   0.02   0.08
-0.06   0.17  -0.13  -0.23   0.11   0.19   0.44   0.62   0.53

$A^T A = (U S V^T)^T (U S V^T) = V S^T U^T U S V^T = V S^2 V^T$

$M_k(i, j) = (A^T A)_k(i, j) = \sum_{x=1}^{k} V_{ix} V_{jx} S_{xx}^2$

        c1      c2      c3      c4      c5      m1      m2      m3      m4
c1      0.47    1.3     1.08    1.29    0.58   -0.03   -0.13   -0.2    -0.03
c2      1.3     4.34    2.99    3.42    2.03    0.34    0.62    0.82    1.13
c3      1.08    2.99    2.47    2.96    1.34   -0.06   -0.27   -0.42   -0.03
c4      1.29    3.42    2.96    3.59    1.52   -0.16   -0.53   -0.8    -0.3
c5      0.58    2.03    1.34    1.52    0.95    0.2     0.37    0.5     0.63
m1     -0.03    0.34   -0.06   -0.16    0.2     0.24    0.54    0.76    0.67
m2     -0.13    0.62   -0.27   -0.53    0.37    0.54    1.25    1.76    1.52
m3     -0.2     0.82   -0.42   -0.8     0.5     0.76    1.76    2.48    2.14
m4     -0.03    1.13   -0.03   -0.3     0.63    0.67    1.52    2.14    1.88


$M'(i, j) = \frac{M(i, j)}{\sqrt{M(i, i) \, M(j, j)}}$

        c1      c2      c3      c4      c5      m1      m2      m3      m4
c1      1.00    0.91    0.99    0.99    0.87   -0.09   -0.16   -0.18   -0.03
c2      0.91    1.00    0.91    0.87    0.99    0.34    0.27    0.25    0.39
c3      0.99    0.91    1.00    0.99    0.88   -0.07   -0.15   -0.17   -0.02
c4      0.99    0.87    0.99    1.00    0.82   -0.17   -0.25   -0.27   -0.12
c5      0.87    0.99    0.88    0.82    1.00    0.41    0.34    0.33    0.47
m1     -0.09    0.34   -0.07   -0.17    0.41    1.00    0.99    0.99    0.99
m2     -0.16    0.27   -0.15   -0.25    0.34    0.99    1.00    0.99    0.99
m3     -0.18    0.25   -0.17   -0.27    0.33    0.99    0.99    1.00    0.99
m4     -0.03    0.39   -0.02   -0.12    0.47    0.99    0.99    0.99    1.00

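SVD
$A = U S V^T$

Projecting the query and the documents into the k-dimensional space:

$q_k = U_k^T q$, that is, $q_k^T = q^T U_k$

$\hat{A}_k = U_k^T A = U_k^T U S V^T = S_k V_k^T$

Query: $q_k^T = q^T U_k$
Documents: $\hat{A}_k = S_k V_k^T$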

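SVD
$A = U S V^T$

Three ways to map the query and the documents into the k-dimensional space:

Projection with $U_k^T$:
$q_k^T = q^T U_k$, $\hat{A}_k = S_k V_k^T$

Projection with $S_k^{-1} U_k^T$:
$q_k^T = q^T U_k S_k^{-1}$, $\hat{A}_k = V_k^T$

Projection with $S_k^{-1/2} U_k^T$:
$q_k^T = q^T U_k S_k^{-1/2}$, $\hat{A}_k = S_k^{1/2} V_k^T$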

SVD
Query: Human-Computer Interaction

q^T = <1 0 1 0 0 0 0 0 0 0 0 0>

$q_k^T = q^T U_k = \langle q \cdot u_1, q \cdot u_2, \ldots, q \cdot u_k \rangle$

q_k^T = <0.46 -0.07>

U_k:

 0.22  -0.11
 0.20  -0.07
 0.24   0.04
 0.40   0.06
 0.64  -0.17
 0.27   0.11
 0.27   0.11
 0.30  -0.14
 0.21   0.27
 0.01   0.49
 0.04   0.62
 0.03   0.45

SVD
q_k^T = <0.46 -0.07>

$sim(Q, D_i) = \frac{q_k \cdot (S_k V_k^T)_{(i)}}{|q_k| \, |(S_k V_k^T)_{(i)}|}$, where $(S_k V_k^T)_{(i)}$ is column i of $S_k V_k^T$

S_k = diag(3.34, 2.54)

V_k^T:

        c1      c2      c3      c4      c5      m1      m2      m3      m4
        0.20    0.61    0.46    0.54    0.28    0.00    0.02    0.02    0.08
       -0.06    0.17   -0.13   -0.23    0.11    0.19    0.44    0.62    0.53

sim(Q, D_i):

        c1      c2      c3      c4      c5      m1      m2      m3      m4
        0.99    0.93    0.99    0.98    0.90   -0.01   -0.09   -0.11    0.04


Query: Human-Computer Interaction

q^T = (1 0 1 0 0 0 0 0 0 0 0 0)
q_k^T = (0.46 -0.07)

Original term space (q)                               LSI space, k = 2 (q_k)
0.82   c1 human interface computer                    0.99   c3 eps user interface system
0.35   c4 system human system eps                     0.99   c1 human interface computer
0.28   c2 survey user computer system response time   0.98   c4 system human system eps
0.00   c3 eps user interface system                   0.93   c2 survey user computer system response time
0.00   c5 user response time                          0.90   c5 user response time
0.00   m1 trees                                       0.04   m4 graph minors survey
0.00   m2 graph trees                                 -0.01  m1 trees
0.00   m3 graph minors trees                          -0.09  m2 graph trees
0.00   m4 graph minors survey                         -0.11  m3 graph minors trees

SVD2
Query: Human-Computer Interaction
q^T = <1 0 1 0 0 0 0 0 0 0 0 0>

$q_k^T = q^T U_k S_k^{-1} = \langle q \cdot u_1 \, s_{11}^{-1}, \; q \cdot u_2 \, s_{22}^{-1}, \; \ldots, \; q \cdot u_k \, s_{kk}^{-1} \rangle$

q_k^T = <0.1377 -0.0276>

U_k:

 0.22  -0.11
 0.20  -0.07
 0.24   0.04
 0.40   0.06
 0.64  -0.17
 0.27   0.11
 0.27   0.11
 0.30  -0.14
 0.21   0.27
 0.01   0.49
 0.04   0.62
 0.03   0.45

S_k = diag(3.34, 2.54)

SVD2
q_k^T = <0.1377 -0.0276>

$sim(Q, D_i) = \frac{q_k \cdot V_k^{(i)}}{|q_k| \, |V_k^{(i)}|}$, where $V_k^{(i)}$ is row i of $V_k$

V_k^T:

        c1      c2      c3      c4      c5      m1      m2      m3      m4
        0.20    0.61    0.46    0.54    0.28    0.00    0.02    0.02    0.08
       -0.06    0.17   -0.13   -0.23    0.11    0.19    0.44    0.62    0.53

sim(Q, D_i):

        c1      c2      c3      c4      c5      m1      m2      m3      m4
        0.99    0.89    0.99    0.96    0.84   -0.20   -0.15   -0.16   -0.05


$q_k^T = q^T U_k$, q_k^T = (0.46 -0.07)                     $q_k^T = q^T U_k S_k^{-1}$, q_k^T = (0.1377 -0.0276)
0.99   c3 eps user interface system                   0.99   c3 eps user interface system
0.99   c1 human interface computer                    0.99   c1 human interface computer
0.98   c4 system human system eps                     0.96   c4 system human system eps
0.93   c2 survey user computer system response time   0.89   c2 survey user computer system response time
0.90   c5 user response time                          0.84   c5 user response time
0.04   m4 graph minors survey                         -0.05  m4 graph minors survey
-0.01  m1 trees                                       -0.15  m2 graph trees
-0.09  m2 graph trees                                 -0.16  m3 graph minors trees
-0.11  m3 graph minors trees                          -0.20  m1 trees
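Both query mappings can be reproduced with NumPy's SVD on the 12 × 9 term-document matrix of the example. This is a sketch under the assumption that documents are compared by cosine in the k = 2 space; the helper name `cosine_to_columns` is hypothetical, and values computed at full precision may differ in the second decimal from figures derived from rounded matrices.

```python
import numpy as np

# Term-document matrix of the example: 12 terms (rows) x 9 documents (columns).
A = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)
q = np.array([1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

def cosine_to_columns(qk, docs):
    """Cosine between the vector qk and every column of docs."""
    return (qk @ docs) / (np.linalg.norm(qk) * np.linalg.norm(docs, axis=0))

# Mapping 1: q_k = U_k^T q, documents = S_k V_k^T
sims1 = cosine_to_columns(Uk.T @ q, np.diag(sk) @ Vtk)

# Mapping 2: q_k = S_k^{-1} U_k^T q, documents = V_k^T
sims2 = cosine_to_columns((Uk.T @ q) / sk, Vtk)

print(sims1.round(2))
print(sims2.round(2))
```

The sign of each singular vector returned by the SVD is arbitrary, but it flips the query and document coordinates together, so the cosine values are unaffected.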

109

SVD

110

(LSA)
LDA (Latent Dirichlet Allocation)


111

112
