Based on the book "Similarity Search", published by Springer.
Web page: http://www.nmis.isti.cnr.it/amato/similarity-search-book/
ACM SAC Tutorial, March 2007
Quotation:
"An ability to assess similarity lies close to the core of cognition. The sense of sameness is the very keel and backbone of our thinking. An understanding of problem solving, categorization, memory retrieval, inductive reasoning, and other cognitive processes requires that we understand how humans assess similarity."
Application areas: finance; digital libraries (text retrieval, multimedia information retrieval).
Search Problem
similar?
image database
Feature-based Approach
image layer
feature layer
M = (D,d) [Kel55]
- Data domain D
- Total (distance) function d: D × D → ℝ (metric function or metric)
Metric postulates:
- non-negativity: ∀x,y ∈ D, d(x,y) ≥ 0
- symmetry: ∀x,y ∈ D, d(x,y) = d(y,x)
- identity: ∀x,y ∈ D, x = y ⇔ d(x,y) = 0
- triangle inequality: ∀x,y,z ∈ D, d(x,z) ≤ d(x,y) + d(y,z)
Metric Space
Another specification:
- ∀x,y ∈ D, d(x,y) ≥ 0 (non-negativity)
- ∀x,y ∈ D, d(x,y) = d(y,x) (symmetry)
- ∀x ∈ D, d(x,x) = 0 (reflexivity)
- ∀x,y ∈ D, x ≠ y ⇒ d(x,y) > 0 (positivity)
- ∀x,y,z ∈ D, d(x,z) ≤ d(x,y) + d(y,z) (triangle inequality)
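These postulates can be checked mechanically on a finite sample. A minimal Python sketch (the function name and the L2 example are mine, not from the tutorial; passing on a sample does not, of course, prove the function is a metric):

```python
import math
from itertools import product

def check_metric(points, d, eps=1e-12):
    """Empirically verify the metric postulates on a finite sample of objects."""
    for x, y, z in product(points, repeat=3):
        assert d(x, y) >= -eps                      # non-negativity
        assert abs(d(x, y) - d(y, x)) <= eps        # symmetry
        assert d(x, x) <= eps                       # reflexivity
        if x != y:
            assert d(x, y) > eps                    # positivity
        assert d(x, z) <= d(x, y) + d(y, z) + eps   # triangle inequality
    return True

# Example: the Euclidean (L2) distance on a few 2-D points
pts = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
l2 = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
print(check_metric(pts, l2))  # True
```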
Pseudo Metric
Positivity is relaxed: distinct objects may be at distance 0. Such objects can be treated as identical, since they are indistinguishable by d.
To be proved: d(x,y) = 0 ⇒ ∀z ∈ D, d(x,z) = d(y,z)
Since d(x,z) ≤ d(x,y) + d(y,z) = d(y,z) and d(y,z) ≤ d(y,x) + d(x,z) = d(x,z),
we get d(x,z) = d(y,z).
Quasi Metric
Symmetry is relaxed: d(x,y) = d(y,x) need not hold (e.g., one-way travel distances).
Super Metric (ultrametric)
∀x,y,z ∈ D : d(x,z) ≤ max{ d(x,y), d(y,z) }
Distance Measures
Discrete
functions returning only a small (pre-defined) set of
values
Continuous
functions in which the cardinality of the set of values
returned is very large or infinite.
Minkowski Distances
L_p[(x_1,…,x_n),(y_1,…,y_n)] = ( Σ_{i=1}^{n} |x_i − y_i|^p )^{1/p}
Special Cases
- L_1 (city-block): Σ_{i=1}^{n} |x_i − y_i|
- L_2 (Euclidean): ( Σ_{i=1}^{n} (x_i − y_i)² )^{1/2}
- L_∞ (maximum): max_{i=1..n} |x_i − y_i|
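The whole family fits in one small function. A sketch (function name and test vectors are illustrative):

```python
def minkowski(x, y, p):
    """L_p distance between two equal-length vectors x and y."""
    if p == float("inf"):                       # L_inf: maximum metric
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (0, 0), (3, 4)
print(minkowski(x, y, 1))             # 7.0  (city-block)
print(minkowski(x, y, 2))             # 5.0  (Euclidean)
print(minkowski(x, y, float("inf")))  # 4    (maximum)
```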
Quadratic Form Distance
d_M(x, y) = sqrt( (x − y)ᵀ M (x − y) )
Weighted Euclidean distance (M diagonal with weights w_i):
d_M(x, y) = sqrt( Σ_{i=1}^{n} w_i (x_i − y_i)² )
Example
Pure red: v_red = (0,1,0)
Pure orange: v_orange = (0,0,1)
Pure blue: v_blue = (1,0,0)
Example (cont.)
Red and orange are more alike than red and blue.
Matrix specification: M gives the red and orange pair a high similarity entry, while entries involving blue are low (0.2 in the figure).
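A sketch of the quadratic form distance in Python. The matrix entries below (0.9 for the red/orange pair, 0.2 elsewhere off the diagonal) are illustrative values of mine, chosen only to reproduce the ordering stated on the slide, not values from the tutorial:

```python
def quadratic_form_distance(x, y, M):
    """d_M(x, y) = sqrt((x - y)^T M (x - y)) for a similarity matrix M."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    s = sum(diff[i] * M[i][j] * diff[j] for i in range(n) for j in range(n))
    return s ** 0.5

# Dimensions ordered (blue, red, orange); illustrative similarity matrix:
# red/orange are similar (0.9), blue is dissimilar to both (0.2).
M = [[1.0, 0.2, 0.2],
     [0.2, 1.0, 0.9],
     [0.2, 0.9, 1.0]]
v_red, v_orange, v_blue = (0, 1, 0), (0, 0, 1), (1, 0, 0)

print(quadratic_form_distance(v_red, v_orange, M))  # ~0.447
print(quadratic_form_distance(v_red, v_blue, M))    # ~1.265
```

With these weights the distance agrees with intuition: red is closer to orange than to blue, which plain L2 on the histograms cannot express.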
Edit Distance
Minimum number of insertions, deletions, and substitutions needed to transform one string into the other.
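The standard dynamic-programming formulation of the edit (Levenshtein) distance can be sketched as follows; the example strings are mine:

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions transforming s into t (classic DP, O(|s|*|t|))."""
    prev = list(range(len(t) + 1))          # distances from "" to prefixes of t
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # delete cs
                           cur[j - 1] + 1,            # insert ct
                           prev[j - 1] + (cs != ct))) # substitute (or match)
        prev = cur
    return prev[-1]

print(edit_distance("survey", "surgery"))  # 2
print(edit_distance("ball", "hall"))       # 1
```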
Jaccard's Coefficient
Applied to vectors (Tanimoto similarity):
d_TS(x, y) = 1 − ( x·y ) / ( ‖x‖² + ‖y‖² − x·y )
where x·y is the scalar product and ‖x‖ is the Euclidean norm.
Hausdorff Distance
Distance between two sets A, B ⊆ D, built from an element distance d_e:
d_p(x, B) = inf_{y∈B} d_e(x, y),  d_p(A, y) = inf_{x∈A} d_e(x, y),
d_s(A, B) = sup_{x∈A} d_p(x, B),
d_s(B, A) = sup_{y∈B} d_p(A, y).
The Hausdorff distance is d(A, B) = max{ d_s(A,B), d_s(B,A) }.
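For finite sets the inf/sup formulas translate directly to min/max. A sketch (function name, the 1-D element distance, and the example sets are mine):

```python
def hausdorff(A, B, d):
    """Hausdorff distance between finite sets A and B under element distance d."""
    d_s_AB = max(min(d(x, y) for y in B) for x in A)  # sup over A of inf over B
    d_s_BA = max(min(d(x, y) for x in A) for y in B)  # sup over B of inf over A
    return max(d_s_AB, d_s_BA)

d1 = lambda a, b: abs(a - b)          # element distance on numbers
print(hausdorff({0, 1, 2}, {1, 2, 3}, d1))  # 1
```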
Similarity Queries
Range query
Nearest neighbor query
Reverse nearest neighbor query
Similarity join
Combined queries
Complex queries
Range Query
R(q,r) = { x ∈ X | d(q,x) ≤ r }
Nearest Neighbor Query
NN(q) = x : x ∈ X, ∀y ∈ X, d(q,x) ≤ d(q,y)
k-NN(q,k) = A : A ⊆ X, |A| = k, ∀x ∈ A, ∀y ∈ X − A, d(q,x) ≤ d(q,y)
[Figure: example with k = 5.]
Reverse Nearest Neighbor Query
kRNN(q) = { R ⊆ X | ∀x ∈ R : q ∈ kNN(x) ∧ ∀x ∈ X − R : q ∉ kNN(x) }
Example of 2-RNN
[Figure: query q and objects o1…o6; the objects whose 2-NN sets contain q form the answer.]
Similarity Join
Given X ⊆ D, Y ⊆ D, μ ≥ 0:
J(X, Y, μ) = { (x, y) ∈ X × Y : d(x, y) ≤ μ }
Combined Queries
kNN(q,r) = { R ⊆ X | |R| ≤ k ∧ ∀x ∈ R, ∀y ∈ X − R :
d(q,x) ≤ d(q,y) ∧ d(q,x) ≤ r }
Other combinations are defined by analogy.
Complex Queries
Several similarity predicates (e.g., color and shape) are evaluated at once; each predicate i yields a ranked list X_i.
Sorted access: read the lists until | ∩_i X_i | = k.
For all o ∈ ∪_i X_i, compute the aggregate score (e.g., avg) from the positions in the lists.
Example: X1 list (color), X2 list (shape), positions 1–4; avg(2,4) = 3, avg(4,1) = 2.5.
Partitioning Principles
- Ball partitioning: a pivot p and radius dm split X into { x : d(p,x) ≤ dm } and { x : d(p,x) > dm }; with two radii dm1, dm2 the pivot yields a multi-way split.
- Generalized hyperplane partitioning: pivots p1, p2 split X into
  { x ∈ X | d(p1,x) ≤ d(p2,x) } and { x ∈ X | d(p1,x) > d(p2,x) }.
[Figures: ball, multi-way ball, and generalized hyperplane partitioning.]
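Both principles can be sketched in a few lines; the median-radius choice, the 1-D example metric, and the sample data are illustrative simplifications of mine:

```python
import statistics

def ball_partition(X, p, d):
    """Ball partitioning: split X by the median distance dm from pivot p."""
    dm = statistics.median(d(p, x) for x in X)
    inside = [x for x in X if d(p, x) <= dm]
    outside = [x for x in X if d(p, x) > dm]
    return dm, inside, outside

def hyperplane_partition(X, p1, p2, d):
    """Generalized hyperplane partitioning between pivots p1 and p2."""
    left = [x for x in X if d(p1, x) <= d(p2, x)]
    right = [x for x in X if d(p1, x) > d(p2, x)]
    return left, right

d1 = lambda a, b: abs(a - b)
print(ball_partition([1, 2, 6, 9], 0, d1))            # (4.0, [1, 2], [6, 9])
print(hyperplane_partition([1, 2, 6, 9], 0, 10, d1))  # ([1, 2], [6, 9])
```

Choosing the median distance as dm balances the two ball partitions regardless of the data distribution.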
Basic Strategies
Partitioning principle
Query execution algorithm
Sequential scan, range query R(q,4): evaluate d(q,o) for every object and report those within radius 4 (in the figure the distances are 3, 10, 8, 1).
Sequential scan, k-NN:
- Initially: take the first k objects and order them with respect to their distance from q.
- All other objects are scanned sequentially and d(q,o) is evaluated.
- If d(q,o_i) < d(q,o_k), then o_i is inserted at the correct position in the answer and the previous last neighbor o_k is eliminated.
3-NN(q) example answer: distances 1, 9, 10.
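The two scans above can be sketched as follows; a max-heap keyed by negated distance implements the "insert and eliminate the last neighbor" step (the example data is mine):

```python
import heapq

def range_query(X, q, r, d):
    """R(q,r): scan every object, report those within distance r."""
    return [x for x in X if d(q, x) <= r]

def knn_query(X, q, k, d):
    """k-NN by sequential scan, keeping the k closest objects seen so far."""
    heap = []                                 # entries (-d(q,x), x): max-heap
    for x in X:
        dist = d(q, x)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, x))
        elif dist < -heap[0][0]:              # closer than the current k-th
            heapq.heapreplace(heap, (-dist, x))
    return sorted((-nd, x) for nd, x in heap)

d1 = lambda a, b: abs(a - b)
X = [3, 10, 8, 1, 9]                          # 1-D objects, query q = 0
print(range_query(X, 0, 4, d1))               # [3, 1]
print(knn_query(X, 0, 3, d1))                 # [(1, 1), (3, 3), (8, 8)]
```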
[Figures: ball partitioning (pivot p, radius r); an example tree with buckets B1 = {o1,o2,o3,o4}, B2 = {o8,o9}, B3 = {o5,o6,o7}.]
Range search example:
- B3 is not intersected by the query region, so it is pruned.
- Inspect elements in B2.
- Search is complete. Response = o8, o9.
[Figure: the tree with buckets B1 = {o1…o4}, B2 = {o8,o9}, B3 = {o5…o7}.]
Nearest-neighbor search example, with a queue PR of pending regions:
- Initialization: insert the root region into PR.
- Processing: process B1; skip B3 (it cannot contain a closer object); process B2; PR is empty, quit.
- Final result: o1, o4, o3, o2.
[Figure: buckets B1, B2, B3 and the query q.]
Incremental NN Search
Initialization: insert the root region into a priority queue PR keyed by (lower-bound) distance from q.
Processing: regions and objects are dequeued in increasing distance order; an object reaching the head of PR is reported immediately, so further neighbors are delivered incrementally on demand.
[Figure: evolution of PR over buckets B1, B2, B3.]
Response = o8, o9, o1, o2, o5, o4, o6, o3, o7
Pruning constraints for avoiding distance computations:
- object-pivot
- range-pivot
- pivot-pivot
- double-pivot
- pivot filtering
Explanatory Example
The index applies ball partitioning with pivots p1, p2, p3; leaf nodes hold {o4,o6,o10}, {o1,o5,o11}, {o2,o9}, {o3,o7,o8}.
[Figure: the partitioned space and the corresponding tree.]
Object-Pivot Distance Constraint
Distances from objects to pivots are computed during insertion and stored in the tree; at query time they are reused to bound d(q,o) without accessing o.
[Figure: the example tree with stored object-to-pivot distances.]
Estimation of d(q,o10): given d(q,p2) and the stored d(p2,o10), the unknown d(q,o10) can be bounded without computing it.
[Figures: cases for o4, o6, o10 around pivot p2 and the query ball (q, r).]
By the triangle inequality:
|d(q,p) − d(p,o)| ≤ d(q,o) ≤ d(q,p) + d(p,o)
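The bound can be wrapped in a small helper; the numeric example below is illustrative:

```python
def pivot_bounds(d_qp, d_po):
    """Bounds on the unknown d(q,o), from d(q,p) and a stored d(p,o):
    |d(q,p) - d(p,o)| <= d(q,o) <= d(q,p) + d(p,o)."""
    return abs(d_qp - d_po), d_qp + d_po

# For a range query R(q,r): discard o when the lower bound exceeds r,
# and accept o without computing d(q,o) when the upper bound is <= r.
lo, hi = pivot_bounds(5.0, 1.0)
print(lo, hi)   # 4.0 6.0
r = 3.0
print(lo > r)   # True: o is discarded with no distance computation
```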
Range-Pivot Distance Constraint
Each subtree stores the range [r_l, r_h] of distances between its objects and a pivot; combined with d(q,p), whole subtrees can be pruned (the "y"/"n"/"?" marks in the figure show the pruning decisions).
[Figure: the example tree with per-subtree distance ranges.]
70
r
q
rh
rl
o6
o10
p2
r
q
rh
rl
o6
o4
o10
p2
r
q
rh
rl
o6
o4
o10
71
Ex
tra
rh
p
rh
p
rl
rh
p
rl
rl
q
q
upper bound
ACM SAC Tutorial, March 2007
lower bound
72
Ex
tra
73
Combining constraints:
- Only range-pivot: some subtrees are pruned, but several distance computations remain.
- object-pivot + range-pivot: additional objects are filtered out, further reducing distance computations ("y"/"n" marks in the figure).
[Figure: the example tree annotated with pruning decisions.]
Other Constraints
- Pivot-pivot: from d(q,p1), the pre-computed d(p1,p2), and d(o,p2) ∈ [r_l, r_h], bounds on d(q,o) follow without accessing o.
- Double-pivot: for an object o on the opposite side of the equidistant line between p1 and p2, a lower bound holds:
max{ ( d(q,p1) − d(q,p2) ) / 2, 0 } ≤ d(q,o)
[Figure: the equidistant line between pivots p1 and p2.]
Pivot Filtering
With several pivots and pre-computed distances d(o,p), an object o can be discarded for a range query R(q,r) whenever |d(q,p) − d(p,o)| > r for some pivot p.
[Figure: objects o1, o2, o3 and the ring around pivots p1, p2 that survives the filter.]
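A LAESA-style sketch of pivot filtering for a range query; the data, pivots, and 1-D metric are illustrative:

```python
def pivot_filter_range(X, pivots, stored, q, r, d):
    """Range search with pivot filtering: stored[x][i] holds the
    pre-computed d(pivots[i], x).  An object is discarded whenever
    |d(q,p_i) - d(p_i,x)| > r for some pivot; survivors are verified."""
    d_qp = [d(q, p) for p in pivots]
    result, computed = [], 0
    for x in X:
        if any(abs(d_qp[i] - stored[x][i]) > r for i in range(len(pivots))):
            continue                      # filtered out, d(q,x) never computed
        computed += 1
        if d(q, x) <= r:
            result.append(x)
    return result, computed

d1 = lambda a, b: abs(a - b)
X = [1, 4, 6, 9]
pivots = [0, 10]
stored = {x: [d1(p, x) for p in pivots] for x in X}
res, computed = pivot_filter_range(X, pivots, stored, q=5, r=1.5, d=d1)
print(res, computed)   # [4, 6] 2  (objects 1 and 9 were filtered for free)
```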
[Figure: the example tree once more, with "Distance computed?" marks showing which objects still require an explicit distance evaluation after filtering.]
Choosing Pivots
[Figures: influence of the pivot position (pe at the edge, po elsewhere) on ball partitioning with radius dm; distance distributions for a pivot at the center versus at a corner of the space.]
Good pivots are far away from other objects in the metric
space.
Good pivots are far away from each other.
The best pivot is the query object itself.
Metric Transformations
Purpose:
- User-defined search functions
- Metric space embedding
Given M1 = (D1, d1) and M2 = (D2, d2), a function f : D1 → D2 maps one domain into the other.
A function d1 lower-bounds d2 when ∀o, o′ ∈ D : d1(o, o′) ≤ d2(o, o′).
User-defined search functions: e.g., different matrices for color-histogram distances; users may not be able to formulate their preferences in advance (data-mining systems).
A base function d_p lower-bounds the user functions d_b and d_u, so searching with d_p is possible because:
∀o1, o2 ∈ D : d_p(o1, o2) ≤ d_b(o1, o2)
∀o1, o2 ∈ D : d_p(o1, o2) ≤ d_u(o1, o2)
d_p is applied during the search, and a single index structure built on d_p can be exploited for any such user function.
Drawbacks: the candidate set contains false positives, which must be discarded by re-evaluating the user's function.
Embedding Examples
Choose pivots V = {v1, v2, …, vn} and map each object to the vector of its distances from the pivots; the transformation is contractive.
Such transformations are used in the FastMap technique [FL95] and in MetricMap [WWL+00].
Survey of existing indexes:
- Ball-partitioning methods
- Precomputed distances: AESA, Spaghettis
- Hybrid methods: M-tree, D-index
- Others
BKT (Burkhard–Keller Tree): each node holds a pivot pj; child i contains the objects at distance i from pj (subsets X2, X3, X4, …).
Range search:
- Report pj on output if d(q,pj) ≤ r
- Enter child i if max{ d(q,pj) − r, 0 } ≤ i ≤ d(q,pj) + r
[Figure: query ball against pivots p1, p2, p3 with distances 3 and 5.]
FQT (Fixed Queries Tree): all nodes of one level share the same pivot (p1, then p2); objects are stored in leaves only.
FHFQT (Fixed-Height FQT): all leaves are at the same depth.
[Figure: example trees with branch distances 0, 2, 3, 4.]
VPT (Vantage Point Tree): ball partitioning applied recursively; the root pivot p1 with median radius m1 splits the data, p2 with m2 splits further, producing subsets S1,1, S1,2, ….
[Figure: example two-level VPT.]
VPT range search at a node with pivot pi and median mi:
- if d(q,pi) ≤ r: report pi on output
- if d(q,pi) − r ≤ mi: search the left sub-tree (cases a, b)
- if d(q,pi) + r ≥ mi: search the right sub-tree (case b)
[Figure: (a) query ball inside the median ball; (b) ball crossing the boundary.]
k-NN search uses the same schema with the shrinking radius dNN (distance to the current k-th neighbor):
- if d(q,pi) ≤ dNN: update the answer
- if d(q,pi) − dNN ≤ mi: search the left sub-tree
- if d(q,pi) + dNN ≥ mi: search the right sub-tree
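The build and search rules above can be sketched as a tiny VPT; taking the first object as pivot and using a 1-D metric are simplifications of mine:

```python
import statistics

class VPNode:
    def __init__(self, pivot, m, left, right):
        self.pivot, self.m, self.left, self.right = pivot, m, left, right

def build_vpt(X, d):
    """Build a vantage point tree: the first object becomes the pivot,
    the rest are split by the median distance m."""
    if not X:
        return None
    p, rest = X[0], X[1:]
    if not rest:
        return VPNode(p, 0.0, None, None)
    m = statistics.median(d(p, x) for x in rest)
    left = [x for x in rest if d(p, x) <= m]
    right = [x for x in rest if d(p, x) > m]
    return VPNode(p, m, build_vpt(left, d), build_vpt(right, d))

def vpt_range(node, q, r, d, out):
    """Range search applying the VPT pruning conditions."""
    if node is None:
        return out
    dq = d(q, node.pivot)
    if dq <= r:
        out.append(node.pivot)               # report the pivot
    if dq - r <= node.m:
        vpt_range(node.left, q, r, d, out)   # ball reaches inside the median
    if dq + r >= node.m:
        vpt_range(node.right, q, r, d, out)  # ball reaches outside the median
    return out

d1 = lambda a, b: abs(a - b)
tree = build_vpt([5, 1, 3, 8, 9], d1)
print(sorted(vpt_range(tree, 2, 1.5, d1, [])))  # [1, 3]
```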
mw-VPT (multi-way VPT): pivot p1 with several radii splits the data into S1,1, S1,2, S1,3, S1,4.
[Figure: four-way ball partitioning.]
VPF (Vantage Point Forest): excluded-middle partitioning with pivot pi and middle radius mi.
VPF (cont.): the excluded middle objects are reorganized into further trees, giving a forest M1 + M2 + M3; each tree uses its own pivots p1, p2, p3.
[Figure.]
BT (Bisector Tree): generalized hyperplane partitioning with pivots p1, p2:
S1 = { o ∈ X : d(o,p1) ≤ d(o,p2) }
S2 = { o ∈ X : d(o,p1) > d(o,p2) }
Covering radii r1c and r2c (maximum distance from each pivot to the objects of its set) are established.
Range search:
- Report px on output if d(q,px) ≤ r
- Enter a child of px if d(q,px) − r ≤ rxc
[Figure: pivots pi, pj with covering radii ric, rjc and the query ball (q, r).]
Extensions of BT
Monotonous BT [NVZ92b, NVZ92a]: one pivot of each internal node is inherited from its parent.
[Figure: example tree with pivots p1…p6.]
GHT (Generalized Hyperplane Tree) range search with pivots pi, pj:
- Report px on output if d(q,px) ≤ r
- Enter the left child if d(q,pi) − r ≤ d(q,pj) + r
- Enter the right child if d(q,pi) + r ≥ d(q,pj) − r
AESA [Vid86, Vid94]
The complete matrix of pairwise distances is pre-computed:

      o1   o2   o3   o4   o5   o6
o1   0.0  1.6  2.0  3.5  1.6  3.6
o2   1.6  0.0  1.0  2.6  2.6  4.2
o3   2.0  1.0  0.0  1.6  2.1  3.5
o4   3.5  2.6  1.6  0.0  3.0  3.4
o5   1.6  2.6  2.1  3.0  0.0  2.0
o6   3.6  4.2  3.5  3.4  2.0  0.0
Search: an arbitrary object (here o2) serves as the first pivot p; each stored distance d(p,oi) filters objects with |d(q,p) − d(p,oi)| > r, and the process repeats with a surviving object as the next pivot.
[Figure: query ball (q, r) and the filtering ring around o2 = p.]
LAESA: only the distances from every object to a fixed set of pivots (here o2 and o3) are pre-computed, giving an n × k matrix.
Search: proceeds as in AESA, but only pivots from the fixed set can be used for filtering.
Extensions of LAESA
- Spaghettis [CMB99]: each pivot's distances are kept sorted, so the query interval [d(q,p) − r, d(q,p) + r] is located by binary search
- Shapiro [Sha77]
Sorted distance arrays for pivots o2 and o6:
pivot o2: o3 1.0, o1 1.6, o4 2.6, o5 2.6, o6 4.2
pivot o6: o5 2.0, o4 3.4, o3 3.5, o1 3.6, o2 4.2
Hybrid approaches
Hybrid approaches combine partitioning and precomputed distances into a single system.
MVPT (Multi Vantage Point Tree): each internal node applies ball partitioning with two pivots (here o1 and o2), producing four partitions per node; leaves store the objects together with pre-computed object-to-pivot distances.
[Figure: example MVPT over objects o1…o15.]
Pivot p2 is shared by both second-level nodes, each with its own radius (dm2, dm3); the first level uses pivot p1 with radius dm1, producing partitions S1…S4.
[Figure.]
Example: pre-computed distances from objects o1…o6 to pivots p1, p2:

      o1   o2   o3   o4   o5   o6
p1   1.6  4.1  1.0  2.6  2.6  3.3
p2   3.6  3.4  3.5  3.4  2.0  2.5

[Figure: positions of the objects and pivots.]
Voronoi-based Approaches
Other Hybrids
M-tree family
M-tree [CPZ97]
Slim Tree [TTSF00]
Pivoting M-tree [Sko04]
DBM-tree [VTCT04]
M+-tree (BM+-tree) [ZWYY03, ZWZY05]
M2-tree [CP00a]
D-index [DGSZ03]
M-tree: Example
Each entry stores a pivot, its covering radius, and the distance to the parent pivot (e.g., root entries o1 4.5 and o2 6.9; internal entries such as o7 1.3 3.8; leaf entries keep their distance to the leaf pivot).
[Figure: M-tree over objects o1…o11 with nested covering balls.]
M-tree: Insert
Insert a new object oN:
- Recursively descend the tree to locate the most suitable leaf for oN.
- In each step, enter the subtree with pivot p whose region needs no enlargement, i.e., d(oN,p) ≤ rc; among several candidates, the one with minimal d(oN,p).
- If no such subtree exists, enter the one whose covering radius grows the least.
- An overflowing node is split.
M-tree: Split
Split(N, oN):
- Let S be the set containing all entries of N plus oN
- Select pivots p1 and p2 from S
- Partition S into S1 and S2 according to p1 and p2
- Store S1 in N and S2 in a newly allocated node N′
- If N is the root, create a new root with entries for p1 and p2; otherwise promote p1 and p2 to the parent node (which may split in turn)
[Figures: split policies, distributing the entries between p1 and p2 by generalized hyperplane partitioning, and the resulting covering radii rc.]
M-tree Family
- Bulk-Loading Algorithm
- DBM-tree [VTCT04]
- M+-tree [ZWYY03]
- BM+-tree [ZWZY05]: uses two key dimensions
- M2-tree [CP00a]
Hybrid structure
D-index: Structure
4 separable buckets at
the first level
2 separable buckets at
the second level
exclusion bucket of
the whole structure
D-index: Partitioning
Based on excluded-middle ball partitioning; the split function bps:
bps1,ρ(x) = 0 if d(x,p) ≤ dm − ρ  (separable set 0)
bps1,ρ(x) = 1 if d(x,p) > dm + ρ  (separable set 1)
bps1,ρ(x) = −  otherwise          (exclusion set)
[Figure: the two separable sets and the 2ρ-wide exclusion ring at radius dm.]
[Figure: combining two split functions (radii dm1, dm2, width 2ρ) yields separable sets 1–3 and a single exclusion set.]
Combining n binary split functions:
bps^{n,ρ} : D → { 0 .. 2^n − 1, − }
bps^{n,ρ}(x) = −, if ∃i : bps_i^{1,ρ}(x) = −
bps^{n,ρ}(x) = b, otherwise, where the values bps_i^{1,ρ}(x) form the binary number b
D-index: Insertion
h: number of levels; mi: number of binary functions combined on level i.
The object descends the levels until it falls into a separable bucket; this costs at most Σ_{i=1}^{h} mi distance computations.
Range search is governed by the same parameters h and mi.
[Figures: positions of the range query ball (q, r) with respect to the separable sets and the exclusion zone; a ball that fits entirely inside one separable bucket requires accessing only that bucket.]
D-index: Features
Supports disk storage
Insertion needs one bucket access
Performance Trials
Comparison of Indexes: comparing the performance of the presented structures.
[Experimental plots omitted.]
Possible solutions:
Sacrifice some precision: approximate techniques
Use more storage & computational power:
distributed data structures
Approximation strategies:
- Space transformation
- Early termination and relaxed branching: a search algorithm might stop before all the needed data has been accessed.
Range query example:
- Access B1: report o1. If early termination stopped now, we would lose objects.
- Access B2: report o4, o5. If early termination stopped now, we would not lose anything.
- Access B3: nothing to report. A relaxed branching strategy may discard this region; we don't lose anything.
[Figure: buckets B1, B2, B3 with objects o1…o11 and the query ball.]
Nearest-neighbor example:
- Access B1: neighbors o1, o3. If early termination stopped now, we would lose objects.
- Access B2: neighbors o4, o5. If early termination stopped now, we would not lose anything.
- Access B3: no improvement; a relaxed branching strategy may skip it.
[Figure: the same buckets with the current neighbor radius r.]
Relative error approximation
Exact pruning discards a region (p, rp) when d(q,p) > rq + rp.
With relative error ε, the region is discarded already when
rq / ( d(q,p) − rp ) < 1 + ε,
which is equivalent to shrinking the query radius to rq / (1 + ε).
[Figure: query ball rq, shrunk ball rq/(1+ε), and data region (p, rp).]
[Plot: the distance distribution Fq(x), i.e., the fraction of the dataset whose distances from q are smaller than x; the current k-th neighbor ok is marked at d(q,ok) on the distance axis (0–10,000).]
[Plot: distance of the current nearest neighbor versus iteration of the search (distance 0.2–0.38 over about 1,500 iterations).]
Regression Curves
[Plot: the measured distance-versus-iteration curve together with hyperbolic and logarithmic regression fits.]
Proximity-based Approximation [Ama02, ARSZ03]
A region is discarded when the proximity of the query region and the data region (the probability that they share qualifying objects) falls below a threshold.
[Figure: overlapping query and data regions.]
Measures of Performance
- Improvement in efficiency
- Accuracy of approximate results
High improvement in efficiency is obtained at the expense of accuracy of the results.

Improvement in Efficiency
IE = Cost(Q) / Cost^A(Q)
where Cost(Q) is the cost of the precise evaluation of query Q and Cost^A(Q) the cost of its approximate evaluation.

Recall
Recall (R): percentage of qualifying objects that are retrieved by the approximate algorithm:
R = |S ∩ S^A| / |S|
where S is the precise answer and S^A the approximate one.
Error on Position
EP = ( Σ_{i=1}^{|S^A|} ( OX(o_i) − i ) ) / ( |S^A| · |X| )
where OX(o_i) is the position of o_i in the exact ordering of the whole dataset X with respect to q, and i is its position in the approximate answer S^A.
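The three measures can be computed directly; the toy query results below are mine:

```python
def improvement_in_efficiency(cost_exact, cost_approx):
    """IE = Cost(Q) / Cost^A(Q)."""
    return cost_exact / cost_approx

def recall(S, SA):
    """R = |S intersect S^A| / |S|."""
    return len(set(S) & set(SA)) / len(S)

def error_on_position(SA, exact_order):
    """EP: displacement of approximate results within the exact ranking
    of the whole dataset X, normalized by |S^A| * |X|."""
    pos = {o: i for i, o in enumerate(exact_order, 1)}   # exact positions OX(o)
    return sum(pos[o] - i for i, o in enumerate(SA, 1)) / (len(SA) * len(exact_order))

exact_order = ["a", "b", "c", "d", "e"]        # X ranked by distance from q
S = ["a", "b", "c"]                            # exact 3-NN answer
SA = ["a", "b", "d"]                           # approximate 3-NN answer
print(improvement_in_efficiency(1000, 200))    # 5.0
print(recall(S, SA))                           # 0.666...
print(error_on_position(SA, exact_order))      # (0+0+1)/(3*5) = 0.0666...
```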
Comparison
Tests on the VEC dataset of 11,100 objects; the relative-error and proximity-based methods are compared with the other approaches.
[Plots: improvement in efficiency IE versus recall R for range queries with r = 1,800; 2,200; 2,600; 3,000 (IE up to 2 for one method, up to 4 for another).]
Comparison: NN Queries
[Plot: IE versus error on position EP for k = 1, 3, 10, 50 (relative-error method, IE up to 1.6).]
[Plots: IE versus EP for k = 1, 3, 10, 50 for the remaining methods; IE reaches several hundred.]
[Plot: PAC method, IE versus EP for eps = 2, 3, 4; IE up to 500.]
Conclusion
These techniques for approximate similarity
searching can be applied to generic metric spaces.
Implementation Postulates of Distributed Indexes
Distributed Similarity Search Structures
Transformation approaches: either the metric space is indexed natively (GHT*, VPT*) or it is transformed first and an existing distributed structure is reused (MCAN, M-Chord).
GHT* Architecture
- Peers: the participating computers; they issue queries and store data
- Buckets: the storage areas for data objects
[Figure: the space partitioned by pivots p1…p6 among peers.]
Each peer maintains an Address Search Tree (AST): inner nodes hold pivot pairs (generalized hyperplane partitioning); a leaf holds either the identifier of a local bucket (BID) or of another peer (NNID).
[Figure: AST of Peer 2 with leaves BID1, BID2, BID3 and NNID2.]
Range search: the querying peer evaluates R(q,r) on its AST; BID leaves are searched in local buckets, while NNID leaves forward the query to the corresponding peers (here Peer 2 and Peer 3).
[Figure: the query ball against the partitioning and the AST traversal.]
Deleted sub-trees are replaced by the NNID of the leftmost peer of the sub-tree.
[Figures: the full tree with pivots p1…p14 and bucket/peer leaves, and the resulting pruned tree.]
VPT* Structure
As GHT*, but the AST applies ball partitioning: each inner node stores a pivot with a radius, p1(r1), p2(r2), p3(r3).
[Figure: nested balls.]
MCAN: The Idea
1. Map the metric space to a vector space: given N pivots p1, p2, …, pN, transform every object o into the vector F(o) of its distances from the pivots.
2. Use CAN to distribute the vector-space zones among the nodes and to navigate in the network.
Greedy routing: in every step, move to the neighbor closer to the target location.
[Figure: CAN zones with a route to the target point (x,y).]
The mapping is contractive: L∞(F(x), F(y)) ≤ d(x, y).
Additional filtering: some pivots are used only for filtering data inside the explored nodes; they are not used for the mapping into the CAN vector space.
Range search R(q,r):
- map q onto F(q) and route the query towards F(q)
- all nodes whose zones intersect the region L∞(F(x), F(q)) ≤ r must be examined
M-Chord
iDistance maps a vector x to the one-dimensional key iDist(x) = d(p_i, x) + i·c, where p_i is the pivot closest to x and c is a separation constant; the principle generalizes directly to metric spaces.
M-Chord places the keys on a Chord ring through an order-preserving hash h:
m-chord(x) = h( d(p_i, x) + i·c )
so each node manages an interval of the generalized iDistance domain.
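A sketch of the iDistance key and an illustrative placement on a ring; the linear scaling stands in for the order-preserving hash h, and the pivots and constant c are example values of mine:

```python
def idist(x, pivots, c, d):
    """iDistance key: d(p_i, x) + i*c, with p_i the pivot closest to x."""
    i, p = min(enumerate(pivots), key=lambda ip: d(ip[1], x))
    return d(p, x) + i * c

def mchord_key(x, pivots, c, d, ring=2**16):
    """M-Chord-style key: an illustrative order-preserving mapping of the
    iDistance value onto a ring of `ring` positions (assumes keys in
    [0, len(pivots)*c))."""
    return int(idist(x, pivots, c, d) / (len(pivots) * c) * ring)

d1 = lambda a, b: abs(a - b)
pivots, c = [0, 100], 50          # c must exceed any cluster radius
print(idist(20, pivots, c, d1))   # closest pivot 0  -> 20 + 0*50 = 20
print(idist(90, pivots, c, d1))   # closest pivot 100 -> 10 + 1*50 = 60
```

Because the mapping is order-preserving per pivot, a range query translates into at most one key interval per pivot cluster on the ring.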
Performance evaluation: computing infrastructure and datasets.
[Plots: distance distribution histograms (frequency versus distance) for the three datasets, including TTL (distances up to 200) and DNA (distances up to 80).]
Measures:
[Plots omitted.]
GHT* visits the most peers for each query; thus it has the lowest inter-query improvement.
Concluding Summary

            single query    multiple queries
GHT*        excellent       poor
VPT*        good            satisfactory
MCAN        satisfactory    good
M-Chord     satisfactory    very good
https://sysrun.haifa.il.ibm.com/sapir/news.html
Partners:
Conference series on
similarity searching
Yair Bartal
Christian Bohm
Edgar Chavez
Paolo Ciaccia
Christos Faloutsos
Piotr Indyk
Daniel Keim
Gonzalo Navarro
Marco Patella
Hanan Samet
Enrique Vidal
Peter Yianilos
Pavel Zezula
Acknowledgements
Partially supported by