
Revisiting M-tree Building Principles

Tomáš Skopal¹, Jaroslav Pokorný², Michal Krátký¹, Václav Snášel¹

¹ Department of Computer Science, VŠB-Technical University of Ostrava, Czech Republic
² Department of Software Engineering, Charles University in Prague, Czech Republic

ADBIS 2003

Presentation Outline
- Metric Indexing
- M-tree
  - basic concepts
  - motivation for the M-tree revision
  - fat-factor
  - multi-way insertion
  - slim-down algorithm
- Experimental results
- Conclusions

Multimedia Indexing
Reasons for indexing multimedia databases:
- implementing a mechanism for querying
- fast retrieval

Vector model:
multimedia document → feature vector
Oi = (251, 250, 251, 251, 249, ...)

Multimedia Indexing, cont.


Every indexing model must follow a retrieval semantics. The multimedia indexing model must support similarity queries. Two types of similarity queries:
- range queries: return the documents whose similarity to the query exceeds a given threshold
- k nearest neighbour (k-NN) queries: return the k most similar documents
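The two query types can be illustrated without any index at all. The sketch below is a brute-force baseline, assuming feature vectors are tuples of floats and that similarity is expressed through the Euclidean distance (a smaller distance means a higher similarity); the function names are illustrative only.

```python
import heapq
import math

def euclidean(a, b):
    """L2 distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(objects, query, radius, dist=euclidean):
    """Return all objects whose distance to the query is at most radius."""
    return [o for o in objects if dist(o, query) <= radius]

def knn_query(objects, query, k, dist=euclidean):
    """Return the k objects closest to the query, nearest first."""
    return heapq.nsmallest(k, objects, key=lambda o: dist(o, query))

# Hypothetical 2-dimensional feature vectors.
database = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
in_range = range_query(database, (0.0, 0.0), 1.5)
nearest = knn_query(database, (0.0, 0.0), 2)
```

Both functions scan every object, which is exactly the cost an index is meant to avoid.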

Metric Indexing
Feature vectors are indexed according to their mutual distances. As a dissimilarity measure, a distance function d(Oi,Oj) is specified such that the metric axioms are satisfied:
- d(Oi,Oi) = 0 (reflexivity)
- d(Oi,Oj) > 0 for Oi ≠ Oj (positivity)
- d(Oi,Oj) = d(Oj,Oi) (symmetry)
- d(Oi,Ok) + d(Ok,Oj) ≥ d(Oi,Oj) (triangular inequality)
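The axioms can be spot-checked on a sample of objects. The following is a minimal sketch with a hypothetical helper name; a passing check on a sample does not prove that a function is a metric, but a failing check does refute it. The squared Euclidean distance is used here as an example of a function that violates the triangular inequality.

```python
import itertools
import math

def euclidean(a, b):
    """L2 distance, a proper metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    """Squared L2 distance, which breaks the triangular inequality."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def satisfies_metric_axioms(dist, sample, tol=1e-9):
    """Check the four metric axioms on every pair and triple of the sample."""
    for a in sample:
        if abs(dist(a, a)) > tol:                      # reflexivity
            return False
    for a, b in itertools.permutations(sample, 2):
        if a != b and dist(a, b) <= 0:                 # positivity
            return False
        if abs(dist(a, b) - dist(b, a)) > tol:         # symmetry
            return False
    for a, b, c in itertools.permutations(sample, 3):
        if dist(a, c) + dist(c, b) < dist(a, b) - tol:  # triangular inequality
            return False
    return True

sample = [(0.0,), (1.0,), (2.0,)]
euclidean_ok = satisfies_metric_axioms(euclidean, sample)
squared_ok = satisfies_metric_axioms(squared_euclidean, sample)
```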

Metric structures:
- main-memory structures: metric tree, vp-tree, mvp-tree
- persistent structures: M-tree, Slim-tree (a modification of the M-tree)

M-tree at a glance
- indexes objects of a general metric space (not only vector spaces)
- to date the only persistent and balanced metric tree
- does not use dimensions directly, only the distances between objects (any vector coordinates are handled by the user-defined metric)
- the correctness of the M-tree hierarchy is guaranteed by the triangular inequality axiom of d; the hierarchy consists of nested metric regions
- resists the curse of dimensionality better (depending on the metric)
- the hierarchy of nodes allows similarity queries to be implemented natively

Structure of the M-tree


The M-tree nodes contain items of two types:
- ground objects in leaves, representing the data objects
- routing objects in inner nodes, representing the metric regions

[Figure: example M-tree hierarchy with routing entries rout0(Op), rout1(Oi), ... in inner nodes and ground entries grnd(Oi), grnd(Oj), ... in leaves]
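The two item types can be sketched as plain data classes. This is an illustrative in-memory layout, not the persistent page format of the M-tree; the class and field names are assumptions, with `radius` standing for the covering radius of the metric region that a routing object represents.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

Vector = Tuple[float, ...]

@dataclass
class GroundEntry:
    """A data object stored in a leaf."""
    obj: Vector

@dataclass
class RoutingEntry:
    """A metric region: a centre object, its covering radius, and the subtree it covers."""
    obj: Vector
    radius: float
    child: "Node"

@dataclass
class Node:
    """A leaf holds ground entries; an inner node holds routing entries."""
    is_leaf: bool
    entries: List[Union[GroundEntry, RoutingEntry]] = field(default_factory=list)

# A two-level tree: one leaf with two data objects under a single routing entry.
leaf = Node(is_leaf=True, entries=[GroundEntry((1.0, 1.0)), GroundEntry((2.0, 1.0))])
root = Node(is_leaf=False, entries=[RoutingEntry((1.0, 1.0), 1.0, leaf)])
```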

Similarity queries in the M-tree


A range query is specified by a query object Oq and a query radius rq.
A k-NN query is based on a modified range query (using a dynamic radius) and a priority queue.

During range query evaluation, the M-tree is traversed in LIFO order and only the relevant (i.e. intersecting) metric regions (and their nodes, respectively) are processed further.


[Figure: range query evaluation on the example M-tree]
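The LIFO traversal with region pruning can be sketched as follows. The node layout (dictionaries with `leaf`, `objects`, and `entries` keys) is an assumption made for brevity. A region is treated as intersecting the query when the distance between the routing object and Oq does not exceed the sum of the covering radius and rq.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(root, query, radius, dist=euclidean):
    """Collect all ground objects within radius of query, visiting only intersecting regions."""
    hits = []
    stack = [root]                      # LIFO: the most recently pushed node is processed first
    while stack:
        node = stack.pop()
        if node["leaf"]:
            hits.extend(o for o in node["objects"] if dist(o, query) <= radius)
        else:
            for routing_obj, covering_radius, child in node["entries"]:
                # The query ball and the metric region intersect only if this holds.
                if dist(routing_obj, query) <= covering_radius + radius:
                    stack.append(child)
    return hits

# Hypothetical two-leaf tree.
leaf_a = {"leaf": True, "objects": [(0.0, 0.0), (1.0, 0.0)]}
leaf_b = {"leaf": True, "objects": [(10.0, 10.0), (11.0, 10.0)]}
root = {"leaf": False,
        "entries": [((0.0, 0.0), 1.0, leaf_a), ((10.0, 10.0), 1.0, leaf_b)]}
result = range_query(root, (0.5, 0.0), 0.6)
```

In this example the region around (10.0, 10.0) does not intersect the query ball, so `leaf_b` is never visited.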

M-tree, revision motivation


Internet-based applications:
- huge public multimedia databases (web pages, digital libraries, etc.)
- millions of users
  - thousands of users query at any given moment

Which means:
- the focus should be on higher retrieval efficiency
- the building costs can increase
  - index updates are much less frequent than querying

M-tree, fat-factor
The retrieval efficiency of an M-tree is strongly affected by the amount of overlap among the metric regions on a level of the M-tree. Thus, there is a need to minimize the volume of metric regions.

Since volume does not exist in general metric spaces, the amount of overlap in the M-tree can be measured by the fat-factor, introduced for Slim-trees.

The fat-factor is a proportion of the disk accesses needed for processing point queries for all the ground objects. The fat-factor lies in the interval [0, 1] (0 is the best, 1 the worst).

[Figure: the example M-tree with overlapping metric regions]
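As a rough illustration, the fat-factor can be computed from four counts. The sketch below assumes the absolute fat-factor as it is usually stated for Slim-trees, fat(T) = (Ic - h·N) / N · 1 / (M - h), where Ic is the total number of node accesses needed to answer a point query for each of the N ground objects, h is the tree height, and M is the number of nodes. The function name and the handling of the degenerate case M = h are assumptions.

```python
def fat_factor(total_accesses, num_objects, height, num_nodes):
    """Absolute fat-factor: 0 means no overlap, 1 means every query visits every node."""
    if num_nodes == height:
        # A tree that is a single path cannot have overlapping siblings.
        return 0.0
    excess_per_query = (total_accesses - height * num_objects) / num_objects
    return excess_per_query / (num_nodes - height)

# Hypothetical tree: 4 objects, height 2, 3 nodes (a root and two leaves).
best = fat_factor(total_accesses=8, num_objects=4, height=2, num_nodes=3)
worst = fat_factor(total_accesses=12, num_objects=4, height=2, num_nodes=3)
```

In the best case every point query follows a single root-to-leaf path (2 accesses per object), and in the worst case every query visits all 3 nodes.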

M-tree, single-way insertion


During an object insertion, only a single subtree is processed further on the current level of the M-tree. A heuristic criterion: the node is chosen that spatially contains the inserted object and/or whose routing object is the nearest. O(log(n)) complexity.
[Figure: single-way insertion of Onew into the example M-tree]
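The descent along a single path can be sketched as below. Only the leaf selection is shown; radius updates and node splits are left out. The tie-breaking rule is an assumption in the spirit of the original M-tree heuristics: among regions that already contain the new object the nearest routing object wins, otherwise the region needing the smallest radius enlargement is taken. The dictionary-based node layout is also an assumption.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def choose_subtree(entries, new_obj, dist=euclidean):
    """Pick one child among (routing_obj, radius, child) entries."""
    scored = [(dist(r, new_obj), radius, child) for r, radius, child in entries]
    covering = [s for s in scored if s[0] <= s[1]]
    if covering:
        # Some regions already contain the object: take the nearest routing object.
        return min(covering, key=lambda s: s[0])[2]
    # Otherwise take the region whose radius would have to grow the least.
    return min(scored, key=lambda s: s[0] - s[1])[2]

def find_leaf_single_way(root, new_obj, dist=euclidean):
    """Follow exactly one subtree per level until a leaf is reached."""
    node = root
    while not node["leaf"]:
        node = choose_subtree(node["entries"], new_obj, dist)
    return node

# Hypothetical tree with two overlapping regions.
leaf_a = {"leaf": True, "objects": [(0.0, 0.0)]}
leaf_b = {"leaf": True, "objects": [(4.0, 0.0)]}
root = {"leaf": False,
        "entries": [((0.0, 0.0), 3.0, leaf_a), ((4.0, 0.0), 3.0, leaf_b)]}
target = find_leaf_single_way(root, (2.5, 0.0))
```

Because one child is followed per level, the number of visited nodes equals the tree height, which matches the logarithmic complexity stated above.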

M-tree, multi-way insertion


During an object insertion, a point query for the inserted object is executed and the routing objects of all relevant non-full leaves are checked. The nearest one is chosen. If no such leaf exists, the single-way insertion is performed. O(n) complexity.
[Figure: multi-way insertion of Onew into the example M-tree]
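The leaf selection of the multi-way insertion can be sketched as a point query that collects candidate leaves. The node layout, the `capacity` parameter, and the convention of returning `None` when the caller should fall back to the single-way insertion are assumptions of this sketch.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def candidate_leaves(node, new_obj, dist, parent_dist=0.0):
    """Yield (distance to the leaf's routing object, leaf) for each leaf hit by a point query."""
    if node["leaf"]:
        yield parent_dist, node
        return
    for routing_obj, radius, child in node["entries"]:
        d = dist(routing_obj, new_obj)
        if d <= radius:                 # the region contains the new object
            yield from candidate_leaves(child, new_obj, dist, d)

def choose_leaf_multi_way(root, new_obj, capacity, dist=euclidean):
    """Return the non-full candidate leaf with the nearest routing object, or None."""
    candidates = [(d, leaf) for d, leaf in candidate_leaves(root, new_obj, dist)
                  if len(leaf["objects"]) < capacity]
    if not candidates:
        return None                     # caller falls back to single-way insertion
    return min(candidates, key=lambda c: c[0])[1]

# Hypothetical tree: both regions contain (1.0, 0.0), but leaf_a is full at capacity 2.
leaf_a = {"leaf": True, "objects": [(0.0, 0.0), (0.5, 0.0)]}
leaf_b = {"leaf": True, "objects": [(4.0, 0.0)]}
root = {"leaf": False,
        "entries": [((0.0, 0.0), 3.0, leaf_a), ((4.0, 0.0), 3.0, leaf_b)]}
chosen_when_a_full = choose_leaf_multi_way(root, (1.0, 0.0), capacity=2)
chosen_with_room = choose_leaf_multi_way(root, (1.0, 0.0), capacity=3)
```

Unlike the single-way descent, the point query may visit several subtrees per level, which is where the higher insertion cost comes from.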

M-tree, slim-down algorithm


A post-processing method inspired by Slim-trees, reducing the fat-factor of an M-tree.

In principle, it is a redistribution of ground objects as well as routing objects within an existing M-tree. The redistribution is level-based, i.e. for each object, the algorithm tries to find the best node on the same level (using an algorithm similar to a point query). If such a node is found, the object is moved and the original node can be slimmed. The slimming starts at the leaf level and continues upwards through the higher levels.

Advantages:
- a significant reduction of the fat-factor as well as of the retrieval costs
- a stable algorithm (it can be interrupted at any time and then resumed or restarted)
- it does not directly increase the insertion costs, e.g. it can run in idle time

Disadvantages:

Slim-down algorithm, example


Let's have a correct but poorly built M-tree. The regions are highly overlapping and the fat-factor is high. Two ground objects can be moved to more appropriate leaves.

[Figure: slimming the leaf level (Level 0)]

Slim-down algorithm, example


The radii of the metric regions representing the leaves as well as the higher-level nodes have been reduced. The fat-factor is now lower as well. Two level-1 routing objects can be moved to more appropriate level-1 nodes.

[Figure: slimming Level 1]

Slim-down algorithm, example


Again, the radii of the metric regions as well as the fat-factor have been reduced. The root level cannot be slimmed because no parent node exists.

[Figure: the slimmed M-tree]
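One slimming pass over a single level can be sketched as follows. This is a simplified illustration rather than the algorithm from the paper: the nodes of one level are given as a flat list, an object is moved only to a non-full node whose region already contains it and whose routing object is strictly nearer, and radii are recomputed once at the end. Propagating the reduced radii to parent nodes and repeating the pass on higher levels are omitted. All names are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def slim_down_level(nodes, capacity, dist=euclidean):
    """Move objects to better nodes on the same level; return the number of moves."""
    moved = 0
    for node in nodes:
        for obj in list(node["objects"]):       # copy: the list is modified while iterating
            best, best_d = None, dist(node["center"], obj)
            for other in nodes:
                if other is node or len(other["objects"]) >= capacity:
                    continue
                d = dist(other["center"], obj)
                # The other region must already contain the object and be strictly nearer.
                if d <= other["radius"] and d < best_d:
                    best, best_d = other, d
            if best is not None:
                node["objects"].remove(obj)
                best["objects"].append(obj)
                moved += 1
    for node in nodes:
        # Slim each region down to the farthest object it still holds.
        node["radius"] = max((dist(node["center"], o) for o in node["objects"]),
                             default=0.0)
    return moved

# Hypothetical leaf level: (6.0, 0.0) sits in node_a but fits node_b much better.
node_a = {"center": (0.0, 0.0), "radius": 6.0,
          "objects": [(0.0, 0.0), (1.0, 0.0), (6.0, 0.0)]}
node_b = {"center": (7.0, 0.0), "radius": 2.0,
          "objects": [(7.0, 0.0), (8.0, 0.0)]}
moves = slim_down_level([node_a, node_b], capacity=4)
```

After the pass, the two regions no longer overlap, which is the effect shown in the example figures above.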

Experimental results
- synthetic datasets of clustered tuples
- metric used: L2 (Euclidean)
- dimensionality: 2 to 50
- number of tuples: 20,000 to 1,000,000
- index sizes: 1 to 400 MB
- node capacity: 20
- M-tree height: 3 to 5

Experiments were performed on an Intel Pentium 4 at 2.53 GHz with 512 MB of DDR333 memory, running Windows XP Professional.

Building Costs

Fat-factor, node utilization

Range queries costs

100-NN queries costs

Conclusions
- New M-tree building techniques were proposed, improving the M-tree retrieval efficiency.
- These techniques are especially beneficial for modeling query-intensive MDBMS scenarios.
- The multi-way insertion improves the M-tree retrieval efficiency by up to 50%.
- The slim-down algorithm improves the M-tree retrieval efficiency by up to 300%.

References
- T. Skopal, J. Pokorný, M. Krátký, V. Snášel. Revisiting M-tree Building Principles. ADBIS 2003, Dresden.
- P. Ciaccia, M. Patella, P. Zezula. M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. VLDB 1997, Athens.
- C. Traina Jr., A. Traina, B. Seeger, C. Faloutsos. Slim-Trees: High Performance Metric Trees Minimizing Overlap Between Nodes. LNCS 1777, 2000.
