Optimizing Keyword Queries in XML Tree Structure

Running Head: Optimizing Keyword Queries in XML tree structures
Optimizing Keyword Queries in XML Tree Structure

Name:
Instructor:
Institution:
Date
ABSTRACT
XML which stands for Extensible markup language remains to be the most
popular and frequently used format for representing and exchanging data in
the World Wide Web. Its application is wide based on the various different
data types and applications that exist. The data may take different forms
which may include unstructured heterogeneous, semi structured and
Optimizing Keyword Queries in XML Tree Structures
structured data types. XML has been a progressive language increasing its
functionalities with various inventions and researches to the level of
development of data streaming applications. These types of inventions have
received numerous significance and attention by many experienced users of
the web. These developments have led to the centralization of efficient
processing and querying of XML streams.
This study focuses on retrieving queries through a combination of structural
constraints which essentially use key words as a search tool to represent an
essential executable function in the XML systems of data management.
Various expectations are forecasted on and they are expected to yield best
case answers in an effective an efficient manner like the traditional key
search while factoring in the various additional constraints that may exist.
The definition of studying the new problem of top-k keyword query and
search over XML probabilistic data with the aim of retrieving k SLCA finding
where k has the highest existence capabilities. Finally the study is going to
preview various other forms of keyword searches using different forms and
make a comparison through the analysis of the algorithms that have been
used.
TABLE OF CONTENTS
1.0 INTRODUCTION.....................................................................................5
Problem Definition........................................................................................7
Proposal........................................................................................................7
2.0 OVERVIEW OF RELATED WORKS........................................................7
Cost-Based Query Optimization in RDBMSs..................................................8
Query Optimization Frameworks...................................................................9
XML Query Optimization...............................................................................9
Keyword Query in Ordinary XML Documents..............................................10
Probabilistic XML.........................................................................................11
XQuery Streaming Optimization.................................................................13
Querying XML streams.............................................................................13
XML temporal model................................................................................13
Time Intervals and Model............................................................................13
Mode Relationship Evaluation.....................................................................13
Probability...................................................................................................14
Join Probability............................................................................................15
Node Relationship.......................................................................................16
Dominance Lowest Common Ancestor (DLCA)...........................................18
Dominance relationship...........................................................................18
Dominance...............................................................................................20
3.0 MEASUREMENT OF THE RELATIONSHIP BETWEEN NODES IN A
DATA TREE...................................................................................................20
Mutual Information Concepts......................................................................21
Mutual information and entropy..............................................................21
Mutual information...................................................................................21
4.0 ANSWERS RETRIEVED FROM TOP-K................................................23
Dominating Score.......................................................................................23
Dominated Score........................................................................................23
Dominance Score........................................................................................24
5.0 ALGORITHMS USED TO RETRIEVE TOP-K RESULTS......................25
Nave Algorithm for Selection of Top-K Answers.........................................25
Top-K Dominated Algorithm (TKDD)............................................................27
Top-K Dominating Algorithm (TKDG)...........................................................29
6.0 EXPERIMENTAL EVALUATION...........................................................31
Experimental Setup....................................................................................31
Query Sets..................................................................................................31
Search Quality............................................................................................31
Efficiency and Scalability of Top-K Algorithms.............................................34
7.0 CONCLUSIONS....................................................................................35
REFERENCES................................................................................................35
LIST OF FIGURES
Figure 1: Probabilistic XML document [18].....................................................10

Figure 2: An example of an XML data tree structure, T [19].........................14
Figure 3: Data Tree T2 [19]..............................................................................18
Figure 4: List of candidates sorted in descending order using their M() values
[24]................................................................................................................28
Figure 5: Ranking efficiency of TKDD, TKDG and TKD algorithms [9]............34
LIST OF TABLES
Table 1: The join probability of the two specific nodes at context node:/paper
.......................................................................................................................15
Table 2: The join probability of the two specific nodes at context node:
/proceeding....................................................................................................15
Table 3: Two dimension set candidate data...................................................26
Table 4: Search spaces LA and LB derived from a list l of candidates............27
Table 5: Candidates list sorted in descending order of f() values..................27
Table 6: Dominance checks count used in calculating dominated candidate
scores............................................................................................................27
Table 7: Precision and recall of queries on mondial data...............................33
Table 8: Precision and recall of queries on auction data................................33
Table 9: Precision and recall of queries on dblp data.....................................33
Table 10: Comparisons on ranking effectiveness of the algorithms...............33
1.0INTRODUCTION
XML (Extensible Mark-up Language) has over the years evolved to become a
de facto standard used for the exchange and representation of data which
results into the distribution of proliferated XML documents which are spread
all over the internet. In the past, there are various query languages that were
used to retrieve xml documents and data. These included languages such as
XQuery and XPath. And twig pattern queries. These languages made it
essential for the users of the systems to be versed with the specific query
languages and the relevant data schemas so that they may be able to
execute the XZML queries efficiently [5]. This therefore limited the type of
users since the advanced users since the query languages and the data
schemas seemed to be complex concepts to understand. The data search
through XQuery/XPath languages therefore was a very big limiting factor.
The use of major keywords to search for documents has been widely
accepted as a very convenient way in retrieving resources from various
remote servers that hold that specific type of data on the internet. Majority
of the search engines such as Google, Bing, Yahoo and many more have
adopted the use of these technologies so as to efficiently facilitate the
process of data mining and data warehousing. The adaptation and use of
keywords for querying various databases has attracted various researches to
be conducted by the research community from the affected fields of
database and information retrieval (IR) [5][2][3]. This is a very efficient way
of facilitating the retrieval of documents because it does not involve the
learning of any concepts. This process is an advancement of the traditional
search algorithms that were specifically involving and required the mastering
of the particular IP (Internet Protocol) addresses of various documents or
information content and typing them in the URL bar. His later advanced and
the IP addresses were able to be attached to various web links. It is from this
that the search engines were developed with a more interactive and
responsive algorithm that was able to handle a lot of bits and pieces of
information including data mining [6]. A variety of approaches have been

accessed and previewed to find alternatives to the keyword queries as
opposed to the XML data. The basic approaches that currently exist use
lowest common ancestor (LCA) type of semantics as opposed to the common
graph theory for identification of the hit list given a certain keyword query.
This particular approach generates results composed of all candidates, also
known as sub trees, containing an instance of the queried keywords [5]. The
LCA returned values can be numerous yet the user may just be interested in
a portion or bit of the whole hit list. It therefore remains and unsolved issue
to be able to identify the exact dataset that is required by the user of the
system. The ideal situation and the best case scenario would be for the
system to be able to generate an exact piece that is required by the user as
opposed to providing a whole set of hits which also gives the user an extra
job to filter the content until they obtain an exact piece[1][2][3][4][5][6].
Various researches and studies have not been successful in the
implementation of exact retrieval of the search queries hence it remains an
unresolved and challenging issue.
Many proposals have been drafted on the basis of improving the baseline
approach precision. The application of heuristic-based rules in enhancing the
Tree Search for the best case scenario at the shortest time possible has been
the main foundation of majority of the proposals. These approach though it is
intuitive, it portrays characteristics of being ad-hoc in the sense that data
filtration takes place in majority of the operations. Candidate data sets that
meet the specified criterion are separated from those that do not meet what
the user seeks to find out. The results of these process of filtration is what is
shown as a result of the algorithmic computation [2][3][4][5]. However,
research and studies show that I as much as it is assumed that the results of
these algorithms yield best case results, they do not only miss out on the
false negatives (relevant results) but they also return false positives
(irrelevant results).
The results that received from XML queries can however be boosted and
made more reliable if the following considerations are addressed and made:
If the candidate hits can be measured in terms of relevance. Secondly, if
there can be a mechanism that can be able to ultra-filter from the relevant
candidate hits to produce a more specific list with finer details that closely
match the users search[5][3]. There third proposition is about the
positioning of the candidate hits in a descending order from the one that is
most probable to the least probable case among the best case scenario hits.
In this particular paper, we are going to focus on the optimization of keyword
queries in XML tree structures so as to yield a very efficient result.
Smallest (05767875) Lowest Common Ancestor (SLCA) is a model keyword

search outcome that is a widely accepted semantic on a deterministic XML
tree named T in our scenario. A specific node named v is therefore
considered SLCA if:
Node v is the root of the sub tree at Tsub(v) and it consists of all the keywords
There is no existence of a descendant node v` of the root node v in such a
way that Tsub(v`) consists of all the probable keywords. For instance, consider
a scenario of {k1, k2} being a keyword query on a certain document, pdocument as shown in the figure below. The particular SLCAs in this case are
{MUX3, IND3}. When the algorithm attempts to perform a keyword search
on the specific p-document, the following challenges are encountered as
being a cumulative example of a probabilistic XML document [5][2].
Problem Definition
Current keyword searches in XML can be divided into tree and graph
supported searches which are largely predicated on structural document
features. However, these approaches on structure do not comprehensively
utilize the hidden semantics within the XML documents leading to issues in
the processing of specific keyword query classes. The growing reputation of
XML has intensified the necessitation of an accessible and precise XML query
interface that is predicated on natural language and search procedures that
exploit XML structures to simplify queries by ordinary users within XML
databases. Conventional methodologies however process queries rely based
on ad hoc and intuitive heuristics which frequently regain false positive and
unranked answers.
Proposal
This paper systematically explores XML structure-based answers and user
expectations in order to identify the significance of XML keyword search
semantics. This paper further posits a semantics-based methodology to
develop XML keyword queries principally through data-centric coherency
ranking which is kernelled in the design of the domain and database which is
predicated on data dependence and mutual information models.
Consequently, keyword query results occur within a under schema

reorganization structures which process, present rank and query algorithms
through coherency ranking to develop answers. Actual XML data indicates
that coherency ranking is the methodology with the highest precision, recall
and ranking as compared with approaches.
2.0 OVERVIEW OF RELATED WORKS

In this particular section, a brief assessment of relevant publications on
optimization frameworks, XML query optimization and relational cost-based
query optimization is going to be previewed.
Cost-Based Query Optimization in RDBMSs
A researcher proposed the first cost-based ever query optimizer, which
formed part of System R and therefore was the prototype of the relational
database system. The optimizer had various capabilities which included the
optimization of linear and simple select, project, and join (SPJ) queries. A
simple cost model introduced by the authors was based on CPU costs and
weighted IO which used statistics on data page numbers consumed by
relations that bound the cost model concrete values [9]. The dynamic
programming algorithm provides a selected optimal operator fitting for
specific access paths. After that, an optimal join order is verified based on an
assumption of local optimality. In order to prune early the search space, not
all possible enumerations are considered. In their place, focus is laid on
interesting join orders, for instance orders which can do without additional
introductions of products of the Cartesian. Graefe and DeWitt showcased the

EXODUS Optimizer Generator and the purpose of this system was not
confined to a specific data model but it supported the algebraic
transformations specification as rules [5][17] [25]. Incorporated with a data
model that is concrete, the rules serve as input for the generator optimizer,
creating a tailor-made query optimizer.
Figure 1(a) showing the p-document T. Tag names are used to show ordinary
nodes, for instance C1, C2, and B1and B2. As for the distributional nodes,
MUX is shown as rectangular rounded cornered boxes while IND is depicted
as circles. Taking into consideration the IND2 node, there exist two children
nodes B2 and C1 with respective existence of probabilities 0.6 and 0.5 [5].
Therefore, for neither C1 nor B2, the absence probability being seen is
(1-0.6)*(1-0.5) =0.2.
Considering the MUX2 node that consists of three children, IND3, E2 and
D1node, their probabilities of existence respectively are 0.5, 0.3 and
0.1.Therefore, the probability for non-existence is
1 - 0.5 - 0.3 - 0.1 = 0.1.
Provided a p-document tree named T, there is a possibility of generating all
possible deterministic documents as sown below; basically traversing T in a
top-down manner, two situations arise that require to be dealt with
independently if:
(1) It occurs that it is an IND node consisting of m child nodes, 2m copies of T
are generated, and the IND node deleted; m child nodes are replaced with
one distinct subset as a copy which is a representation of them and the
ordinary parent node which is connected to each child node in the IND node
subset. The probability for this copy to occur for each copy is the product of
all probabilities that exist of the respective child nodes in the particular
subset and the absence probabilities for instance the existence probability
deducted from 1, of the child nodes which are not existent in the subset [5].
(2) For a MUX node consisting of m child nodes, m + 1 copies of T can be
generated, and the MUX node deleted, replacing the m child nodes with no
child or one distinct child node for a copy. An established connection from
the child node of MUX to the ordinary parent node is made. The existence
probability for every copy is the occurrence probability of the distinct child
node in the subset or the absence probability denoting no child node
appearance. For every T generated copy, the research adopts traversing
using the top-down approach until the deletion of all distributional nodes is
confirmed [5].
10
This study explores the retrieval of queries through a combination of

structural constraints which fundamentally use key words as a search tool to
represent an essential executable function in the XML systems of data
management. Through exploration of evidence-based research and the
extrapolation of related works, it is expected that the study will yield best
case answers in an effective and efficient manner similar to the conventional
key search while factoring in the constraints that may exist.
Query Optimization Frameworks.
Lanzelotte and Valduriez contributed a framework which is extensible for
query optimization which incorporates the concepts of modeling the
independent search space of a particular type of search strategy [19][21].
Using this approach, highly-extensible plans can be built by developers on
enumeration frameworks. Kabra and DeWitt made a proposition on OPT++
as an OO approach for very extensive query optimizing [16]. A combination
of extensible search components together with physical algebra and
extensible logical representation, the work of Lanzelotte and Valduriez is
lifted to object-oriented level [9].
XML Query Optimization
On XML queries optimization targets the strictly limited and isolated
problems of path expressions optimization using navigational access paths
and is deprived of HTJ and SJ operators support. Wu et al made a proposition
of a dynamic programming five novel algorithms for join reordering
structuring [10]. Their orthogonal approach is unique, for instance, it can be
used to select the most efficient join order in SJ-only scenarios. Compared to
this research, they use a simple cost model for heading the process of joinreordering and no consideration is made on the combination of HTJ and SJ
operators together with different access operators that are index-based [9].
The research introduced learning techniques that are statistical for cost
modeling of XML. In contrast to this research which is going to adopt a static
cost model approach, previous research demonstrate various ways to
describe the navigational access cost operator. Unfortunately, they fail to
focus on oriented HTJ and SJ operators. The researchers sketched the hybrid
cost based optimizer development for XQuery and SQL as part of the
DB2XML. Compared to the approach that this particular type of research
takes, an evaluation of every expression path using HTJ and a fine-granular
level cannot be decided upon whether to adopt SJ operators or neglect[3]
[11] [14][26]. The research describes an XPath framework of optimization
which performs deterministic optimization in algebraic form. There is no
11
framework for cost-based optimization and a full-fledged DBMS does not

seem to provide the solution. It described a cost-based XPath optimization
first approach. Contrary to this particular proposal, which is in support of the
optimization of XQueries, it has not considered HTJ and access paths that are
advanced CAS indexes or indexes [9].
Figure 1: Probabilistic XML document [18]
Keyword Query in Ordinary XML Documents

XML databases are involved in the query of various major keyword searches.
Given an XML data source keyword query, most of related work took lowest
common ancestors smallest LCA of the nodes that matched as the results to
be returned. Schema-Free XQuery and XRANK are able to compute SLCAs
and develop stack-based algorithms [12][28]. The Indexed Lookup Eager
algorithm is introduced when the appearance of the keywords with
frequencies which are different significantly. The Scan Eager algorithm will
take over the process once keywords register similar frequencies. Majority of
the authors and researchers of various previous works focused on inferring
the definition of returned results and discussed the differentiations of result.
The researchers provided more meaningful conclusions and utilized the
underlying XML statistics of the data for identification of the return node
types [9]. The researchers also proposed that a number of cleaning keyword
queries algorithms for optimality could be developed. This therefore resulted
into the designing of MS approach for computation of SLCAs for queries of
keywords in multiple manners. They took the Valuable LCA as results by
12
intentionally avoiding the false negative and false positive SLCA and LCA
[12][25]. The various researchers also proposed Indexed Stack, which was
an efficient algorithm for finding answers based on semantics of Exclusive
LCA. In addition, there exist other related works which process keyword
search through the integration of keywords into specific structured queries.
XMLQL, which is a new query language, has the structure of the keywords
and query separated. The research also introduced a method to embed
various keywords into XQuery for processing of the specific keyword search
[9].
Probabilistic XML
The probabilistic XML topic has been a recently studied subject in which
majority of the proposed models have been incorporated together with
evaluations of structured query. Nierman et al first introduced the concept of
ProTDB, with the existent probabilistic types MUX - mutually exclusive and
IND - independent. The researchers modeled the probabilistic XML in the
form of acyclic graphs, which support distributions that are arbitrary over
children sets. The research adopted a probabilistic tree approach for the
purpose of data integration where its possibility and probability nodes are
similar to IND and MUX respectively [9] [21][29].
A p-document, which is a probabilistic document written in XML specifies a
probability distribution across space of deterministic documents written in
XML. Each deterministic document that belongs to this space is referred to as
a possible word. A probabilistic document referenced as a tree that has been
labeled has distributional and ordinary nodes [11][14]. Ordinary nodes are
basically regular and normal XML nodes and their appearance may be seen
in deterministic documents, whereas distributional nodes are only used in
the definition of probabilistic process that involves the generating of
deterministic documents and their occurrence is not visible in those
documents[5][12] [24]. In the adaptation of PrXML {ind, mux} as part of the
XML model which is probabilistic, two distributional nodes types appear in a
p-document, which are MUX and IND[10][13].
Considering an example 1:
Consider Figure 1(a) showing the p-document T. Tag names are used to show
ordinary nodes, for instance C1, C2, and B1and B2. As for the distributional
nodes, MUX is shown as rectangular rounded cornered boxes while IND is
depicted as circles. Taking into consideration the IND2 node, there exist two
children nodes B2 and C1 with respective existence of probabilities 0.6 and
13
0.5 [5]. Therefore, for neither C1 nor B2, the absence probability being seen
is
(1-0.6)*(1-0.5) =0.2.
Considering the MUX2 node that consists of three children, IND3, E2 and
D1node, their probabilities of existence respectively are 0.5, 0.3 and
0.1.Therefore, the probability for non-existence is
1 - 0.5 - 0.3 - 0.1 = 0.1.
Provided a p-document tree named T, there is a possibility of generating all
possible deterministic documents as sown below; basically traversing T in a
top-down manner, two situations arise that require to be dealt with
independently if:
(1) It occurs that it is an IND node consisting of m child nodes, 2m copies of T
are generated, and the IND node deleted; m child nodes are replaced with
one distinct subset as a copy which is a representation of them and the
ordinary parent node which is connected to each child node in the IND node
subset. The probability for this copy to occur for each copy is the product of
all probabilities that exist of the respective child nodes in the particular
subset and the absence probabilities for instance the existence probability
deducted from 1, of the child nodes which are not existent in the subset [5].
(2) For a MUX node consisting of m child nodes, m + 1 copies of T can be
generated, and the MUX node deleted, replacing the m child nodes with no
child or one distinct child node for a copy. An established connection from
the child node of MUX to the ordinary parent node is made. The existence
probability for every copy is the occurrence probability of the distinct child
node in the subset or the absence probability denoting no child node
appearance. For every T generated copy, the research adopts traversing
using the top-down approach until the deletion of all distributional nodes is
confirmed [5].
Various researchers have proposed the adoption of a fuzzy trees model
where nodes are specifically associated with probabilistic event variables
conjunctions. A full complexity query analysis update on the fuzzy trees in
the research is also referenced. They also proposed algorithms that solve the
constraint-satisfaction which were efficient [6][18] [24][29]. The specific
sampling problem and query evaluation under constraints set can be well
defined to yield efficient query results that are expected. Other publications
summarized and extended the previously proposed probabilistic XML models,
tractability of queries and the expressiveness on different models were
discussed with the consideration of MUX and IND[5][13] [15][18] [27].
Various studies on the evaluation problem of twig queries considered over
14
probabilistic XML that may generate partial and incomplete answers with
particular respect to user probability threshold. The researchers also
addressed and proposed the ranking top-k probabilities problem of answers
of a twig query. In summary, the work that has been cited focused on
discussions of various probabilistic XML data models on a structured XML
query, for instance a twig query[1][7] [16][23]. Our research however is
going to be different in the sense that the keyword search problem in
probabilistic XML data is going to be critically previewed and analyzed [9].
XQuery Streaming Optimization
Querying XML streams
Several streaming algorithms exist that particularly focus on the querying
problem and the filtration procedure. Many of these algorithms center their
operations on tree-pattern queries (TPQs). TPQs efficiently correspond to
XPath queries which involve mainly descendant and child axes [14][26]. TPQ
streaming algorithms can be extended to facilitate the process obtaining
XPath queries which come along with ordered axes that involve preceding,
preceding-sibling, following, and following-sibling) [10].Processing techniques
are therefore introduced on ordered axes Streaming algorithms broadly fall in
three categories: The array-based approach, automaton-based approach and
the stack-bas.
XML temporal model
Previous studies conducted on time-based XML model have identified several
disadvantages and benefits. The bitemporal approach is inclusive of both
valid time and transaction time in timestamp attributes [22][26]. Normative
texts will always comprise of four time intervals. Normative texts consisting
of temporal values in an XML database represent new attributes of intervals
for instance efficacy time and publication time. This particular approach of
XML tree partitioning guarantees the distribution of data into partitions of
equal size making considerations of both the query processing load and data
storage cost [10].
Time Intervals and Model
The intervals are publication time efficacy time transaction time and validity
time. Transaction time refers to the time a transaction is reflected in the
database as a representation of an important factor for all transactions that
occur in time-referenced databases. Valid time refers to the interval that
indicates the time when the data becomes valid for general use or it may be
invalid and unusable [8][19][20]. Efficacy time is when data is used under
15
various conditions or in specialized cases only. Publication time represents

the alert time as to when or during the publication of data. Reason nodes
particularly hold data that is sensitive and may find utility in decision support
Systems [10].
Mode Relationship Evaluation
This research deviates from the other researches which focus on the
evaluation of the relationship between multiple nodes with the
implementation of heuristics-based intuitive rule. It will focus on the
relationship between multiple nodes in a data tree structure which is
measured by the adaptation of mutual information concept which derives its
application various data mining processes[5][6][11]. This operates on the
correlation of various database relation attributes. Being common for XML
data tree structures to consist of various nodes having the same labels but
occurring in different contexts, prefix labeled nodes are used to depict the
types of nodes. A prefix label path refs to a sequence of names of elements
appearing in along the path from the root node to the specific node in
question. The node types are used to identify the specific nodes that are
found in the data tree [16][22].
Figure 2: An example of an XML data tree structure, T [19]
For instance at node 4, the prefix labeled path in the data tree is defined by
dblp/proceeding/paper/author. Many occurrences can exist in a specific prefix
labeled path in the XML data tree structure, and these occurrences are
referred to as node instances [2][7][21][25]. It is therefore definite that all
the instances of a specific node will consist of the same prefix label path.
Every instance consists of a unique value which constitutes to the specific
set of key words contained directly in that particular instance. For instance,
16
using the tree structure in Figure 2, prefix label path:

dblp/proceeding/paper/author has instances 28, 20, 15, 11 and 4 which
consist of the values Richards, Wang, Zhang, Liu and Jinli respectively. A set
of distinct values of all the instances of the prefix labeled path u consists of a
value of the domain which is denoted as dom (u).It therefore becomes
dom(dblp/proceeding/paper/author) =
{Richard,Wang,Zhang,Liu,Jinli}
In this particular instance, every prefix label path has taken a value from
their respective value domains with specific probability [4][17].
Probability
The probability of obtaining value vu from the node u, originating from
dom(u) in then given data tree structure can be defined as the instances
consisting of the value vu over the overall number of u instances contained in
the specific data tree structure[11]. Taking for instance:
dblp/proceeding/paper/author may consider the value Richard with the
following speculative probability p(dblp/proceeding/paper/author=Richard)
= 1/5.
Join Probability
The hierarchical structure of XML data makes any two nodes to be joined in
different node levels leading to different joint distributions in multiple nodes
[11]. In relation to the data tree structure that is shown in Figure 1(b),
multiple nodes dblp/proceeding/paper/title and dblp/proceeding/paper/author
are joined at the node dblp/proceeding/paper at the instances 5 and 4. They
can also be joined through the node dblp/proceeding at the following
instances 12 and 4. This particular research study is going to refer to a node
denoted as c with multiple nodes v and u. which take the form of context
nodes v and u. p(vu, vv|c) is defined as join probability of the particular node
type u which provides the value vu and the node v provides the value vv
which jointly consecrate at node c. For instance:
p(dblp/proceeding/paper/author=Richard,
dblp/proceeding/paper/title=XML update | dblp/proceeding/paper) =1/5 .
This result query can also be derived from the following query
/; p(dblp/proceeding/paper/author=Jinli, dblp/proceeding/
paper/title=multimedia search | dblp/proceeding) = 1/8.
Table 1 and 2 therefore show the nodes dblp/ proceeding/paper/title and
dblp/proceeding/paper/author joining at the node (context) dblp/proceeding
and dblp/proceeding/paper found in the XML data tree structure T.
17
TABLE 1: THE JOIN PROBABILITY OF THE TWO SPECIFIC NODES (dblp/

proceeding/paper/title) AND (dblp/proceeding/paper/author) AT THE
RESPECTIVE CONTEXT NODE (dblp/proceeding/paper) [19]
TABLE 2: THE JOIN PROBABILITY OF THE TWO SPECIFIC NODES (dblp/

proceeding/paper/title) AND (dblp/proceeding/paper/author) AT THE
RESPECTIVE CONTEXT NODE (dblp/proceeding) [19]
In this particular type of scenario, p (vv) and p (vu) are the respective
probabilities of (v = vv) and (u = vu).
p(vu, vv|c) at the context node c is the join probability of v = vv and u = vu
A stronger relationship is denoted by the higher value of the mutual
information that is revealed between two specific types of nodes [16][20].
For instance, the mutual information of nodes dblp/proceeding/paper/author,
dblp/proceeding/paper/title at context node dblp/proceeding is given by
I (dblp/proceeding/paper/author; dblp/proceeding/paper/title|
dblp/proceeding/paper)
As shown in the Table 2, the mutual information of nodes

dblp/proceeding/paper/author, dblp/proceeding/paper/title at context node
dblp/proceeding is given by
I(dblp/proceeding/paper/author; dblp/proceeding/paper/title|dblp/proceeding)
18
This therefore indicates that there is variance in the type of mutual

information between two nodes that are in different contexts. The mutual
information between the author(s) and title in the paper context is higher
than the mutual information between the author(s) and title of two papers
that are different or which may be considered in the context of a proceeding.
We can therefore conclude that MI acts as a superb measure of showing how
closely nodes are interrelated to each other. The MI values scale has no
specific unique range as depicted by property 5 [25]. The property states
that the nodes can be bound by the minimum value of their entropy. In a
proper application of this particular concept, we require a unified scale for
the sole purpose of measuring the MI along with global node sets [4][18].
Node Relationship
Considering two nodes, u and v which are joined at the context of c, the
relationship of the two nodes is defined as:
In particular case, H (u) and H (v) refer to the specific entropy of nodes u and
v respectively and the values are calculated the same way as the random
variable entropies [16].
When the value of rel (u; v|c) is high, this means that the relationship
between nodes u and v is strong at the context node c. For instance, the
entropy of nodes dblp/ proceeding/paper/title and
dblp/proceeding/paper/author can be obtained by:
H(dblp/proceeding/paper/title)
= [(1/5) log(1/5) + (1/5) log(1/5)
(1/5) log(1/5) + (1/5) log(1/5) + (1/5) log(1/5)]
= log(1/5) = log 5 = 0.70
H(dblp/proceeding/paper/author)
= [(1/5) log(1/5) + (1/5) log(1/5)
(1/5) log(1/5) + (1/5) log(1/5) + (1/5) log(1/5)]
19
= log(1/5) = log 5 = 0.70

This therefore implies that the relationship between
dblp/proceeding/paper/title and dblp/proceeding/paper/author meeting at
context node dblp/proceeding/ paper can be derived as
rel(dblp/proceeding/paper/author; dblp/proceeding/paper/tile
|dblp/proceeding) = 0.49
0.70
= 0.7
This is also similar to:
dblp/proceeding/paper/title; rel(dblp/proceeding/paper/author
|dblp/proceeding/paper) = 0.70
0.70
= 1.0
This therefore shows that the relationship that exists between any two given
nodes must fall in a specific range of [0,1] in any XML data tree given
multiple nodes u and v at any particular context node c.
0 rel(u; v|c) 1
This can further be proved when property 3 as stated previously is closely
examined. It states that
I(u; v|c) 0, which therefore implies that
Property 5 also states that I(x; y|c) H(y) and I(x; y|c) H(x). This therefore
generates:
Dominance Lowest Common Ancestor (DLCA)

In order to retrieve a particular hit or answer in a wide mass of LCA based
candidates, this research proposes the use of new semantics referred to as
Dominance LCA. We begin by the introduction of the relationship between
LCA-based candidates.
Dominance relationship
Query candidates are represented by their root nodes can be depicted as
subsets of the sub trees. For instance given a keyword search query say Q
={k1, . . . , kq}a specific candidate of the query Q called S is represented in
the form S(nlca, {n1; . . . ; nq}) in this particular result nq refers to a leaf
node that contains ki and nlca becomes the distant and the lowest common
ancestor of the series {n1; . . . ; nm. Identifiers are used to identify each
20
node in the candidate series which according to this research is encoded as a

Dewey code.
The foundations of the Dewey code are derived from the Dewey Decimal
Classification which were developed for the purpose of classification of
general knowledge[1][23]. With the implementation of the Dewey coding, a
vector is assigned to each node which is a representation of the path to the
node from the tree root. The local order of the ancestral node is represented
by the each component that is found along the path. This can evidently be
illustrated in the Figure 3:
Figure 3: Data Tree T2 [19]
The researcher selected to encode the specific node identifiers with Dewey
code since it is very useful in the representation of the hierarchical
relationships that exist between nodes of a tree that forms a very important
variable in the tree structure. The corresponding label path of the specific
node can be found from the Dewey code. For instance, considering the
sample data tree structure T2 in the Figure 3, every node s always identified
by the Dewey code. For a node identification of [0.1.0.0], the corresponding
label path of the corresponding node n1/n3/n4/n5. We therefore give a name,
ID2LP(id) which is an id which represents the Dewey code that serves as an
input and reflects the corresponding path label[13][24]. There is a vast
chance of the possibility that the key words in a particular search tree may
yield many occurrences in the specific sub tree candidate S(nlca, {n1; . . . ;
nm}). Every keyword yields a set of Li = {ni|val (ni) with the keywords ki (1
i m)}
The relationship between the various keywords that are produced in the
specific search tree is given as:
21
In this particular case scenario, ID2LP(ni) is an important function which

returns corresponding node types w9th the Dewey code ni and rel(ID2LP(ni);
ID2LP(nj)|ID2LP(lca(ni, nj))) which is normally calculated by the formulae
stipulated in formula (1). This therefore measures the correlation between
the nodes that have been tagged with the Dewey codes ni and nj at the
lowest common ancestor [18][25]. This therefore implies that the
relationship between the keywords ki and kj contained in the candidate
structures is analyzed as the maximum relationship that exists between two
nodes that contain two keywords in that specific candidate[13].
For instance, taking a query Q = {k1, k2, k3} with a specific data tree T2,
only one of the sub-tree Q candidate is present and is rooted at a place node
of n3 [0.1]. In Tree T2 and this can be represented and can also take the form
of S(0.1, {0.1.0.0; 0.1.1.0; 0.1.1.1}). The relationship that exists in the
keyword queries in the specified candidate S can be calculated as follows:
rel(k2, k3) = rel(ID2LP(0.1.1.0); ID2LP(0.1.1.1)|ID2LP(0.1.1))
= rel(n1/n3/n6/n7; n1/n3/n6/n8|n1/n3/n6)
rel(k1, k3) = rel(ID2LP(0.1.0.0); ID2LP(0.1.1.1)|ID2LP(0.1))
= rel(n1/n3/n4/n5; n1/n3/n6/n8|n1/n3)
rel(k1, k2) = rel(ID2LP(0.1.0.0); ID2LP(0.1.1.0)|ID2LP(0.1))
= rel(n1/n3/n4/n5; n1/n3/n6/n7|n1/n3)
Provided the keyword query Q = {k1, . . . , kq} the relationship of each pair
calculated is stored in the vector Ds of the query keywords in sub-tree
candidate S. The keyword relationship vector in this particular research is
defined by:
DS = [rel(ki, kj)|ki, kj Q (i < j)]
Having a total of C2q combinations of two-keywords derived from a stable set
of say q keywords {k1, . . . , kq}the vector Ds therefore contains C2q number
of elements. This is normally denoted as |DS| = C2q . For instance, the vector
of the keyword relationship that corresponds to the candidate S(0.1,
{0.1.0.0, 0.1.1.0, 0.1.1.1}) query Q = {k1, k2, k3} consists:
of C23 =3!= 3
Respective elements
2!(31)!
DS = [rel(k1, k2), rel(k1, k3), rel(k2, k3)].
Letting Ds and Ds become the two specific types of keywords in a specific
relationship, of the candidates S and S, the dominance relationship that
exists between the candidates S and S id can therefore be defined [12].
22
Dominance
Letting S and S to become the two candidates of the XML search query Q
over a specific named and given database T, S dominates S. This is
represented as S > S and this condition will only hold if the following are
met:
j(1 j d)DS[ j] < DS_ [ j] and i(1 i d)DS[i] DS_ [i]
In this scenario d refers to the keyword length relationship vector of S and
S which is (d = |DS| = |DS_| =C2q). Ds[i] is the element in the ith vector Ds
Candidate S dominates S in the relationship [4][11][24].
3.0
MEASUREMENT OF THE RELATIONSHIP BETWEEN NODES IN A

DATA TREE
This particular section reviews the mutual information (MI) concept alongside
with various other concepts that are related. The in depth detail of this
particular concept will be discussed with emphasis on the concept adaptation
in the measurement of the meaningful relationship that exists between
various nodes that exist in an XML data tree.
23
Mutual Information Concepts

Mutual information and entropy
These are very central and fundamental concepts that do exist in the field of
the information theory. Entropy therefore refers to the measure of
uncertainty of a particular random variable. MI quantifies the existing mutual
dependence of two particular random variables [3][8][19][27].
Entropy: Taking a discrete random variable x which takes the value vx
extracted from the set dom (x) which is generalized and governed by a
probability distribution function of value p (vx).The definition of entropy of x
is defined as follows:
Conditional Entropy of a particular random variable say y provided a second

variable x, which is referred to as entropy y conditional x, which usually takes
the general form of H (y|x) has the following definition:
In this particular type of equation, p (vy,vx) refers to joint probability of (y=vy)

and (x=vx); whereas p(vx,vy) given (x=vx) is the conditional probability of the
equation (y=vy) [10][18].
Mutual information
In reference to two random variables it can be referred to as a quantity that
measures mutual independence between two variables [6]. In a given case
scenario, discrete variables x and y which are random, the definition of their
mutual information can be defined as:
In this particular scenario, p (vx,vy) refers to the joint probability of the

defined variables (x = vx) and (y= vy). In this particular scenario, p(vx) and
p(vy) are the probabilities of (x=vx),(y=vy) respectively.
There are various properties that characterize Mutual Information, and some
of the existing properties are detailed as follows:
Property 1:
I (x; y) = H (x) H (x|y) = H (y) H (y|x)
24
This deals with the interpretation of Mutual Information. It indicates that the
information that has been provided by y concerning x is the reduction or
decrease in the uncertainty of x provided the knowledge possessed by y.
Similarly, this occurs for all the bits of information availed by x concerning
random variable y. The value of the mutual information is directly
proportional to the information that is revealed by both the variables x and y
in this particular property [16][20][24][29].
Property 2:
I (x; y) = I (y; x)
It puts forward that mutual information takes a symmetric form, meaning
that information availed by x concerning y is the very same type of
information y conveys about x [16][20][24].
Property 3:
I(x; y) 0
The lower bound of the mutual information is given in this particular
scenario. Given I(x; y) = 0, we get the result p(vx, vy) = p (vx) p (vy) for the
possible values of x and y. These means that the variables x and y are
independent, therefore obtaining the value of x does not necessarily provide
clues of the probable or exact value of the variable y. This therefore puts
their mutual information at zero [5][16][20][29].
Property 4
I(x; x) = H(x).
This property puts forward that mutual information of variable x is by itself
the entropy of x. This therefore means that entropy is also referred to as selfinformation [16][24].
Property 5:
I(x; y) H(x) and I(x; y) H(y)
The mutual information that exists between two variables is limited and
bound to the minimum of their specific entropy [16][20][24].
25
4.0 ANSWERS RETRIEVED FROM TOP-K

The researcher observes that DLCA answers alter with different search
queries. Conducting a data and information search, users usually are
interested in top-k answers. They are sorted in descending order using their
respective relevance degrees to the need of users information. This section
defines three ranking functions used for identification of the top-k results for
a keyword-based sequential search through XML data. The particular ranking
functions used in this study exploit different and several aspects of
dominance relationships existing between query candidates for ranking their
relevance degree to the specific search query[1][20][28].
Provided C (Q, T) as a set of candidates of a specific query Q in an existing
XML database T, the degree of relevance of a candidate based is measured
on the following three ranking scores:
Dominating Score
Provided a candidate answer structure S, the dominating score of S is
defined as follows.
scoredg(S) = |{S C (Q, T)|S >S}|
(3)
The dominating score of a specific candidate scoredg(S) shows the cumulative
count of candidates which S dominates. A candidate portrays more relevance
if it dominates as numerous and many other candidates as it possibly can.
Therefore, a higher dominating score of a specific candidate S denotes that S
is more significant to the specified query [7][12].
Example of an Instance 1:
Letting S C (Q, T) and SC (Q, T) be two respective candidates of a
specified query Q in a stated XML data tree T. Therefore, if S >S, then this
implies that scoredg(S) scoredg(S).
This example can be proved through using the transitive property which is a
subset of a dominance relationship. Therefore, for any two candidates on a
query Q, S C (Q, T) and S C (Q, T), if S>S, then Si C (Q, T)|S >Si, we
therefore have S >Si. Finally,
|{Si C (Q, T)|S >Si}| |{Si C (Q, T)|S >Si}|, or it can be stated as
scoredg(S) scoredg(S)
26
This particular example gives an assurance that candidate S dominates

candidate S, which then means that S is ranked higher as compared to S in
the top k results that have been returned [4][11][24].
Dominated Score
Provided a candidate answer S, the dominated score of S is defined as
follows:
scoredd(S) = |{SC (Q, T)|S>S}|
(4)
The dominated score of the specified candidate S, scoredd (S), shows the
number of other different candidates which can dominate S. Therefore, the
lower the dominated score, the more meaningful to the query for candidate
S [20][24]. This therefore implies that candidate S is more relevant when
dominated by fewer candidates as possible.
Dominance Score
Example of Instance 3
Letting S C (Q, T) and S C (Q, T) be two respective candidates of a
specified query Q in a stated XML data tree T.
If S >S, then scoredd(S) scoredd(S_ ).
This example can be proved in a similar manner like the previous examples.
For any existing two candidates S C (Q, T) and S_ C (Q, T), if S >S then
Si C (Q, T)|Si >S, we have Si _ S_ [16][20][24]. Therefore, |{Si C (Q, T)|Si
_ S}| |{Si C (Q, T)|Si >S}|, or scoredd(S) scoredd(S)
5.0 ALGORITHMS USED TO RETRIEVE TOP-K RESULTS

This particular section is meant as an introduction to algorithms which
identify relevant search results and the top-k answers, normally based on
various skyline semantics in accordance to the aforementioned criteria of
ranking. In order to obtain the specified set of LCA-based candidates of a
particular given keyword query, given other significant approaches in the
literature, the research adopts the inverted indexes [17]. These particular
indexes are built offline during a time it parsed the XML database tree
structure. Specifically, letting Q = {k1, . . . , kq} be parsed a given keyword
query and ILi be the inverted list consisting of keyword ki. Every entry
contained in the inverted list ILi is the Dewey code of a particular node
containing the keyword ki. The candidate set C of query Q is defined as
C = {lca(n1, nq)|n1 IL1, . . . , nq ILq},
Given that lca(n1, . . . , nq) is an operation that gives the lowest common
ancestor of {n1, . . . , nq}, the keyword relationship vector of every
candidate is concurrently fed as input during the candidate generation
process. The generated candidates are stored in a specified list ordered by
the values of their relevant keyword relationship vectors [4][17][25]. The
detailed explanations will be in the following subsections.
27
28
Nave Algorithm for Selection of Top-K Answers

Algorithm 1: Nave Algorithm [17]
The nave algorithm used for identification of the top-k results that are
desired corresponding to their respective dominated scores (similarly,
dominance and dominating scores) is illustrated in the Algorithm 1. This
specific algorithm iterates through every candidate in the specified
candidate set and facilitates the calculation of its score by performing pair
wise dominance checks between these candidates and all other candidates
defined in the set (lines 26) [17][21][23]. The resultant set is then updated
depending on the result obtained on the score compared between the
current kth candidate and the new candidate in the current top-k results
(lines 713) [17][21][24][28].
The major drawback of this particular algorithm is that its specified
computational cost is very high because regardless of the value of k, there is
need to iterate through each component candidate found in the candidate
set and calculate the score derived by each candidate by performing the
specified pair wise dominance checks that occur between the candidate with
all other present candidates in the existing set [16][20][24]. This therefore
means that no matter what the derived value of k is, the algorithm
exhaustively performs and conducts all pair wise dominance tests across all
candidates.

TABLE 3: TWO DIMENSION SET CANDIDATE DATA [24]
D1
S1
0.95
D2
0.9
S2
0.15
0.5
S3
0.1
0.95
S4
0.5
0.4
S5
0.8
0.8
S6
0.9
0.4
S7
0.4
0.4
S8
0.3
0.2
S9
0.7
0.6
S10
0.3
0.3
29
For instance, provided a set of candidates in Table 3, in order for the proper
identification of the top-3 results, there is need to calculate the score derived
by of each candidate Si(1 i 10) through iteration over 9 other candidates
and conducting a pair wise dominance check [24][28]. This therefore implies
that it takes 10 9 = 90 pair wise dominance checks. Generally, for
calculation of the score of a particular candidate in a given set of n
candidates, there is need to do pair wise dominance checks between that
specific candidate together with (n 1) other candidates found in the set.
Top-K Dominated Algorithm (TKDD)
The chief aim of TKDD is algorithm to each candidate is to efficiently find the
number of other candidates which dominate it, while avoiding exhaustive
pair wise comparisons between the candidates[2][8][24] . After the retrieval
of k results, the score of the k-th result is used as a maximum threshold and
therefore pruning occurs for the candidates whose overall dominated scores
extend the threshold [24]. To add to that fact, safe termination of the
algorithm is guaranteed if the scores of all the remaining candidates exceed
the provided threshold. More specifically analyzed, the TKDD takes course
through the following four steps:
30
(i) Initialization
(line 1): the result set R and minValue are initialized;
(ii) Termination condition
TABLE 4: SEARCH SPACES LA AND LB USED TO CALCULATE scoredd(Si) AND

scoredg(Si) RESPECTIVELY DERIVED FROM A LIST L OF CANDIDATES WHICH
ARE SORTED IN DESCENDING ORDER OF SPECIFIC F() VALUES [24]
TABLE 5: CANDIDATES LIST SORTED IN DESCENDING ORDER OF F() VALUES

[24]
TABLE 6: DOMINANCE CHECKS COUNT USED FOR CALCULATION OF THE

DOMINATED CANDIDATE SCORES [24]
Figure 4: List of candidates sorted in descending order using their M()

values [24]
Algorithm 2: TKDD [17]
(lines 46): Provided that M() value of the present candidate S is below the
minimum value of the current k-th candidate in R, the algorithm terminates
and the resultant set R is returned;
(iii) Dominance checks (lines 710):
31
32
The pair wise dominance checks between S and every other candidate Sin
the respective search space of S where the operation takes place. The
dominated score of S is found to be increased by 1 every time another
candidate dominates [17][22].
(iv) Result updates (lines 1117): provided that k results are existent and the
dominated score of the k-th candidate is larger than the current candidates
score, the k-th candidate is ejected and the current candidate is put into R;
otherwise if it becomes less than k results exist in R, there is an insertion of
the current candidate into R. Finally, taking the size of R as k, the threshold
minValue undergoes updating (lines 1821) [17][22][24].
Top-K Dominating Algorithm (TKDG)
The chief aim of TKDG algorithm is to perform retrieval of the top-k results
which dominate the larger number of the other existing candidates. This
possess as a more challenging task as compared to TKDD, because of the
larger search space associated with TKDD [17][24]. To illustrate:
Letting pos(S) be the position of the candidate currently examined S in the
list dubbed L of sorted candidates in F() values descending order.
Calculating scoredd(S), TKDD performs a number of pos(S) tests of dominance
on candidates whose F() values F(S). However, the calculation of
scoredg(S), TKDG performs (|L| pos (S)) tests on those candidates whose
F( ) values F (S).As the top relevant results are usually located in the top of
L (i.e., pos(S) _ |L|), it implies that posS _ (|L| pos(S)). Thus, the search
space of the TKDG algorithm is significantly larger than that of TKDD [17][21]
[24].
Letting S be a candidate currently examined. S has the probability to
dominate a maximum number of (|L| pos(S)) candidates. Therefore, taking
(|L| pos(S)) scoredg(Sk), where Sk refers to the k-th result in R, the
specific algorithm terminates. Specifically, TKDG algorithm (Algorithm 3)
proceeds in its execution in the following steps:
(i)
Initialization (line 1): result set initialization;
(ii)
Stopping condition (lines 38): provided an existence of k
candidates in the result set together with the dominating score of
the k-th candidate should be equal to or greater than (|L| pos(S)),
the algorithm then terminates [17][24].
(iii) Dominance tests (lines 914): the particular pair wise dominance
test between each other candidate and the currently examined
33
candidate found to exist in the search space provided will be

performed [24].
Concurrently, the dominating score of the candidate is calculated;
(iv) Result set update (lines 1523): provided that there is already a k
result in the particular result set, this implies that the k-th candidate
obtained from the result set is displaced and replaced with a new
candidate provided that its dominating score is of a lesser measure
than the current candidates score. If this factor is not achieved,
provided that it is less than k, its results exist in R, it therefore
inserts the new candidate formed into R [17][24].
Algorithm 3: TKDG [17]
34
6.0 EXPERIMENTAL EVALUATION

The researcher performed and designed a couple of experiments to analyze
the search performance of the approach. In the experiment the researcher
evaluates the outcomes and results of the various experiments in order to
compare the efficiency and quality of the approach that the researchers used
and other possible approaches that would have been used[3][5][6] [11][17]
[24][28].
Experimental Setup
The experiments were conducted on the Pentium 4, 3.2GHz computer
operating on windows XP Professional and it had an internal memory of 2GB.
Java was used to implement all the approaches [3][11][24][28].. The
researchers used Oracle Berkeley DB 11g as a tool to store and manage the
data indexes that were used to perform the experiments [5][11][28].
The three XML data sets that were tested include Auction 36.7 MB, Mondial
1MB and DBLP Computer Science Bibliography 877 MB. DBLP Computer
Science Bibliography includes a list of bibliographic information of major
computer science proceedings and journals. Mondial on the other hand is a
worldwide geographic database or platform that has been integrated from
the world fact book of the CIA, TERRA database, and the international atlas
among many other sources. Auction is a form of synthetic benchmark set of
data that has been generated by the XML generator using default DTD from
XMark [13][25][27].
35
Query Sets
The researchers asked a group of learners to submit fifty various keyword
questions to search and evaluate on every data set. Every query contained a
specific set of search key words and also a brief description of each query
was also very necessary in order to understand and identify the key
intension of the query[3][5][6] [11][17][24][28]. The researchers at the same
time observed that searching on a specific domain like the three main data
sets that they were experimenting on was not effective as the keyword
queries were ambiguous. This made it had for the users to express the
search intention. Due to this, it is sometimes difficult to obtain the relevant
results and outcomes of the queries at hand which are prerequisite for the
researchers to analyze the performance of their approach and other
available approaches[3][5][24][28].
Search Quality
The researchers compared the quality of the DLCA approach with the other
various approaches that exist like; ELCA, CVLCVA, XReal, MLCA, SLCA and
XSearch. The quality of these approaches were measured in three metrics
popular for retrieval of information: recall (R), F-measure and precision (P) [6]
[11][17][28]. In order for the researchers to recall and compute precision
they reformulated manually the keyword questions into schemas aware
queries based on the data sets schemas and the keywords query
descriptions. The researchers then took the results of transformed queries
results as a platform on which they computed the recall and precision of the
queries according to the platform as follows; given the key word query Q and
its corresponding XQuery that has been transformed [2][18][27]. The
accurate outcome set of Q which is the result a specific algorithm on Q is
recorded as retrieved results [2][11][24][28]. The precision and recall of this
algorithm can be defined as follows.
The precision is a fraction of retrieved results relevant to the search:
P= ((relevant results)n(retrieved results))
(Retrieved results)
The recall is a fraction of the relevant results which are successfully retrieved
by the search system
R= ((relevant results) n (retrieved results))
(Relevant results)
The F- measure which shows the trade-off between the recall and precision is
computed as;
36
F-measure= (1+B^2) PR
B^2 P+R
Where B = 1 the recall and precision are equal, where B < 1 precision is
emphasized and where B > 1 recall is emphasized.
From the calculations it is clear that the relevant results of each key word
query needs to be determined before the calculation of the appropriate
evaluation metrics. To acquire the relevant results of the tested queries the
researchers formed the manual corresponding schema aware Xquery with
the assistance of users [6][17]. The appropriate result of the queries was
then used as the basis for performance evaluation of the researchers
approach and other available approaches.
The researchers conducted experiments with a set of 50 keyword queries by
using various approaches and they measured the recall and precision of
every approach by finding the average of recall and precision values of the
tested queries.
The relationship and comparisons of recall and precision of the researchers
approaches in the three various data sets are shown below.
TABLE 7: PRECISION AND RECALL OF QUERIES ON MONDIAL DATA [11]

ELCA
SLCA
XSear
ch
CVLC
A
MLC
A
XREA DLCA
L
Precisi
on
0 .5 0
3
O .7 1
2
0 .6 3
5
0 .6 7
0
0 .7 1
2
0 .7 2
1
0 .9 2
2
Recall
1 .0 0
0
0 .6 2
4
0 .9 4
3
0 .9 1
0
0 .9 0
3
0 .6 4
7
0 .9 3
9
TABLE 8: PRECISION AND RECALL OF QUERIES ON AUCTION DATA [11]

ELCA
SLCA
XSear
ch
CVLC
A
MLC
A
XREA DLCA
L

Precisi
on
0 .4 7
8
0 .7 0
6
0 .6 2
3
0 .6 4
0
0.6 9
9
0 .7 0
3
0 .9 0
1
Recall
1 .0 0
0
0 .6 5
0
0 .9 3
1
0 .9 2
0
0.9 0
7
0 .6 5
0
0 .9 3
1
37
TABLE 9: PRECISION AND RECALL OF QUERIES ON DBLP DATA [11]

ELCA
SLCA
XSear
ch
CVLCA
MLCA
XREAL
DLCA
Precisi
on
0 .5 2
3
O .7 3
3
0 .6 4
0
0 .6 8
8
0 .7 2
0
0 .7 3
3
0 .9 3
4
Recall
1 .0 0
0
0 .6 4
7
0 .9 4
1
0 .9 1
1
0 .9 2
3
0.6 4 7
0 .9 4
1
TABLE 10: COMPARISONS ON RANKING EFFECTIVENESS OF THE

ALGORITHMS [11]
MAP
RPREC
bpref
RRANK
P@1
P@5
P@1
0
TKDD
0.870
0.830
0.7 5 0
0 .8 6 0
0 .8 9
0
0.8 7 0
0 .8 1
0
TKDG
0 .8 5
0
0 .8 2
0
0 .7 9
0
O .8 7 0
0.9 2
0
0.8 9 0
0 .8 4
0
TK D-0 .2
5
0 .8 4
0
0 .8 0
0
0 .7 9
0
0 .8 6 0
0 .9 1
0
0.9 1 0
0 .8 6
0
TK D-0 .5
0
0 .8 6
0
0 .8 2
0
0 .7 6
0
0 .8 6 0
0.9 0
0
0 .8 7
0
0.8 2 0
TK D-0 .
75
0 .8 7
0
0.8 5 0
0.7 3 0
0.8 8 0
0.8 8
0
0 .8 5
0
0.8 0 0
XRANK
0 .6 7
0
0 .7 5
0
0 .6 1
0
0 .7 1 0
0 .6 9
0
0 .6 8
0
0 .6 5
0
XSEARCH
0 .7 0
0
0 .7 7
0
0 .6 3
0
0 .6 8 0
0 .7 3
0
0.6 8 0
0 .6 6
0
38
All the ranking algorithms makes it possible to identify the top ten results at
a precision ranging between eighty to eighty five percent. The mean
average precision of the algorithm is 85% and the researcher could even
achieve more accurate precision by selecting a suitable value which can
maximize the balance and relationship the dominating and dominated scores
[6] [11][24][28].
Efficiency and Scalability of Top-K Algorithms
The researchers tested ten queries with various lengths in every data set.
They tested about five thousand candidates in default scenarios and the
number of results found was thirty. The queries which had less than required
number of results from candidates, the researchers made a replica of the
candidates repeatedly until they obtained the required candidate number[3]
[11]. The researchers then selected randomly the required candidate number
from the set that was duplicated. The cost of computation of the algorithm is
shown in the figure below. It is clear that when the candidate number
increases the algorithm processing time also increases but at different trends
[3][9][14][17]. TKDD in this case is the most effective and efficient method it
is less affected by the increase in the number of candidates. TKDD is mainly
concerned with in the results which are dominated fewest number of
candidates as possible. This is because the results are usually located at the
top of the list of sorted candidates and as a result it searches a small portion
of the candidate list. For TKDG the search space is much larger and as a
result there is expected delay. On the other hand the lower performance of
TKDD is also as a result of the score that is dominating hence it explains why
its processing time rises the same way as the TKDG which has a small
overhead used in calculating and finding the dominating score [5][9] [13][24]
[28]. From the results in Figure 5, it is clear that the TKDD processing time is
less affected by the increase in number of k of the returned results and it can
return from ten to one hundred results from the set of five thousand
candidates within a second. The TKDG processing time algorithm is more
affected by the change of the parameter but it takes 2.5 seconds to get back
to the top one hundred results from asset of five thousand candidates.
Figure 5: Ranking efficiency of TKDD, TKDG and TKD algorithms [9]
39
7.0 CONCLUSIONS
In the thesis the researchers have studied the issue of identifying the most
accurate outcomes and results and the top-k appropriate results for XML
keyword questions or queries in this matter. The use of major keywords to
search for documents has been widely accepted as a very convenient way in
retrieving resources from various remote servers that hold that specific type
of data on the internet. Majority of the search engines such as Google, Bing,
Yahoo and many more have adopted the use of these technologies so as to
efficiently facilitate the process of data mining and data warehousing. The
adaptation and use of keywords for querying various databases has attracted
various researches to be conducted by the research community from the
affected fields of database and information retrieval (IR). XML documents are
composed of nested XML attributes from the root elements to the nested
sub-elements. XML elements often reference other elements which are
queried as XML values and therefore the text content is captured using the
deputation contains (u, k). Consequently, the predicate returns true when the
element u has keyword k while an XML query Q is mapped from an XML
database D to XML documents that characterize the query output. As a
result, when the XML database environment is UD is and the XML document
sequence environment is S, the outcome is Q: UD S. Q(D) is the result of
query Q over database D whereby the query is identified using XML query
language for instance XQuery. Therefore considering a sequence s, then e
s is true when e is in s. Consider a p-document, which is a probabilistic
document written in XML specifies a probability distribution across space of
deterministic documents written in XML. Each deterministic document that
belongs to this space is referred to as a possible word. A probabilistic
document referenced as a tree that has been labeled has distributional and
ordinary nodes. Ordinary nodes are basically regular and normal XML nodes
and their appearance may be seen in deterministic documents, whereas
distributional nodes are only used in the definition of probabilistic process
that involves the generating of deterministic documents and their occurrence
is not visible in those documents. In the adaptation of PrXML {ind, mux} as
40
part of the XML model which is probabilistic, two distributional nodes types
appear in a p-document, which are MUX and IND.
The researchers have strived to address the three vital requirements and
conditions for effective keyword searches of the XML. The researchers have
introduced new methods of analyzing the relationship between query key
words in the candidates using mutual information idea and come up with a
new DLCA keyword queries semantic. The researchers also have a proposed
strategy and method of selecting the results of DLCA from multiple
candidates and the three ranking methods used in selecting top-k results
based on skyline queries semantics. Some of the properties which have been
proven have been acquired to accelerate proposed algorithms. The findings
and experiments have been conducted to analyze and evaluate the
researchers experimental results and approach and they show that the
approach performs better than the approaches that have been used in the
data sets that have been tested and the evaluation metrics. This is a very
efficient way of facilitating the retrieval of documents because it does not
involve the learning of any concepts. This process is an advancement of the
traditional search algorithms that were specifically involving and required the
mastering of the particular IP (Internet Protocol) addresses of various
documents or information content and typing them in the URL bar. His later
advanced and the IP addresses were able to be attached to various web
links. It is from this that the search engines were developed with a more
interactive and responsive algorithm that was able to handle a lot of bits and
pieces of information including data mining. A variety of approaches have
been accessed and previewed to find alternatives to the keyword queries as
opposed to the XML data. The basic approaches that currently exist use
lowest common ancestor (LCA) type of semantics as opposed to the common
graph theory for identification of the hit list given a certain keyword query.
This particular approach generates results composed of all candidates, also
known as sub trees, containing an instance of the queried keywords. The LCA
returned values can be numerous yet the user may just be interested in a
portion or bit of the whole hit list. It therefore remains and unsolved issue to
be able to identify the exact dataset that is required by the user of the
system. The ideal situation and the best case scenario would be for the
system to be able to generate an exact piece that is required by the user as
opposed to providing a whole set of hits which also gives the user an extra
job to filter the content until they obtain an exact piece. The researchers
have strived to address the three vital requirements and conditions for
effective keyword searches of the XML. The researchers have introduced new
methods of analyzing the relationship between query key words in the
candidates using mutual information idea and come up with a new DLCA
keyword queries semantic. The researchers also have a proposed strategy
and method of selecting the results of DLCA from multiple candidates and
the three ranking methods used in selecting top-k results based on skyline
queries semantics. Some of the properties which have been proven have
41
been acquired to accelerate proposed algorithms. The findings and

experiments have been conducted to analyze and evaluate the researchers
experimental results and approach and they show that the approach
performs better than the approaches that have been used in the data sets
that have been tested and the evaluation metrics examined.
A simple cost model introduced by the authors was based on CPU costs and
weighted IO which used statistics on data page numbers consumed by
relations that bound the cost model concrete values. The dynamic
programming algorithm provides a selected optimal operator fitting for
specific access paths. After that, an optimal join order is verified based on an
assumption of local optimality. In order to prune early the search space, not
all possible enumerations are considered. In their place, focus is laid on
interesting join orders, for instance orders which can do without additional
introductions of products of the Cartesian. Graefe and DeWitt showcased the
EXODUS Optimizer Generator and the purpose of this system was not
confined to a specific data model but it supported the algebraic
transformations specification as rules. Incorporated with a data model that is
concrete, the rules serve as input for the generator optimizer, creating a
tailor-made query optimizer. This paper systematically explores XML
structure-based answers and user expectations in order to identify the
significance of XML keyword search semantics. This paper further posits a
semantics-based methodology to develop XML keyword queries principally
through data-centric coherency ranking which is kernelled in the design of
the domain and database which is predicated on data dependence and
mutual information models. Consequently, keyword query results occur
within a under schema reorganization structures which process, present rank
and query algorithms through coherency ranking to develop answers. Actual
XML data indicates that coherency ranking is the methodology with the
highest precision, recall and ranking as compared with approaches. Current
keyword searches in XML can be divided into tree and graph supported
searches which are largely predicated on structural document features.
However, these approaches on structure do not comprehensively utilize the
hidden semantics within the XML documents leading to issues in the
processing of specific keyword query classes. The growing reputation of XML
has intensified the necessitation of an accessible and precise XML query
interface that is predicated on natural language and search procedures that
exploit XML structures to simplify queries by ordinary users within XML
databases.
42
REFERENCES
[1]Alghamdi, Norah Saleh, Wenny Rahayu, and Eric Pardede. "Object-based
semantic partitioning for XML twig query optimization." In Advanced
Information Networking and Applications (AINA), 2013 IEEE 27th
International Conference on, pp. 846-853. IEEE, 2013.
[2]Baa, Radim, and Michal Krtk. "XML query processing: efficiency and
optimality." In Proceedings of the 16th International Database Engineering &
Applications Sysmposium, pp. 8-13. ACM, 2012.
[3]Consens, Mariano P., Xin Gu, Yaron Kanza, and Flavio Rizzolo. "Self
Managing Top-k (Summary, Keyword) Indexes in XML Retrieval." In Data
43
Engineering Workshop, 2007 IEEE 23rd International Conference on, pp. 245252. IEEE, 2007.
[4]Georgiadis, Haris, Minas Charalambides, and Vasilis Vassalos. "Efficient
physical operators for cost-based XPath execution." In Proceedings of the
13th International Conference on Extending Database Technology, pp. 171182. ACM, 2010.
[5]Li, Jianxin, Chengfei Liu, Rui Zhou, and Wei Wang. "Top-k keyword search
over probabilistic XML data." In Data Engineering (ICDE), 2011 IEEE 27th
International Conference on, pp. 673-684. IEEE, 2011.
[6]Nguyen, Khanh, and Jinli Cao. "Top-k answers for XML keyword queries."
World Wide Web 15, no. 5-6 (2012): 485-515.
[7]Nguyen, Khanh, and Jinli Cao. "K-graphs: selecting top-k data sources for
XML keyword queries." In Database and Expert Systems Applications, pp.
425-439. Springer Berlin Heidelberg, 2011
[8]Si, Xujie, Airu Yin, Xiaocheng Huang, Xiaojie Yuan, Xiaoguang Liu, and
Gang Wang. "Parallel optimization of queries in XML dataset using GPU." In
Parallel Architectures, Algorithms and Programming (PAAP), 2011 Fourth
International Symposium on, pp. 190-194. IEEE, 2011.
[9]Weiner, Andreas M., and Theo Hrder. "An integrative approach to query
optimization in native XML database management systems." In Proceedings
of the Fourteenth International Database Engineering & Applications
Symposium, pp. 64-74. ACM, 2010.
[10]Wu, Xiaoying, and Dimitri Theodoratos. "A survey on XML streaming
evaluation techniques." The VLDB JournalThe International Journal on Very
Large Data Bases 22, no. 2 (2013): 177-202.
[11]Ding, Chen. "XML Query optimization." PhD diss., 2009.
[12] A. Bertolino, et al., "Automatic Test Data Generation for XML
SchemabasedPartition Testing," in Automation of Software Test , 2007. AST
'07.Second International Workshop, Minneapolis, MN 2007.
[13] K. Kido, et al., "Processing XPath Queries in PC-Clusters Using XMLData
Partitioning," in Data Engineering Workshops, 2006. Proceedings.22nd
International Conference, Atlanta, GA, USA, 2006, pp. x114 - x114.
[14] Oracle9i Database Performance Tuning Guide and Reference Release
2(9.2) Using SQL Trace and TKPROF 2002, A96533-02, Oracle, viewed
44
7October 2010,
<http://download.oracle.com/docs/cd/B10500_01/server.920/a96533/sqltrac
e.htm#8344>
[15] Memory Configuration and Use 2008, B28274-02, Oracle, viewed
15September2010,
<http://download.oracle.com/docs/cd/B28359_01/server.111/b28274/memory
.htm >
[16] N. Onose et al., "Rewriting Nested XML Queries Using Nested Views," in
Proceedings of the 2006 ACM SIGMOD International conference on
Management of Data, Chicago, IL, USA, 2006, pp. 443 454.
[17] B. Stantic et al., "Handling of Current Time in Native XML Databases," in
Proceedings of the 17th Australasian Database Conference -Volume 49,
Hobart, Australia, 2006, pp. 175 182.
[18]F. Liu, C. T. Yu, W. Meng, and A. Chowdhury, Effective keyword search in
relational databases, in SIGMOD Conference, 2006, pp. 563574.
[19] V. Hristidis, N. Koudas, Y. Papakonstantinou, and D. Srivastava, Keyword
proximity search in xml trees, IEEE Trans. Knowl. Data Eng., vol. 18, no. 4,
pp. 525539, 2006.
[20] Y. Xu and Y. Papakonstantinou, Efficient LCA based keyword search
inxml data, in EDBT, 2008, pp. 535546.
[21] Z. Liu and Y. Chen, Identifying meaningful return information for XML
keyword search, in SIGMOD Conference, 2007, pp. 329340.
[22] C. Sun, C. Y. Chan, and A. K. Goenka, Multiway SLCA-based keyword
search in xml data, in WWW, 2007, pp. 10431052.
[23] Z. Liu and Y. Chen, Reasoning and identifying relevant matches for xml
keyword search, PVLDB, vol. 1, no. 1, pp. 921932, 2008.
[24] S. Amer-Yahia and M. Lalmas, Xml search: languages, index and
scoring, SIGMOD Record, vol. 35, no. 4, pp. 1623, 2006.
[25] Y. Luo, X. Lin, W. Wang, and X. Zhou, Spark: top-k keyword queryin
relational databases, in SIGMOD Conference, 2007, pp. 115126.
[26] N. Mamoulis, K. H. Cheng, M. L. Yiu, and D. W. Cheung, Efficient
aggregation of ranked inputs, in ICDE, 2006, p. 72.
45
[27] D. Xin, J. Han, and K. C.-C. Chang, Progressive and selective merge:
Computing top-k with ad-hoc ranking functions, in SIGMOD
Conference,2007, pp. 103114.
[28] G. Das, D. Gunopulos, N. Koudas, and D. Tsirogiannis, Answeringtop-k
queries using views, in VLDB, 2006, pp. 451462.
[29] N. Bansal, S. Guha, and N. Koudas, Ad-hoc aggregations of ranked lists
in the presence of hierarchies, in SIGMOD Conference, 2008, pp.6778.

Optimizing Keyword Queries in XML Tree Structure

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Optimizing Keyword Queries in XML Tree Structure

Uploaded by

Copyright:

Available Formats

Running Head: Optimizing Keyword Queries in XML tree structures

Optimizing Keyword Queries in XML Tree Structure

Optimizing Keyword Queries in XML Tree Structures

Optimizing Keyword Queries in XML Tree Structures

Optimizing Keyword Queries in XML Tree Structures

Figure 1: Probabilistic XML document [18].....................................................10

Optimizing Keyword Queries in XML Tree Structures

Optimizing Keyword Queries in XML Tree Structures

information including data mining [6]. A variety of approaches have been

Optimizing Keyword Queries in XML Tree Structures

Smallest (05767875) Lowest Common Ancestor (SLCA) is a model keyword

Optimizing Keyword Queries in XML Tree Structures

Consequently, keyword query results occur within a under schema

2.0 OVERVIEW OF RELATED WORKS

Optimizing Keyword Queries in XML Tree Structures

introductions of products of the Cartesian. Graefe and DeWitt showcased the

Optimizing Keyword Queries in XML Tree Structures

This study explores the retrieval of queries through a combination of

Optimizing Keyword Queries in XML Tree Structures

framework for cost-based optimization and a full-fledged DBMS does not

Keyword Query in Ordinary XML Documents

Optimizing Keyword Queries in XML Tree Structures

Optimizing Keyword Queries in XML Tree Structures

Optimizing Keyword Queries in XML Tree Structures

Optimizing Keyword Queries in XML Tree Structures

various conditions or in specialized cases only. Publication time represents

Optimizing Keyword Queries in XML Tree Structures

using the tree structure in Figure 2, prefix label path:

Optimizing Keyword Queries in XML Tree Structures

TABLE 1: THE JOIN PROBABILITY OF THE TWO SPECIFIC NODES (dblp/

TABLE 2: THE JOIN PROBABILITY OF THE TWO SPECIFIC NODES (dblp/

As shown in the Table 2, the mutual information of nodes

Optimizing Keyword Queries in XML Tree Structures

This therefore indicates that there is variance in the type of mutual

Optimizing Keyword Queries in XML Tree Structures

= log(1/5) = log 5 = 0.70

Dominance Lowest Common Ancestor (DLCA)

Optimizing Keyword Queries in XML Tree Structures

node in the candidate series which according to this research is encoded as a

Optimizing Keyword Queries in XML Tree Structures

In this particular case scenario, ID2LP(ni) is an important function which

Optimizing Keyword Queries in XML Tree Structures

MEASUREMENT OF THE RELATIONSHIP BETWEEN NODES IN A

Optimizing Keyword Queries in XML Tree Structures

Mutual Information Concepts

Conditional Entropy of a particular random variable say y provided a second

In this particular type of equation, p (vy,vx) refers to joint probability of (y=vy)

In this particular scenario, p (vx,vy) refers to the joint probability of the

Optimizing Keyword Queries in XML Tree Structures

Optimizing Keyword Queries in XML Tree Structures

4.0 ANSWERS RETRIEVED FROM TOP-K

Optimizing Keyword Queries in XML Tree Structures

This particular example gives an assurance that candidate S dominates

Optimizing Keyword Queries in XML Tree Structures

5.0 ALGORITHMS USED TO RETRIEVE TOP-K RESULTS

Optimizing Keyword Queries in XML Tree Structures

Nave Algorithm for Selection of Top-K Answers

Optimizing Keyword Queries in XML Tree Structures

Optimizing Keyword Queries in XML Tree Structures

TABLE 4: SEARCH SPACES LA AND LB USED TO CALCULATE scoredd(Si) AND

TABLE 5: CANDIDATES LIST SORTED IN DESCENDING ORDER OF F() VALUES

TABLE 6: DOMINANCE CHECKS COUNT USED FOR CALCULATION OF THE

Figure 4: List of candidates sorted in descending order using their M()