ABSTRACT
XML (Extensible Markup Language) remains the most popular and frequently used format for representing and exchanging data on the World Wide Web. Its applications are wide, spanning many data types, which may be unstructured, heterogeneous, semi-structured, or structured. XML has progressively grown in functionality through research and invention, to the point of supporting data-streaming applications. These developments have attracted considerable attention from experienced users of the web and have made the efficient processing and querying of XML streams a central concern.
This study focuses on answering queries through a combination of structural constraints and keywords, an essential executable function in XML data-management systems. Such queries are expected to yield best-case answers as effectively and efficiently as traditional keyword search while accounting for the additional constraints involved. The study defines the new problem of top-k keyword search over probabilistic XML data, with the aim of retrieving the k SLCA results that have the highest existence probabilities. Finally, the study reviews various other forms of keyword search and compares them through an analysis of the algorithms that have been used.
TABLE OF CONTENTS
1.0 INTRODUCTION
  Problem Definition
  Proposal
2.0 OVERVIEW OF RELATED WORKS
  Cost-Based Query Optimization in RDBMSs
  Query Optimization Frameworks
  XML Query Optimization
  Keyword Query in Ordinary XML Documents
  Probabilistic XML
  XQuery Streaming Optimization
  Querying XML Streams
  XML Temporal Model
  Time Intervals and Model
  Node Relationship Evaluation
  Probability
  Joint Probability
  Node Relationship
  Dominance Lowest Common Ancestor (DLCA)
  Dominance Relationship
  Dominance
3.0 MEASUREMENT OF THE RELATIONSHIP BETWEEN NODES IN A DATA TREE
  Mutual Information Concepts
  Mutual Information and Entropy
  Mutual Information
4.0 ANSWERS RETRIEVED FROM TOP-K
  Dominating Score
  Dominated Score
  Dominance Score
5.0 ALGORITHMS USED TO RETRIEVE TOP-K RESULTS
  Naïve Algorithm for Selection of Top-K Answers
  Top-K Dominated Algorithm (TKDD)
  Top-K Dominating Algorithm (TKDG)
6.0 EXPERIMENTAL EVALUATION
  Experimental Setup
  Query Sets
  Search Quality
  Efficiency and Scalability of Top-K Algorithms
7.0 CONCLUSIONS
REFERENCES
LIST OF FIGURES
LIST OF TABLES
Table 1: The joint probability of the two specific nodes at context node /paper
Table 2: The joint probability of the two specific nodes at context node /proceeding
Table 3: Two-dimensional candidate data set
Table 4: Search spaces LA and LB derived from a list L of candidates
Table 5: Candidates list sorted in descending order of f() values
Table 6: Dominance checks count used in calculating dominated candidate scores
Table 7: Precision and recall of queries on mondial data
Table 8: Precision and recall of queries on auction data
Table 9: Precision and recall of queries on dblp data
Table 10: Comparisons on ranking effectiveness of the algorithms
1.0 INTRODUCTION
XML (Extensible Markup Language) has over the years become a de facto standard for the exchange and representation of data, resulting in a proliferation of XML documents spread across the internet. In the past, various query languages were used to retrieve XML documents and data, including XQuery, XPath, and twig pattern queries. These languages required users to be versed in the specific query language and the relevant data schemas in order to execute XML queries efficiently [5]. This limited their use to advanced users, since the query languages and data schemas were complex concepts to master; data search through the XQuery/XPath languages was therefore a significant limiting factor.
The use of keywords to search for documents has been widely accepted as a convenient way of retrieving resources from the remote servers on the internet that hold the relevant data. The majority of search engines, such as Google, Bing, and Yahoo, have adopted these technologies to facilitate data mining and data warehousing. The adoption of keywords for querying databases has attracted considerable research from the database and information retrieval (IR) communities [5][2][3]. Keyword search is an efficient way of retrieving documents because it requires no learning of query concepts. It is an advance over the traditional approach, which required mastering the IP (Internet Protocol) addresses of documents or information content and typing them into the URL bar. This later evolved so that IP addresses could be attached to web links, and from this, search engines were developed with more interactive and responsive algorithms able to handle large volumes of information.
Problem Definition
Current keyword searches in XML can be divided into tree-based and graph-based searches, both largely predicated on structural document features. However, these structure-based approaches do not fully exploit the hidden semantics within XML documents, leading to problems in processing certain classes of keyword queries. The growing popularity of XML has intensified the need for an accessible and precise XML query interface based on natural language, and for search procedures that exploit XML structure so that ordinary users can pose simple queries against XML databases. Conventional methodologies, however, process queries based on ad hoc and intuitive heuristics, which frequently return false-positive and unranked answers.
Proposal
This paper systematically explores XML structure-based answers and user expectations in order to identify suitable semantics for XML keyword search. It further proposes a semantics-based methodology for evaluating XML keyword queries, principally through data-centric coherency ranking, which is grounded in the design of the domain and database and predicated on data-dependence and mutual-information models.
intentionally avoiding the false-negative and false-positive SLCA and LCA results [12][25]. The researchers also proposed Indexed Stack, an efficient algorithm for finding answers based on the semantics of Exclusive LCA. In addition, other related works process keyword search by integrating keywords into structured queries. XML-QL, a query language, keeps the keywords separate from the query structure. Research has also introduced a method to embed keywords into XQuery for processing keyword search [9].
Probabilistic XML
Probabilistic XML has recently been a well-studied subject, and most of the proposed models have been accompanied by evaluations of structured queries. Nierman et al. first introduced ProTDB, with the probabilistic node types MUX (mutually exclusive) and IND (independent). They modeled probabilistic XML in the form of acyclic graphs, which support arbitrary distributions over sets of children. Subsequent research adopted a probabilistic tree approach for data integration, whose possibility and probability nodes are similar to IND and MUX respectively [9][21][29].
A p-document is a probabilistic XML document that specifies a probability distribution over a space of deterministic XML documents. Each deterministic document in this space is referred to as a possible world. A p-document is represented as a labeled tree with distributional and ordinary nodes [11][14]. Ordinary nodes are regular XML nodes that may appear in deterministic documents, whereas distributional nodes are used only in defining the probabilistic process that generates deterministic documents and do not appear in those documents [5][12][24]. Adopting PrXML{ind, mux} as the probabilistic XML model, two types of distributional nodes appear in a p-document: MUX and IND [10][13].
Example 1:
Consider Figure 1(a), showing the p-document T. Tag names denote ordinary nodes, for instance C1, C2, B1, and B2. Among the distributional nodes, MUX is shown as a rounded rectangular box while IND is depicted as a circle. The node IND2 has two children, B2 and C1, with existence probabilities 0.6 and 0.5 respectively [5]. The probability that neither B2 nor C1 appears is therefore
(1 − 0.6) × (1 − 0.5) = 0.2.
The node MUX2 has three children, IND3, E2, and D1, with existence probabilities 0.5, 0.3, and 0.1 respectively. The probability that none of them appears is therefore
1 − 0.5 − 0.3 − 0.1 = 0.1.
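As a small sanity check of Example 1, the two absence probabilities can be computed directly from the IND and MUX rules. The node names and probabilities below come from the example; the function names are our own.

```python
# Absence probability at an IND node: children exist independently,
# so no child appears with probability prod(1 - p_i).
def ind_absence(child_probs):
    result = 1.0
    for p in child_probs:
        result *= (1.0 - p)
    return result

# Absence probability at a MUX node: at most one child appears,
# so no child appears with probability 1 - sum(p_i).
def mux_absence(child_probs):
    return 1.0 - sum(child_probs)

p_ind = ind_absence([0.6, 0.5])       # (1 - 0.6) * (1 - 0.5) = 0.2
p_mux = mux_absence([0.5, 0.3, 0.1])  # 1 - 0.5 - 0.3 - 0.1 = 0.1
```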
Given a p-document tree T, all possible deterministic documents can be generated as shown below. Traversing T top-down, two cases must be handled independently:
(1) For an IND node with m child nodes, 2^m copies of T are generated and the IND node is deleted; in each copy the m child nodes are replaced by one distinct subset of them, and the ordinary parent node is connected to each child node in that subset. The existence probability of each copy is the product of the existence probabilities of the child nodes in the subset and the absence probabilities (one minus the existence probability) of the child nodes not in the subset [5].
(2) For a MUX node with m child nodes, m + 1 copies of T are generated and the MUX node is deleted, replacing the m child nodes with either no child or one distinct child node per copy; the chosen child of MUX is connected to the ordinary parent node. The existence probability of each copy is the existence probability of the chosen child node, or the absence probability when no child node appears. Each generated copy of T is traversed top-down in the same way until all distributional nodes have been deleted [5].
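The two generation rules can be sketched for a single distributional node. This is a simplified illustration of rules (1) and (2) rather than the paper's full top-down copying procedure; the child names and probabilities are those of Example 1.

```python
from itertools import combinations

def ind_worlds(children):
    """Rule (1): an IND node with m children yields 2^m worlds, one per
    subset of children; a world's probability is the product of the kept
    children's existence probabilities and the dropped children's
    absence probabilities."""
    names = [name for name, _ in children]
    for r in range(len(children) + 1):
        for kept in combinations(names, r):
            prob = 1.0
            for name, p in children:
                prob *= p if name in kept else (1.0 - p)
            yield set(kept), prob

def mux_worlds(children):
    """Rule (2): a MUX node with m children yields m + 1 worlds: one per
    single child, plus the world where no child appears."""
    for name, p in children:
        yield {name}, p
    yield set(), 1.0 - sum(p for _, p in children)

# IND2 from Example 1: children B2 (0.6) and C1 (0.5) -> 4 worlds
ind2 = list(ind_worlds([('B2', 0.6), ('C1', 0.5)]))
# MUX2 from Example 1: children IND3 (0.5), E2 (0.3), D1 (0.1) -> 4 worlds
mux2 = list(mux_worlds([('IND3', 0.5), ('E2', 0.3), ('D1', 0.1)]))
# In both cases the world probabilities sum to 1.
```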
Other researchers have proposed a fuzzy trees model in which nodes are associated with conjunctions of probabilistic event variables, together with a full complexity analysis of queries and updates on fuzzy trees. They also proposed efficient algorithms for solving the constraint satisfaction involved [6][18][24][29]. The sampling problem and query evaluation under a set of constraints can be well defined so as to yield the expected query results efficiently. Other publications summarized and extended the previously proposed probabilistic XML models, discussing query tractability and the expressiveness of different models with respect to MUX and IND [5][13][15][18][27].
Various studies considered the evaluation problem of twig queries over probabilistic XML, which may generate partial and incomplete answers with respect to a user-specified probability threshold. Researchers have also addressed the problem of ranking the top-k probabilities of the answers to a twig query. In summary, the work cited above focuses on probabilistic XML data models under structured XML queries, for instance twig queries [1][7][16][23]. Our research differs in that the keyword search problem over probabilistic XML data is critically reviewed and analyzed [9].
XQuery Streaming Optimization
Querying XML streams
Several streaming algorithms focus on the querying problem and the filtering procedure. Many of these algorithms center on tree-pattern queries (TPQs). TPQs correspond to XPath queries involving mainly descendant and child axes [14][26]. TPQ streaming algorithms can be extended to support XPath queries with ordered axes (preceding, preceding-sibling, following, and following-sibling) [10]. Processing techniques for ordered axes are therefore introduced. Streaming algorithms broadly fall into three categories: the array-based, automaton-based, and stack-based approaches.
XML temporal model
Previous studies on time-based XML models have identified several advantages and disadvantages. The bitemporal approach includes both valid time and transaction time as timestamp attributes [22][26]. Normative texts always comprise four time intervals. Normative texts with temporal values in an XML database represent new interval attributes, for instance efficacy time and publication time. This approach to XML tree partitioning guarantees the distribution of data into partitions of equal size, considering both the query-processing load and the data-storage cost [10].
Time Intervals and Model
The intervals are publication time, efficacy time, transaction time, and validity time. Transaction time refers to the time a transaction is reflected in the database, an important factor for all transactions that occur in time-referenced databases. Valid time refers to the interval indicating when the data is valid for general use, outside of which it may be invalid and unusable [8][19][20]. Efficacy time is when data is used under
For instance, at node 4 the prefix label path in the data tree is dblp/proceeding/paper/author. A specific prefix label path may have many occurrences in the XML data tree structure, and these occurrences are referred to as node instances [2][7][21][25]. All instances of a specific node therefore share the same prefix label path. Every instance has a unique value, which constitutes the specific set of keywords contained directly in that instance.
Here, p(vv) and p(vu) are the respective probabilities of (v = vv) and (u = vu), and p(vu, vv|c) is the joint probability of v = vv and u = vu at the context node c. A higher mutual information value between two nodes indicates a stronger relationship between them [16][20]. For instance, the mutual information of nodes dblp/proceeding/paper/author and dblp/proceeding/paper/title at context node dblp/proceeding is given by
I(dblp/proceeding/paper/author; dblp/proceeding/paper/title | dblp/proceeding/paper)
Here, H(u) and H(v) are the entropies of nodes u and v respectively, calculated in the same way as the entropies of random variables [16]. A high value of rel(u; v|c) means the relationship between nodes u and v is strong at the context node c. For instance, the entropies of nodes dblp/proceeding/paper/title and dblp/proceeding/paper/author can be obtained as:
H(dblp/proceeding/paper/title)
= −[(1/5) log(1/5) + (1/5) log(1/5) + (1/5) log(1/5) + (1/5) log(1/5) + (1/5) log(1/5)]
= −log(1/5) = log 5 ≈ 0.70
H(dblp/proceeding/paper/author)
= −[(1/5) log(1/5) + (1/5) log(1/5) + (1/5) log(1/5) + (1/5) log(1/5) + (1/5) log(1/5)]
= −log(1/5) = log 5 ≈ 0.70
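The entropy computation above can be reproduced directly; note that the text's value of log 5 ≈ 0.70 implies base-10 logarithms.

```python
import math

def entropy(probs, base=10):
    """H = -sum(p * log p); base-10 logs reproduce the text's log 5 ≈ 0.70."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Five author values, each occurring in one of five instances:
h = entropy([1 / 5] * 5)
print(round(h, 2))  # 0.7
```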
The researchers chose to encode node identifiers with Dewey codes, since Dewey codes usefully represent the hierarchical relationships between the nodes of a tree, an important property of the tree structure. The corresponding label path of a node can be derived from its Dewey code. For instance, in the sample data tree T2 in Figure 3, every node is identified by a Dewey code; for the node identified by [0.1.0.0], the corresponding label path is n1/n3/n4/n5. We therefore define ID2LP(id), which takes a Dewey code id as input and returns the corresponding label path [13][24]. The keywords of a particular search may well have many occurrences in the candidate subtree S(nlca, {n1, . . . , nm}). Every keyword ki yields a set Li = {ni | val(ni) contains the keyword ki} (1 ≤ i ≤ m).
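The ID2LP mapping can be sketched as follows. The tree literal below is an illustrative stand-in for the paper's Figure 3 (not reproduced here), built so that the node with Dewey code [0.1.0.0] carries the label path n1/n3/n4/n5.

```python
# A minimal stand-in tree: each node is a dict with a label and a list
# of children, substituting for the paper's Figure 3.
tree = {
    'label': 'n1',
    'children': [
        {'label': 'n2', 'children': []},
        {'label': 'n3', 'children': [
            {'label': 'n4', 'children': [
                {'label': 'n5', 'children': []},
            ]},
        ]},
    ],
}

def id2lp(dewey, root):
    """Translate a Dewey code such as '0.1.0.0' into a label path: the
    leading component identifies the root, and each later component
    selects a child by position."""
    parts = [int(x) for x in dewey.split('.')]
    node, labels = root, [root['label']]
    for step in parts[1:]:
        node = node['children'][step]
        labels.append(node['label'])
    return '/'.join(labels)

print(id2lp('0.1.0.0', tree))  # n1/n3/n4/n5
```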
The relationship between the various keywords that are produced in the
specific search tree is given as:
Dominance
Let S and S′ be two candidates of the XML search query Q over a given database T. S dominates S′, represented as S > S′, if and only if:
∃j (1 ≤ j ≤ d): DS′[j] < DS[j] and ∀i (1 ≤ i ≤ d): DS′[i] ≤ DS[i]
Here d is the length of the keyword-relationship vectors of S and S′ (d = |DS| = |DS′| = C(q, 2), the number of keyword pairs), and DS[i] is the i-th element of the vector DS. Candidate S dominates S′ in the relationship [4][11][24].
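The dominance test can be expressed as a small predicate. The direction of comparison (larger relationship values are better) is an assumption, since the inequality signs in the extracted definition are garbled.

```python
def dominates(ds, ds_prime):
    """S dominates S' when S is at least as good in every dimension of
    the relationship vector and strictly better in at least one.
    'Better' is taken here as a larger value (an assumption)."""
    assert len(ds) == len(ds_prime)
    no_worse = all(a >= b for a, b in zip(ds, ds_prime))
    better = any(a > b for a, b in zip(ds, ds_prime))
    return no_worse and better

# Two candidates in the style of Table 3:
print(dominates((0.8, 0.8), (0.4, 0.4)))    # True
print(dominates((0.15, 0.5), (0.1, 0.95)))  # False: neither dominates
```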
This concerns the interpretation of mutual information: the information provided by y about x is the reduction in the uncertainty of x given the knowledge possessed by y, and similarly for the information provided by x about the random variable y. The value of the mutual information is directly proportional to the information revealed by each variable about the other [16][20][24][29].
Property 2:
I(x; y) = I(y; x)
Mutual information is symmetric: the information conveyed by x about y is the same as the information y conveys about x [16][20][24].
Property 3:
I(x; y) ≥ 0
This gives the lower bound of mutual information. Given I(x; y) = 0, we get p(vx, vy) = p(vx) p(vy) for all possible values of x and y. This means that the variables x and y are independent, so knowing the value of x provides no clue about the probable or exact value of y; their mutual information is therefore zero [5][16][20][29].
Property 4:
I(x; x) = H(x)
The mutual information of a variable x with itself is the entropy of x. For this reason, entropy is also referred to as self-information [16][24].
Property 5:
I(x; y) ≤ H(x) and I(x; y) ≤ H(y)
The mutual information between two variables is bounded by the minimum of their entropies [16][20][24].
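Properties 2–5 can be checked numerically from a joint distribution. The two toy distributions below (independent variables, and a variable paired with itself) are our own illustrations.

```python
import math

def H(dist):
    """Entropy (in bits) of a distribution given as {value: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def mutual_information(joint):
    """I(x; y) = sum over (x, y) of p(x, y) * log2(p(x, y) / (p(x) p(y))),
    with the joint distribution given as {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Independent variables: I(x; y) = 0, the lower bound of Property 3.
indep = {(x, y): 0.25 for x in 'ab' for y in 'cd'}
# A variable paired with itself: I(x; x) = H(x) = 1 bit (Property 4).
copy = {('a', 'a'): 0.5, ('b', 'b'): 0.5}
print(mutual_information(indep), mutual_information(copy))  # 0.0 1.0
```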
The naïve algorithm for identifying the desired top-k results according to their dominated scores (similarly, dominance and dominating scores) is illustrated in Algorithm 1. The algorithm iterates through every candidate in the candidate set and calculates its score by performing pairwise dominance checks between that candidate and all other candidates in the set (lines 2–6) [17][21][23]. The result set is then updated depending on the comparison between the score of the current k-th candidate and that of the new candidate in the current top-k results (lines 7–13) [17][21][24][28].
The major drawback of this algorithm is its very high computational cost: regardless of the value of k, it must iterate through every candidate in the candidate set and calculate its score by performing pairwise dominance checks against all other candidates in the set [16][20][24]. No matter what the value of k is, the algorithm exhaustively performs all pairwise dominance tests across all candidates.
Table 3: Two-dimensional candidate data set

Candidate   D1      D2
S1          –       0.9
S2          0.15    0.5
S3          0.1     0.95
S4          0.5     0.4
S5          0.8     0.8
S6          0.9     0.4
S7          0.4     0.4
S8          0.3     0.2
S9          0.7     0.6
S10         0.3     0.3
For instance, given the set of candidates in Table 3, identifying the top-3 results requires calculating the score of each candidate Si (1 ≤ i ≤ 10) by iterating over the 9 other candidates and conducting pairwise dominance checks [24][28]. This implies 10 × 9 = 90 pairwise dominance checks. In general, calculating the score of a particular candidate in a set of n candidates requires pairwise dominance checks between that candidate and the (n − 1) other candidates in the set.
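The exhaustive scoring that Algorithm 1 performs can be sketched as follows, using the Table 3 values (S1 is omitted because its first value did not survive extraction). The dominance direction, with larger values taken as better, is an assumption.

```python
def dom(a, b):
    """Dominance with larger relationship values taken as better (an
    assumption; the extracted definition's inequality signs are garbled)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def dominated_score(i, candidates):
    """Number of candidates that dominate candidate i."""
    return sum(1 for j, c in enumerate(candidates)
               if j != i and dom(c, candidates[i]))

def naive_top_k(candidates, k):
    """Algorithm-1-style exhaustive scoring: n*(n-1) pairwise dominance
    checks in total, then keep the k candidates dominated by the fewest
    others."""
    order = sorted(range(len(candidates)),
                   key=lambda i: dominated_score(i, candidates))
    return [candidates[i] for i in order[:k]]

# Table 3 values for S2..S10 (S1 omitted: its first value was lost)
cands = [(0.15, 0.5), (0.1, 0.95), (0.5, 0.4), (0.8, 0.8), (0.9, 0.4),
         (0.4, 0.4), (0.3, 0.2), (0.7, 0.6), (0.3, 0.3)]
top3 = naive_top_k(cands, 3)
# top3 holds the three candidates dominated by no other candidate:
# (0.1, 0.95), (0.8, 0.8) and (0.9, 0.4)
```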
Top-K Dominated Algorithm (TKDD)
The chief aim of TKDD is algorithm to each candidate is to efficiently find the
number of other candidates which dominate it, while avoiding exhaustive
pair wise comparisons between the candidates[2][8][24] . After the retrieval
of k results, the score of the k-th result is used as a maximum threshold and
therefore pruning occurs for the candidates whose overall dominated scores
extend the threshold [24]. To add to that fact, safe termination of the
algorithm is guaranteed if the scores of all the remaining candidates exceed
the provided threshold. More specifically analyzed, the TKDD takes course
through the following four steps:
(i) Initialization (line 1): the result set R and minValue are initialized;
(ii) Termination condition (lines 4–6): if the M() value of the current candidate S is below the minimum value of the current k-th candidate in R, the algorithm terminates and the result set R is returned;
(iii) Dominance checks: pairwise dominance checks are performed between S and every other candidate S′ in the search space of S. The dominated score of S is increased by 1 every time another candidate dominates it [17][22].
(iv) Result updates (lines 11–17): if k results already exist and the dominated score of the k-th candidate is larger than the current candidate's score, the k-th candidate is ejected and the current candidate is inserted into R; otherwise, if fewer than k results exist in R, the current candidate is inserted into R. Finally, once the size of R reaches k, the threshold minValue is updated (lines 18–21) [17][22][24].
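The threshold (minValue) pruning of TKDD can be sketched as follows. This is a simplification that keeps the sorted scan and the pruning test but omits the per-candidate search-space restriction, and it assumes, as above, that larger relationship values are better.

```python
def dom(a, b):
    """Higher-is-better dominance, as assumed earlier in this section."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def tkdd(candidates, k, f):
    """Simplified TKDD sketch: scan candidates in descending f() order,
    abandon a candidate as soon as its running dominated score exceeds
    the current k-th score (minValue), and keep the k candidates
    dominated by the fewest others."""
    L = sorted(candidates, key=f, reverse=True)
    R = []  # (dominated_score, candidate), kept sorted ascending
    for s in L:
        min_value = R[-1][0] if len(R) == k else float('inf')
        score = 0
        for other in L:
            if other is not s and dom(other, s):
                score += 1
                if score > min_value:  # cannot enter the top-k: prune
                    break
        if score < min_value:
            if len(R) == k:
                R.pop()
            R.append((score, s))
            R.sort()
    return [s for _, s in R]

# Table 3 values for S2..S10 (S1 omitted: its first value was lost)
cands = [(0.15, 0.5), (0.1, 0.95), (0.5, 0.4), (0.8, 0.8), (0.9, 0.4),
         (0.4, 0.4), (0.3, 0.2), (0.7, 0.6), (0.3, 0.3)]
top3 = tkdd(cands, 3, sum)
```

With pruning, most low-ranked candidates are abandoned after a single dominance check, yet the same top-3 set is returned as by the exhaustive scan.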
Top-K Dominating Algorithm (TKDG)
The chief aim of the TKDG algorithm is to retrieve the top-k results that dominate the largest number of other candidates. This poses a more challenging task than TKDD, because of the larger search space associated with TKDG [17][24]. To illustrate:
Let pos(S) be the position of the currently examined candidate S in the list L of candidates sorted in descending order of F() values. To calculate scoredd(S), TKDD performs pos(S) dominance tests, on the candidates whose F() values are ≥ F(S). To calculate scoredg(S), however, TKDG performs (|L| − pos(S)) tests, on the candidates whose F() values are ≤ F(S). As the top relevant results are usually located near the top of L (i.e., pos(S) ≪ |L|), it follows that pos(S) ≪ (|L| − pos(S)). Thus, the search space of the TKDG algorithm is significantly larger than that of TKDD [17][21][24].
Let S be the currently examined candidate. S can dominate at most (|L| − pos(S)) candidates. Therefore, when (|L| − pos(S)) ≤ scoredg(Sk), where Sk is the k-th result in R, the algorithm terminates. Specifically, the TKDG algorithm (Algorithm 3) proceeds through the following steps:
(i) Initialization (line 1): the result set is initialized;
(ii) Stopping condition (lines 3–8): if k candidates already exist in the result set and the dominating score of the k-th candidate is equal to or greater than (|L| − pos(S)), the algorithm terminates [17][24];
(iii) Dominance tests (lines 9–14): the pairwise dominance test between each other candidate and the currently examined
Query Sets
The researchers asked a group of learners to submit fifty different keyword queries to search and evaluate on each data set. Every query contained a specific set of search keywords, and a brief description of each query was also necessary in order to understand and identify its key intention [3][5][6][11][17][24][28]. The researchers also observed that searching a specific domain, such as the three main data sets they experimented on, was not effective when the keyword queries were ambiguous, which made it hard for users to express their search intention. Because of this, it is sometimes difficult to obtain the relevant query results and outcomes, which the researchers needed in order to analyze the performance of their approach and of the other available approaches [3][5][24][28].
Search Quality
The researchers compared the quality of the DLCA approach with the other existing approaches: ELCA, CVLCA, XReal, MLCA, SLCA, and XSearch. The quality of these approaches was measured with three metrics popular in information retrieval: precision (P), recall (R), and the F-measure [6][11][17][28]. To compute precision and recall, the researchers manually reformulated the keyword queries into schema-aware queries based on the data set schemas and the keyword query descriptions. They then used the results of the transformed queries as the baseline against which the precision and recall of the keyword queries were computed, as follows: given a keyword query Q and its transformed XQuery counterpart [2][18][27], the result set that a specific algorithm produces for Q is recorded as the retrieved results [2][11][24][28]. The precision and recall of this algorithm are then defined as follows.
Precision is the fraction of retrieved results that are relevant to the search:
P = |relevant results ∩ retrieved results| / |retrieved results|
Recall is the fraction of the relevant results that are successfully retrieved by the search system:
R = |relevant results ∩ retrieved results| / |relevant results|
The F-measure, which captures the trade-off between recall and precision, is computed as:
F-measure = (1 + B²)PR / (B²P + R)
When B = 1, recall and precision are weighted equally; when B < 1, precision is emphasized; and when B > 1, recall is emphasized.
From these formulas it is clear that the relevant results of each keyword query need to be determined before the evaluation metrics can be calculated. To acquire the relevant results of the tested queries, the researchers manually formed the corresponding schema-aware XQuery with the assistance of users [6][17]. The resulting query answers were then used as the basis for evaluating the performance of the researchers' approach and the other available approaches.
The researchers conducted experiments with a set of 50 keyword queries using the various approaches, and measured the recall and precision of each approach by averaging the recall and precision values over the tested queries.
The precision and recall of the compared approaches on the three data sets are shown below.
Table 7: Precision and recall of queries on mondial data

            SLCA    XSearch  ELCA    CVLCA   MLCA    XREAL   DLCA
Precision   0.503   0.712    0.635   0.670   0.712   0.721   0.922
Recall      1.000   0.624    0.943   0.910   0.903   0.647   0.939
Table 8: Precision and recall of queries on auction data

            SLCA    XSearch  ELCA    CVLCA   MLCA    XREAL   DLCA
Precision   0.478   0.706    0.623   0.640   0.699   0.703   0.901
Recall      1.000   0.650    0.931   0.920   0.907   0.650   0.931
Table 9: Precision and recall of queries on dblp data

            SLCA    XSearch  ELCA    CVLCA   MLCA    XREAL   DLCA
Precision   0.523   0.733    0.640   0.688   0.720   0.733   0.934
Recall      1.000   0.647    0.941   0.911   0.923   0.647   0.941
Table 10: Comparisons on ranking effectiveness of the algorithms

            RPREC   bpref   RRANK   P@1     P@5     P@10    MAP
TKDD        0.870   0.830   0.750   0.860   0.890   0.870   0.810
TKDG        0.850   0.820   0.790   0.870   0.920   0.890   0.840
TKD-0.25    0.840   0.800   0.790   0.860   0.910   0.910   0.860
TKD-0.50    0.860   0.820   0.760   0.860   0.900   0.870   0.820
TKD-0.75    0.870   0.850   0.730   0.880   0.880   0.850   0.800
XRANK       0.670   0.750   0.610   0.710   0.690   0.680   0.650
XSEARCH     0.700   0.770   0.630   0.680   0.730   0.680   0.660
38
All of the ranking algorithms identify the top ten results at a precision between 80% and 85%. The mean average precision of the algorithms is 85%, and the researchers could achieve even higher precision by selecting a suitable value of the parameter that balances the dominating and dominated scores [6][11][24][28].
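The rank-based metrics reported in the table (P@k, and the per-query average precision that is averaged over queries to obtain MAP) can be sketched as follows, using a hypothetical ranked list and relevance set:

```python
def precision_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of the top-k ranked results that are relevant."""
    return sum(1 for r in ranked[:k] if r in relevant) / k

def average_precision(ranked: list, relevant: set) -> float:
    """Mean of P@i taken at each rank i where a relevant result appears."""
    hits, total = 0, 0.0
    for i, r in enumerate(ranked, start=1):
        if r in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

ranked = ["a", "b", "c", "d", "e"]   # system output, best first
relevant = {"a", "c", "e"}           # ground-truth relevant results
print(precision_at_k(ranked, relevant, 3))
print(average_precision(ranked, relevant))
```

MAP is then the mean of average_precision over all tested queries.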
Efficiency and Scalability of Top-K Algorithms
The researchers tested ten queries of various lengths on every data set. In the default scenario about five thousand candidates were tested, and the number of returned results was thirty. For queries that produced fewer candidates than required, the researchers replicated the candidates repeatedly until the required number was reached [3][11], and then randomly selected the required number of candidates from the duplicated set. The computation cost of the algorithms is shown in the figure below. It is clear that as the number of candidates increases, the processing time of each algorithm also increases, but with different trends [3][9][14][17]. TKDD is the most effective and efficient method in this case, as it is the least affected by the increase in the number of candidates. TKDD is mainly concerned with the results that are dominated by as few candidates as possible; because such results are usually located at the top of the sorted candidate list, it searches only a small portion of that list. For TKDG the search space is much larger, and delay is therefore expected. On the other hand, the lower performance of the TKD variants is also a result of computing the dominating score, which explains why their processing time rises in the same way as that of TKDG, with only a small overhead for calculating the dominating score [5][9][13][24][28]. From the results in Figure 5, it is clear that the processing time of TKDD is little affected by the increase in the number k of returned results, and it can return from ten to one hundred results from the set of five thousand candidates within a second. The processing time of TKDG is more affected by changes of this parameter, and it takes 2.5 seconds to return the top one hundred results from a set of five thousand candidates.
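The candidate-padding step described above can be sketched as follows; the function name and the use of Python's random module are illustrative, not the researchers' implementation:

```python
import random

def pad_candidates(candidates: list, required: int, seed: int = 0) -> list:
    """Replicate a short candidate list until it reaches the required size,
    then randomly draw exactly `required` candidates from the duplicated pool."""
    pool = list(candidates)
    while len(pool) < required:      # replicate the candidates repeatedly
        pool.extend(candidates)
    rng = random.Random(seed)        # seeded for reproducibility
    return rng.sample(pool, required)

cands = ["c1", "c2", "c3"]
padded = pad_candidates(cands, 8)
print(len(padded))
```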
7.0 CONCLUSIONS
In this thesis the researchers have studied the problem of identifying the most accurate outcomes, namely the top-k most appropriate results for XML keyword queries. Using keywords to search for documents has been widely accepted as a very convenient way of retrieving resources from the remote servers that hold a specific type of data on the Internet. The majority of search engines, such as Google, Bing and Yahoo, have adopted these technologies to efficiently facilitate data mining and data warehousing, and the adoption of keywords for querying databases has attracted considerable research from the database and information retrieval (IR) communities.
XML documents are composed of nested XML elements, from the root element down to the nested sub-elements. XML elements often reference other elements, which are queried as XML values, and the text content is captured using the predicate contains(u, k), which returns true when the element u contains the keyword k. An XML query Q maps an XML database D to a sequence of XML documents that characterizes the query output: if UD is the set of XML databases and S is the set of XML document sequences, then Q: UD → S, and Q(D) is the result of query Q over database D, where the query is specified in an XML query language such as XQuery. For a sequence s, e ∈ s is true when e is in s.
Consider a p-document, a probabilistic document written in XML that specifies a probability distribution over a space of deterministic XML documents. Each deterministic document in this space is referred to as a possible world. A p-document is represented as a labeled tree with distributional and ordinary nodes. Ordinary nodes are regular XML nodes and may appear in the deterministic documents, whereas distributional nodes are used only in the definition of the probabilistic process that generates the deterministic documents, and they do not appear in those documents. In the adopted PrXML{ind, mux} probabilistic XML model, two types of distributional nodes appear in a p-document: IND and MUX.
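As a toy illustration of the possible-worlds semantics, the sketch below enumerates the worlds generated by a single IND node whose children are kept independently with the given probabilities; the element names and probabilities are hypothetical, and MUX nodes (which keep at most one child) are omitted for brevity:

```python
from itertools import product

# Children of one IND distributional node: (element name, keep probability).
children = [("title", 0.9), ("author", 0.6)]

def possible_worlds(children):
    """Yield (kept_elements, probability) for every deterministic outcome."""
    for keep in product([True, False], repeat=len(children)):
        prob, kept = 1.0, []
        for (name, p), k in zip(children, keep):
            prob *= p if k else (1 - p)
            if k:
                kept.append(name)
        yield kept, prob

worlds = list(possible_worlds(children))
# The probabilities of all possible worlds sum to 1.
assert abs(sum(p for _, p in worlds) - 1.0) < 1e-9
for kept, p in worlds:
    print(kept, round(p, 3))
```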
The researchers have strived to address the three vital requirements for effective keyword search over XML. They have introduced new methods for analyzing the relationships between query keywords in the candidates using the idea of mutual information, and have proposed a new DLCA semantics for keyword queries. They have also proposed a strategy for selecting DLCA results from multiple candidates, together with three ranking methods for selecting the top-k results based on the semantics of skyline queries; several proven properties have been exploited to accelerate the proposed algorithms. Experiments have been conducted to evaluate the approach, and the results show that it performs better than the existing approaches on the tested data sets and evaluation metrics. This is a very efficient way of facilitating document retrieval because it does not require the user to learn any new concepts.
This process is an advancement over the traditional search methods, which required mastering the particular IP (Internet Protocol) addresses of the various documents or information sources and typing them into the URL bar. This later advanced so that IP addresses could be attached to web links, and from this the search engines were developed, with more interactive and responsive algorithms able to handle many kinds of information processing, including data mining. A variety of approaches have been reviewed as alternatives for evaluating keyword queries over XML data. The basic approaches that currently exist use lowest common ancestor (LCA) semantics, as opposed to general graph theory, to identify the hit list for a given keyword query. This approach generates results composed of all candidates, also known as subtrees, containing an instance of the queried keywords. The returned LCA values can be numerous, yet the user may only be interested in a small portion of the whole hit list. It therefore remains an unsolved issue to identify the exact data set that is required by the user of the system. The ideal situation would be for the system to generate exactly the piece of information the user requires, instead of providing a whole set of hits, which also gives the user an extra
job to filter the content until they obtain an exact piece.
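As an illustration of the LCA-based semantics discussed above, the following sketch computes the SLCA nodes of a keyword query over a small hypothetical document: a node is an SLCA if its subtree contains every keyword and no descendant's subtree does.

```python
import xml.etree.ElementTree as ET

def keywords_in_subtree(node, keywords):
    """True if the subtree rooted at `node` contains every keyword."""
    text = " ".join(node.itertext()).lower()
    return all(k in text for k in keywords)

def slca(node, keywords):
    """Return the SLCA nodes for the keyword query under `node`."""
    results = []
    for child in node:
        results.extend(slca(child, keywords))
    if results:                        # a smaller subtree already covers the query
        return results
    if keywords_in_subtree(node, keywords):
        return [node]                  # smallest covering node on this path
    return []

doc = ET.fromstring(
    "<bib><book><title>XML</title><author>Smith</author></book>"
    "<book><title>XML keyword search</title><author>Jones</author></book></bib>"
)
for n in slca(doc, ["xml", "jones"]):
    print(n.tag)   # only the second book covers both keywords
```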
REFERENCES
[1] Alghamdi, Norah Saleh, Wenny Rahayu, and Eric Pardede. "Object-based semantic partitioning for XML twig query optimization." In Advanced Information Networking and Applications (AINA), 2013 IEEE 27th International Conference on, pp. 846-853. IEEE, 2013.
[2] Bača, Radim, and Michal Krátký. "XML query processing: efficiency and optimality." In Proceedings of the 16th International Database Engineering & Applications Symposium, pp. 8-13. ACM, 2012.
[3] Consens, Mariano P., Xin Gu, Yaron Kanza, and Flavio Rizzolo. "Self Managing Top-k (Summary, Keyword) Indexes in XML Retrieval." In Data Engineering Workshop, 2007 IEEE 23rd International Conference on, pp. 245-252. IEEE, 2007.
[4] Georgiadis, Haris, Minas Charalambides, and Vasilis Vassalos. "Efficient physical operators for cost-based XPath execution." In Proceedings of the 13th International Conference on Extending Database Technology, pp. 171-182. ACM, 2010.
[5] Li, Jianxin, Chengfei Liu, Rui Zhou, and Wei Wang. "Top-k keyword search over probabilistic XML data." In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pp. 673-684. IEEE, 2011.
[6] Nguyen, Khanh, and Jinli Cao. "Top-k answers for XML keyword queries." World Wide Web 15, no. 5-6 (2012): 485-515.
[7] Nguyen, Khanh, and Jinli Cao. "K-graphs: selecting top-k data sources for XML keyword queries." In Database and Expert Systems Applications, pp. 425-439. Springer Berlin Heidelberg, 2011.
[8] Si, Xujie, Airu Yin, Xiaocheng Huang, Xiaojie Yuan, Xiaoguang Liu, and Gang Wang. "Parallel optimization of queries in XML dataset using GPU." In Parallel Architectures, Algorithms and Programming (PAAP), 2011 Fourth International Symposium on, pp. 190-194. IEEE, 2011.
[9] Weiner, Andreas M., and Theo Härder. "An integrative approach to query optimization in native XML database management systems." In Proceedings of the Fourteenth International Database Engineering & Applications Symposium, pp. 64-74. ACM, 2010.
[10] Wu, Xiaoying, and Dimitri Theodoratos. "A survey on XML streaming evaluation techniques." The VLDB Journal 22, no. 2 (2013): 177-202.
[11] Ding, Chen. "XML Query optimization." PhD diss., 2009.
[12] Bertolino, A., et al. "Automatic Test Data Generation for XML Schema-based Partition Testing." In Automation of Software Test (AST '07), Second International Workshop, Minneapolis, MN, 2007.
[13] Kido, K., et al. "Processing XPath Queries in PC-Clusters Using XML Data Partitioning." In Data Engineering Workshops, Proceedings of the 22nd International Conference, Atlanta, GA, USA, 2006.
[14] Oracle9i Database Performance Tuning Guide and Reference Release 2 (9.2), Using SQL Trace and TKPROF, 2002, A96533-02, Oracle, viewed 7 October 2010, <http://download.oracle.com/docs/cd/B10500_01/server.920/a96533/sqltrace.htm#8344>
[15] Memory Configuration and Use, 2008, B28274-02, Oracle, viewed 15 September 2010, <http://download.oracle.com/docs/cd/B28359_01/server.111/b28274/memory.htm>
[16] Onose, N., et al. "Rewriting Nested XML Queries Using Nested Views." In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, Chicago, IL, USA, 2006, pp. 443-454.
[17] Stantic, B., et al. "Handling of Current Time in Native XML Databases." In Proceedings of the 17th Australasian Database Conference, Volume 49, Hobart, Australia, 2006, pp. 175-182.
[18] Liu, F., C. T. Yu, W. Meng, and A. Chowdhury. "Effective keyword search in relational databases." In SIGMOD Conference, 2006, pp. 563-574.
[19] Hristidis, V., N. Koudas, Y. Papakonstantinou, and D. Srivastava. "Keyword proximity search in XML trees." IEEE Trans. Knowl. Data Eng. 18, no. 4 (2006): 525-539.
[20] Xu, Y., and Y. Papakonstantinou. "Efficient LCA based keyword search in XML data." In EDBT, 2008, pp. 535-546.
[21] Liu, Z., and Y. Chen. "Identifying meaningful return information for XML keyword search." In SIGMOD Conference, 2007, pp. 329-340.
[22] Sun, C., C. Y. Chan, and A. K. Goenka. "Multiway SLCA-based keyword search in XML data." In WWW, 2007, pp. 1043-1052.
[23] Liu, Z., and Y. Chen. "Reasoning and identifying relevant matches for XML keyword search." PVLDB 1, no. 1 (2008): 921-932.
[24] Amer-Yahia, S., and M. Lalmas. "XML search: languages, index and scoring." SIGMOD Record 35, no. 4 (2006): 16-23.
[25] Luo, Y., X. Lin, W. Wang, and X. Zhou. "Spark: top-k keyword query in relational databases." In SIGMOD Conference, 2007, pp. 115-126.
[26] Mamoulis, N., K. H. Cheng, M. L. Yiu, and D. W. Cheung. "Efficient aggregation of ranked inputs." In ICDE, 2006, p. 72.
[27] Xin, D., J. Han, and K. C.-C. Chang. "Progressive and selective merge: Computing top-k with ad-hoc ranking functions." In SIGMOD Conference, 2007, pp. 103-114.
[28] Das, G., D. Gunopulos, N. Koudas, and D. Tsirogiannis. "Answering top-k queries using views." In VLDB, 2006, pp. 451-462.
[29] Bansal, N., S. Guha, and N. Koudas. "Ad-hoc aggregations of ranked lists in the presence of hierarchies." In SIGMOD Conference, 2008, pp. 67-78.