and left t , right t, and left right. Here, the left and right child
unordered tree that will serve as a basis for evaluating the
existing new search algorithm for unordered trees that will reduce the
Given that, in an unordered tree, traversing the entire tree is the only
solution, because the arrangement of the data with respect to the
values of parent and child is of no relevance. To help reduce the
are as follows:
(1) to add two data fields to every node of a tree, each holding
information such as a frequency: one represents the left subtree
(left bit frequency) and the other represents the right subtree (right
bit frequency). These fields contain the binary digit frequency counts of
the subtrees, which are used during the search process. Frequency is the
count of 1's over the binary equivalents of all data in the tree; (2) to create a
search algorithm that reduces the search space by evaluating the
added data of a node to decide which sub-tree is
evaluated first when the data is not in the current node; (3)
to test and verify whether the added data and the designed algorithm
increase the accuracy and reduce the search space overall
compared to traversal.
This new approach may open a new idea in
the field of computer science: that for every random data item in a given set
there exists a unique property of the data that can be used and grouped
searching process. It is assumed that the data in a tree are all
whole numbers and unique, such that there are no duplicate data
our sample, and the data are limited to the range 1 to 255.
searching.
CHAPTER II
used it form first three phases: (1) It uses another algorithm that
if the new extension is the canonical form for its automorphism group,
Trees using Canonical Forms [Yun Chi, 2005], where it addresses the one
counting but differ in terms of what we're counting because our study
than zero and skip all others. With projection using PBR, each node Y of
which contain an index in the array and skip all others [3, p. 2]. This paper
parent node and skip or prune the sub-area if it shows a zero value, but as
to the process of obtaining the region shows the difference. The same
a set of input trees. Then they focus on unordered trees, and show that
tree to avoid repeated exploration of unordered trees, and our study can
be used at this phase, since our study is a search algorithm for
unordered trees and the frequencies we obtain can be used further in
their study.
from which they can extract optimal decision trees in linear time. They
experiments show that under the same constraints, DL8 has better test
exhaustive search does not always imply overfitting. The results also
show that DL8 is a useful and interesting tool to learn decision trees
decision tree[6].
ideas from the VLDB domain, we compress vertical bit vectors using
results from over a billion database and data mining style frequency
used bit of every data to obtain frequency, there in this study item set
CHAPTER III
RESEARCH METHODOLOGY
frequency and store it at the parent node. All the binary frequencies of the
left subtree are stored in the lbf of the parent node, and those of the right
subtree in its rbf.
3.1 Trees
designated the root, and for all x ∈ V, there is a unique path from r to
designate them as the first child, second child, and so on up to the kth
i=0
and there is a path from x to y , then x is called an
siblings, and if they have a common ancestor, they are called cousins
[1].
Illustration 3.1 Rooted ordered and unordered trees
sequence of 1's and 0's for a certain data. Using this idea, we can
(bit positions 0 to 7)
example 2:
Decimal base   128  64  32  16   8   4   2   1
7₁₀ =            0   0   0   0   0   1   1   1   binary number (bits)
Position         7   6   5   4   3   2   1   0
example 3:
Decimal base   128  64  32  16   8   4   2   1
128₁₀ =          1   0   0   0   0   0   0   0
137₁₀ =          1   0   0   0   1   0   0   1
98₁₀  =          0   1   1   0   0   0   1   0
70₁₀  =          0   1   0   0   0   1   1   0
is an equation, $\sum_{i=0}^{p} b_i 2^i = N$, used to sum the 1-bits in a binary
number, where p is the total position of the binary number and b is the bit
(i) $\sum_{i=0}^{p} b_i 2^i = 0(2^0)+1(2^1)+0(2^2)+1(2^3)+0(2^4)+0(2^5)+0(2^6)+0(2^7) = 10$
(ii) $\sum_{i=0}^{p} b_i 2^i = 0+2+0+8+0+0+0+0 = 10$
$\sum_{i=0}^{p} f_i = x$
Decimal base   128  64  32  16   8   4   2   1
10₁₀  =          0   0   0   0   1   0   1   0
7₁₀   =          0   0   0   0   0   1   1   1
128₁₀ =          1   0   0   0   0   0   0   0
137₁₀ =          1   0   0   0   1   0   0   1
98₁₀  =          0   1   1   0   0   0   1   0
70₁₀  =          0   1   0   0   0   1   1   0
------------- ------------------------------------------------------------------
Total: 450₁₀     2   2   1   0   2   2   4   2   1-bit frequency
                 7   6   5   4   3   2   1   0   positions
then it is also the total decimal numbers associated with that specific
position.
1₁₀ =   0 0 1
2₁₀ =   0 1 0
4₁₀ =   1 0 0
3₁₀ =   0 1 1
        1 2 2   1-bit frequencies
        2 1 0   positions,
therefore we can say that two numbers are associated with position
zero, namely 1 and 3; two numbers are also
associated with position 1, namely 2 and 3; and there is only
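The per-position counting described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `bit_frequency` and the `width` parameter are assumptions introduced here, not names from the study.

```python
def bit_frequency(values, width=8):
    """Count the 1-bits at every position; index 0 is position 0 (the 1's place)."""
    freq = [0] * width
    for value in values:
        for pos in range(width):
            if (value >> pos) & 1:
                freq[pos] += 1
    return freq


# The small three-bit example: 1, 2, 4 and 3.
small = bit_frequency([1, 2, 4, 3], width=3)

# The six numbers of figure 3.2.
figure_3_2 = bit_frequency([10, 7, 128, 137, 98, 70])

# Reverse the lists so they print from the highest position down to 0,
# matching the left-to-right layout used in the text.
print(list(reversed(small)))
print(list(reversed(figure_3_2)))
```

The lists are indexed by position, so `freq[1]` is the number of data items that have a 1-bit at position 1.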
Based on figure 3.2, from all the binary numbers we obtain one set of
1-bit frequencies. Every frequency number has an association with
and replacing $b_i$ with $f_i$. Using
figure 3.2, let f = { 2 , 2 , 1 , 0 , 2 , 2 , 4 , 2 } and let p be the total positions
Equation 3.3.1: frequency to decimal number
$$\sum_{i=0}^{p} f_i 2^i$$
(i) $\sum_{i=0}^{p} f_i (2^i) = 2(2^0)+4(2^1)+2(2^2)+2(2^3)+0(2^4)+1(2^5)+2(2^6)+2(2^7) = 450$
(ii) $\sum_{i=0}^{p} f_i (2^i) = 2(1)+4(2)+2(4)+2(8)+0(16)+1(32)+2(64)+2(128) = 450$
(iii) $\sum_{i=0}^{p} f_i (2^i) = 2+8+8+16+0+32+128+256 = 450$
With this, we can say that the decimal number 450 can be obtained from the
frequency { 2 , 2 , 1 , 0 , 2 , 2 , 4 , 2 } with decimal equivalent of
aspects of our study, since it reveals how much value every position
the sum of frequency; the sum of frequency is the total number of 1-bits
associated with all those numbers, by theorem 3.2. We can get the
f
positions
(i) $\sum_{i=0}^{p} f_i = 2+4+2+2+0+1+2+2 = 15$
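Both sums can be checked numerically. The following is a minimal sketch using the frequency set of figure 3.2; the variable names are illustrative assumptions.

```python
# Frequencies of figure 3.2, indexed by bit position (index 0 = position 0).
f_by_position = [2, 4, 2, 2, 0, 1, 2, 2]

# Frequency to decimal: each frequency is weighted by 2**position.
total_decimal = sum(f * 2**i for i, f in enumerate(f_by_position))

# Sum of frequency: the total number of 1-bits over all the numbers.
total_one_bits = sum(f_by_position)

# The weighted sum equals the plain sum of the original six numbers.
original_sum = sum([10, 7, 128, 137, 98, 70])

print(total_decimal, total_one_bits, original_sum)
```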
exist using the bits combination? The answer is definitely yes! All we
[Figure labels: most-significant-position (msp), less-significant-position (lsp)]
With all the ideas and knowledge discussed above, we now get
to one important point of our study, which is the probability. Computing the
probability is a very important part of our search study, since our
algorithm depends on it to decide which subtree to evaluate first,
given the probability result. We take it step by step; let us first
consider this:
Frequency   | Total positions | Possible binary numbers                      | Decimal numbers involved                              | Total combination of decimal numbers
{1}         | 0               | 1₂                                           | 1                                                     | 1
{1,1}       | 1               | 01₂, 10₂, 11₂                                | 1, 2, 3                                               | 3
{1,1,1}     | 2               | 001₂, 010₂, 011₂, 100₂, 101₂, 110₂, 111₂     | 1, 2, 3, 4, 5, 6, 7                                   | 7
{1,1,1,1}   | 3               | 0001₂ through 1111₂                          | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15     | 15
In table 3.4, we focus only on the position, since it is obtained from
the position of the 1-bit. Position p must be the number of 1-bits of a certain
number x, and with that we can compute the probability of x by:
given the total position p, i.e., what is the percentage that the decimal number 7
exists, given that the total position of the 1-bits of binary number 7 is 2, using
$$\left(\frac{1}{2^{p}-1}\right)100 = \left(\frac{1}{2^{2}-1}\right)100 = \left(\frac{1}{4-1}\right)100 = \left(\frac{1}{3}\right)100 = (0.33333)100 = 33.33333$$
So the probability is 33.33%; this is due to the fact that the frequency
there is only one binary number that could represent that frequency, the
10₁₀  =          0   0   0   0   1   0   1   0
7₁₀   =          0   0   0   0   0   1   1   1
128₁₀ =          1   0   0   0   0   0   0   0
--------- ---------------------------------------------------------------
Total: 145₁₀     1   0   0   0   1   1   2   1
                 7   6   5   4   3   2   1   0

$$c = \sum_{i=0}^{x} f_{msp_i}, \qquad c_1 = \sum_{i=0}^{x} {f_1}_{msp_i}$$

$$f = \left(\frac{\sum_{i=0}^{x} {f_1}_{lsp_i} + 1}{\sum_{i=0}^{x} f_{lsp_i} + \sum_{i=0}^{x} {f_1}_{lsp_i} + 2^{\sum_{i=0}^{x} f_{msp_i}} - 1}\right)(100) = \left(\frac{a_1 + 1}{a + a_1 + 2^{c} - 1}\right)(100)$$

position(msp) is the 1-bit position of an arbitrary binary number when

$$f_1 = \left(\frac{\sum_{i=0}^{x} f_{lsp_i} + 1}{\sum_{i=0}^{x} f_{lsp_i} + \sum_{i=0}^{x} {f_1}_{lsp_i} + 2^{\sum_{i=0}^{x} {f_1}_{msp_i}} - 1}\right)(100) = \left(\frac{a + 1}{a + a_1 + 2^{c_1} - 1}\right)(100)$$

f = { 1 , 0 , 0 , 0 , 1 , 1 , 2 , 1 }
      7   6   5   4   3   2   1   0   positions
msp = { 2 , 1 , 0 } , lsp = { 7 , 6 , 5 , 4 , 3 }
frequency, there we can identify also the lsp. Every value of msp and
msp = { 2 , 1 , 0 } (index positions 2 , 1 , 0)    lsp = { 7 , 6 , 5 , 4 , 3 } (index positions 4 , 3 , 2 , 1 , 0)

x = 136₁₀ = 1 0 0 0 1 0 0 0₂   (positions 7 6 5 4 3 2 1 0)

$$\left(\frac{a_1 + 1}{a + a_1 + 2^{c} - 1}\right)(100) = \left(\frac{6+1}{3+6+2^{3}-1}\right)(100) = \frac{7}{16}(100) = 0.4375(100) = 43.75$$

$$\left(\frac{a + 1}{a + a_1 + 2^{c_1} - 1}\right)(100) = \left(\frac{3+1}{3+6+2^{3}-1}\right)(100) = \frac{4}{16}(100) = 0.25(100) = 25$$

f = { 1 , 0 , 0 , 0 , 1 , 1 , 2 , 1 }
      7   6   5   4   3   2   1   0   positions
msp = { 7 , 3 } (index positions 1 , 0)    lsp = { 6 , 5 , 4 , 2 , 1 , 0 } (index positions 5 , 4 , 3 , 2 , 1 , 0)
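The msp and lsp sets listed above can be derived mechanically from the bits of a number. The sketch below assumes 8-bit data (values 1 to 255, as in the study); the function name `split_positions` is hypothetical.

```python
def split_positions(x, width=8):
    """Return (msp, lsp): the 1-bit and 0-bit positions of x, highest position first."""
    msp = [p for p in range(width - 1, -1, -1) if (x >> p) & 1]
    lsp = [p for p in range(width - 1, -1, -1) if not (x >> p) & 1]
    return msp, lsp


msp_7, lsp_7 = split_positions(7)        # 7   = 00000111
msp_136, lsp_136 = split_positions(136)  # 136 = 10001000

print(msp_7, lsp_7)
print(msp_136, lsp_136)
```

Together the two lists always cover every position exactly once, which is why the lsp can be identified as soon as the msp is known.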
The lbf information is used to represent the left sub-tree of a node,
and the rbf is used to represent the right sub-tree of the node, as shown
below:
3.6 Probability theorem
binary tree has at most two children, the left child and the right child.
Theorem 3.6.1
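number of number N . Let msp be the subset of b's 1-bit positions and lsp the
subset of b's 0-bit positions such that lsp ∪ msp = b accordingly. Let x
$$c = \sum_{i=0}^{x} f_{msp_i} \quad \text{and} \quad c_1 = \sum_{i=0}^{x} {f_1}_{msp_i}$$
$$f = \left(\frac{\sum_{i=0}^{x} {f_1}_{lsp_i} + 1}{\sum_{i=0}^{x} f_{lsp_i} + \sum_{i=0}^{x} {f_1}_{lsp_i} + 2^{c} - 1}\right)(100) \text{ percent}$$
$$f_1 = \left(\frac{\sum_{i=0}^{x} f_{lsp_i} + 1}{\sum_{i=0}^{x} f_{lsp_i} + \sum_{i=0}^{x} {f_1}_{lsp_i} + 2^{c_1} - 1}\right)(100) \text{ percent}$$
If f > f1 with respect to b's msp and lsp, then there is a higher
percentage that b is within f than within f1 .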
Proof:
10₁₀  =          0   0   0   0   1   0   1   0
7₁₀   =          0   0   0   0   0   1   1   1
128₁₀ =          1   0   0   0   0   0   0   0
--------- ---------------------------------------------------------------
Total: 145₁₀     1   0   0   0   1   1   2   1   f
137₁₀ =          1   0   0   0   1   0   0   1
98₁₀  =          0   1   1   0   0   0   1   0
70₁₀  =          0   1   0   0   0   1   1   0
------------- ----------------------------------------------------------------
Total: 305₁₀     1   2   1   0   1   1   2   1   f1   (1-bit frequency)
                 7   6   5   4   3   2   1   0   position
f = { 1 , 0 , 0 , 0 , 1 , 1 , 2 , 1 } ,
f1 = { 1 , 2 , 1 , 0 , 1 , 1 , 2 , 1 } ,
let N = 10 and N = b₂, and let p be the set of positions of bits such that:
b = { 0 0 0 0 1 0 1 0 } = 10₁₀
p = { 7 6 5 4 3 2 1 0 }
msp = { 3 , 1 }    lsp = { 7 , 6 , 5 , 4 , 2 , 0 }
c = f msp = { 1 , 2 } = 3    a = f lsp = { 1 , 0 , 0 , 0 , 1 , 1 } = 3
c1 = f1 msp = { 1 , 2 } = 3    a1 = f1 lsp = { 1 , 2 , 1 , 0 , 1 , 1 } = 6
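$$f = \left(\frac{a_1 + 1}{a + a_1 + 2^{c} - 1}\right)(100) = \left(\frac{6+1}{3+6+2^{3}-1}\right)(100) = \frac{7}{16}(100) = 43.75 \text{ percent}$$
$$f_1 = \left(\frac{a + 1}{a + a_1 + 2^{c_1} - 1}\right)(100) = \left(\frac{3+1}{3+6+2^{3}-1}\right)(100) = \frac{4}{16}(100) = 25 \text{ percent}$$
f = 43.75% and
f1 = 25%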
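The numbers in the proof can be reproduced with a short script. This is a sketch under stated assumptions: the function name `subtree_percentages` is hypothetical, and the formula used is the one reconstructed from the worked values in the text, (a1 + 1) / (a + a1 + 2^c - 1) for f and (a + 1) / (a + a1 + 2^c1 - 1) for f1.

```python
def subtree_percentages(x, f, f1, width=8):
    """Percentages for the two frequency sets f and f1 (both indexed by bit position)."""
    msp = [p for p in range(width) if (x >> p) & 1]      # 1-bit positions of x
    lsp = [p for p in range(width) if not (x >> p) & 1]  # 0-bit positions of x
    a = sum(f[p] for p in lsp)
    a1 = sum(f1[p] for p in lsp)
    c = sum(f[p] for p in msp)
    c1 = sum(f1[p] for p in msp)
    pct_f = (a1 + 1) / (a + a1 + 2**c - 1) * 100
    pct_f1 = (a + 1) / (a + a1 + 2**c1 - 1) * 100
    return pct_f, pct_f1


# Frequency sets from the proof, written from position 7 down to 0 as in the
# text, then reversed so that the list index equals the bit position.
f = [1, 0, 0, 0, 1, 1, 2, 1][::-1]
f1 = [1, 2, 1, 0, 1, 1, 2, 1][::-1]

pct_f, pct_f1 = subtree_percentages(10, f, f1)
print(pct_f, pct_f1)
```

For N = 10 this gives the larger percentage to f, the set that actually contains 10.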
on the node's basic structure, the lbf and the rbf data, such that every time
a node is created, the node has fields ready to hold a
bit-frequency.
Figure 3.7.2 Unordered tree with lbf and rbf data added
Every node holds two bit-frequencies: the left-bit-frequency (lbf), which
represents the left sub-tree, and the right-bit-frequency (rbf), which
represents the right sub-tree. The bit-frequencies are updated every time
a node is added, but only in those nodes that are part of the backtracking
process, and only one of the two bit-frequencies of a node is updated.
When backtracking comes from the left child's sub-tree, only the lbf of the
parent node is updated; likewise, only the rbf of the parent node is updated
when backtracking comes from the right sub-tree.
The lbf and rbf are among the most important parts of our study,
since they are its first objective. There are two phases to
obtain and update the lbf and rbf: (1) convert the data N into binary
form and get the msp; (2) update the msp positions of the lbf or rbf of all
ancestor nodes, based on the msp of data N, by increasing each msp value by 1.
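The two update phases can be sketched as follows. This is a minimal illustration, not the study's implementation: the `Node` class, the `attach` helper and the explicit choice of `side` are assumptions, since the text does not specify how a new node's place in the unordered tree is chosen.

```python
class Node:
    def __init__(self, data, parent=None, width=8):
        self.data = data
        self.parent = parent
        self.left = None
        self.right = None
        self.lbf = [0] * width  # 1-bit frequency of the left sub-tree, by position
        self.rbf = [0] * width  # 1-bit frequency of the right sub-tree, by position


def attach(parent, side, data, width=8):
    """Add a child on `side` ("left" or "right") and update every ancestor."""
    child = Node(data, parent, width)
    setattr(parent, side, child)
    # Phase 1: binary form of the data and its 1-bit positions (msp).
    msp = [p for p in range(width) if (data >> p) & 1]
    # Phase 2: walk back up to the root; at each ancestor update only the
    # frequency of the side the walk came from.
    node = child
    while node.parent is not None:
        freq = node.parent.lbf if node.parent.left is node else node.parent.rbf
        for p in msp:
            freq[p] += 1
        node = node.parent
    return child


root = Node(10)
n7 = attach(root, "left", 7)
n137 = attach(root, "right", 137)
n128 = attach(n7, "left", 128)
print(root.lbf, root.rbf, n7.lbf)
```

After these insertions the root's lbf describes {7, 128} and its rbf describes {137}, while the node holding 7 records only 128 in its lbf.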
the parent node, then the msp of the lbf of the parent node will be updated
evaluated to obtain a set of 1-bit positions, this set is called the most-
So, if the data of the current node is not equal to x, there will be two
zero value at any msp position of both the lbf and the rbf, then it is certain that the data
does not exist in either the left or the right sub-tree, and the search
stops. If only the lbf shows a zero value at any msp, the left
sub-tree is pruned and the search proceeds to the right sub-tree; the same applies if
only the rbf shows a zero value at any msp position. In phase 2, using the
lsp of x, the sums of the lbf and the rbf over the lsp are compared: if one of them
shows the lower value, the search proceeds to that sub-tree; otherwise, by
default, the left sub-tree is searched. The derived theorem will be
This does not mean that the data exists whenever both the lbf and the rbf show
non-zero values at every msp; take, for example, a frequency of
111: this frequency can be obtained from the binary numbers 001, 010, 100, a
3.6 Algorithm
BFS(R, D)
1. t ← NULL
2. if R equal to NULL then goto step 22
3. else t ← R
4. if t→data equal to D then goto step 22
5. else for every n ∈ msp of D
6. if LBFn equal to zero then
7. prune left subtree
8. if RBFn equal to zero then
9. prune right subtree
10. end for
11. for every n ∈ lsp of D, accumulate sum of lbfn and sum of rbfn
12. if sum of lbfn <= sum of rbfn
13. t ← BFS(R→left sub-tree, D)
14. if t equal to NULL then
15. t ← BFS(R→right sub-tree, D)
16. end if
17. else t ← BFS(R→right sub-tree, D)
18. if t equal to NULL then
19. t ← BFS(R→left sub-tree, D)
20. end if
21. end if
22. return t
23. end
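The pseudocode above can be sketched in Python as follows. This is one possible reading, not the study's code: the class `Node`, the helpers `subtree_freq` and `bfs_search`, the `visited` list and the sample tree are all assumptions made for illustration. Here the frequencies are computed bottom-up when the tree is built, and a pruned sub-tree is treated as empty.

```python
WIDTH = 8  # data are limited to 1..255 in the study


def subtree_freq(node):
    """1-bit frequency per position over every value stored in the sub-tree."""
    freq = [0] * WIDTH
    if node is not None:
        for p in range(WIDTH):
            freq[p] = ((node.data >> p) & 1) + node.lbf[p] + node.rbf[p]
    return freq


class Node:
    def __init__(self, data, left=None, right=None):
        self.data, self.left, self.right = data, left, right
        self.lbf = subtree_freq(left)
        self.rbf = subtree_freq(right)


def bfs_search(root, d, visited):
    """Binary frequency search for d; records each evaluated node in `visited`."""
    if root is None:
        return None
    visited.append(root.data)
    if root.data == d:
        return root
    msp = [p for p in range(WIDTH) if (d >> p) & 1]
    lsp = [p for p in range(WIDTH) if not (d >> p) & 1]
    # Phase 1: prune a side whose frequency is zero at any 1-bit position of d.
    left = root.left if all(root.lbf[p] > 0 for p in msp) else None
    right = root.right if all(root.rbf[p] > 0 for p in msp) else None
    # Phase 2: try the side with the lower lsp sum first (left on ties).
    if sum(root.lbf[p] for p in lsp) <= sum(root.rbf[p] for p in lsp):
        order = (left, right)
    else:
        order = (right, left)
    for child in order:
        found = bfs_search(child, d, visited)
        if found is not None:
            return found
    return None


# Hypothetical sample tree built from the six numbers used in the examples.
root = Node(10, Node(7, Node(128)), Node(137, Node(98), Node(70)))

path_hit = []
hit = bfs_search(root, 70, path_hit)
path_miss = []
miss = bfs_search(root, 5, path_miss)
print(path_hit, path_miss)
```

In this sample tree the search for 70 evaluates three of the six nodes, and the search for the absent value 5 stops after three nodes because every deeper sub-tree is pruned.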
3.7 Time complexity: Although the algorithm can reduce the nodes
involved overall, it takes extra time for the filtering and evaluation process.
Based on table 3.1, p represents the number of frequency
positions and k represents the internal nodes that are candidates for
backtracking in case the search misses.
CHAPTER IV
Based on the results in table 4.1 and illustration 4.1,
for a tree of 20 nodes including the root, the traditional search visits a
grand total of 210 nodes when each data item in the tree is searched for one by one,
including a search for a data item that is not in the tree, while the Binary
Frequency Search Algorithm (BFSA) visits only 106 nodes for the same
searches. With this result, BFSA reduces the
nodes visited by more than 49%.
Based on the results in table 4.2 and illustration 4.2,
for a tree of 10 nodes including the root, the traditional search visits a
grand total of 56 nodes when each data item in the tree is searched for one by one,
including a search for a data item that is not in the tree, while the Binary
Frequency Search Algorithm visits only 25 nodes. With this result, BFSA
reduces the nodes visited by more than 54%.
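The reduction percentages quoted for the two tables follow from simple arithmetic on the node counts. A small sketch, in which the helper name `reduction_percent` is an assumption:

```python
def reduction_percent(traversal_nodes, bfsa_nodes):
    """Share of node visits saved relative to the traditional search."""
    return (traversal_nodes - bfsa_nodes) / traversal_nodes * 100


table_4_1 = reduction_percent(210, 106)  # 20-node tree
table_4_2 = reduction_percent(56, 25)    # 10-node tree
print(f"{table_4_1:.2f}% {table_4_2:.2f}%")
```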
SIMULATION RESULTS
To check the reliability of the algorithm with respect to
how much it reduces the nodes involved and how accurately it
chooses between the two subtrees, the left and the right, a
series of simulations was run with random data used to create each
sample tree. Random data is used to make sure that every tree
constructed during the simulation is unique. Below is the
graph of the simulation results:
[Illustration 4.1: plot with x-axis "Iteration" (0 to 2200) and y-axis ticks from 56 to 58.5]
Based on the results shown in illustration 4.1, from 10 to 2000
simulated trees the algorithm shows efficiency in reducing the
nodes visited, by more than 59% overall.
Illustration 4.2: Average result of accuracy
[Plot with x-axis "Sample Unordered Trees" (0 to 2200) and y-axis "Accuracy Percentage" (ticks from 65.5 to 68)]
As for the accuracy of the algorithm, based on illustration 4.2,
for the samples of 2000 simulated unordered trees the algorithm shows
an accuracy greater than 66%.
CHAPTER V
CONCLUSIONS AND RECOMMENDATIONS
Based on the results, the algorithm clearly reduces
the nodes involved in the search process overall, and it can
easily determine whether a sub-tree must be pruned. Because of the
extra filtering and evaluation processes, the algorithm takes extra
time, with a time complexity of O(p k log n). The algorithm is
best used in the field of data mining, in search processes where
traversing the entire unordered tree is to be avoided and where the
data sets are unique.
For further studies, the researcher recommends enhancing the
algorithm so that it eliminates the k in the O(p k log n) time
complexity; in that case the algorithm becomes faster and more precise in its
searching than traversal. Our algorithm can also be used for mining
frequent subtrees or embedded subtrees.
APPENDICES
A. Glossary of Terms