
West University of Timisoara
Faculty of Mathematics and Computer Science
Department of Computer Science
Specialization: Computer Science in English

BACHELOR THESIS
A C++ tree library

Graduate Student: Diana-Mirona Lupeiu

Scientific Coordinator: Lector. Dr. Stelian Mihala

Timișoara, 2011

ABSTRACT
The main goal of this thesis was the implementation of a tree library in C++, which was subsequently used in an application implementing an embedding algorithm for rooted trees. The library organizes data in the form of a so-called n-ary tree, that is, a tree in which every node is connected to an arbitrary number of child nodes. Nodes that share the same parent are called "siblings", while nodes directly below a given node are called its "children". At the top of the tree there is a set of nodes characterised by the fact that they do not have any parents. The collection of these nodes is called the "head" of the tree or of the forest.

The thesis is structured in six chapters, each contributing to a brief analysis of the tree library. A concise definition of trees is given, different types of trees are analyzed, and it is shown how a tree can be implemented. An overview of the algorithm and of how it was created is also offered, together with a description of all the methods in the library and how they are used. An application which uses this library is described in chapter 6. Finally, a short summary is presented and further developments are suggested.

CONTENT

1 INTRODUCTION
  1.1 The purpose of this library
  1.2 The Visual Studio 2010 development environment
  1.3 Using the library

2 TREES, CLASSIFICATION AND REPRESENTATION
  2.1 Trees
  2.2 Tree representations
  2.3 Binary trees
  2.4 Types of binary trees

3 THE CLASS HIERARCHY
  3.1 Inheritance, classes and subclasses
  3.2 The Tree class hierarchy
  3.3 The Tree class
    3.3.2 General description
  3.4 The TreeNode class
    3.4.1 The class diagram
    3.4.2 General description
  3.5 The RootedTree class
    3.5.1 The class diagram
    3.5.2 General description
  3.6 The BinaryTree class
    3.6.1 The class diagram
    3.6.2 General description

4 ITERATORS
  4.1 Iterators
  4.2 The Base Iterator Class
    4.2.1 The class diagram
    4.2.2 General description
  4.3 The PreOrderIter Class
    4.3.1 The Class Diagram
    4.3.2 General Description
  4.4 The PostOrderIter Class
    4.4.1 The Class Diagram
    4.4.2 General Description
  4.5 The SiblingIter Class
    4.5.1 The Class Diagram
    4.5.2 General Description

5 SERIALIZATION
  5.1 The Graphml File Format
    5.1.1 Functional Description
  5.2 The Serialization Method
    5.2.1 Functional description
    5.2.2 The method code
  5.3 The Text File Used for Input

6 AN APPLICATION USING THE TREE LIBRARY
  6.1 The Binary Embedding Application
  6.2 Classes and methods used

7 CONCLUSIONS AND FURTHER DEVELOPMENTS

8 BIBLIOGRAPHY

1 INTRODUCTION

1.1 The purpose of this library


I chose this subject with great pleasure and interest because it is broad and at the same time applicable to many programming problems. I also think that data represented as a tree is well structured, because trees store data in a hierarchical manner. Trees are an easy way to represent information which can be hierarchically subdivided, such as organizational charts, design spaces and directory structures, and they underlie common data structures like binary search trees, B-trees and AVL trees. In the last ten years or so there have been many papers which discuss algorithms for aesthetically laying out hierarchical trees, though few of them are intended to do so dynamically.

Trees are often a natural choice (e.g. when writing a chess program), but they are also effective in many situations where data that is received sequentially can be beneficially stored in a more organised fashion, e.g. parsing mathematical expressions, parsing sentences, or reading words that later need to be searched or retrieved in alphabetical order. The major advantage of trees over other data structures is that the related sorting and searching algorithms, such as in-order traversal, can be very efficient. For data stored in a tree, searching and sorting are at worst as effective as in a linked list or a stack; at best they are much faster.

Trees play a significant role in the organization of data for efficient information retrieval and are ideal candidates for fast searches, insertions, deletions, and sequential access. Efficiency and speed are possible because trees "spread out" the data they store, so that different paths in the tree lead quickly to the relevant data. They are also convenient for conceptualizing algorithms.

A tree library is a collection of classes and methods that together store and organise data as a tree. The tree library for C++ provides an STL-like container class for n-ary trees, templated over the data stored at the nodes. Various types of iterators are provided (post-order, pre-order, and others). Where possible the access methods are compatible with the STL, or alternative algorithms are available.

1.2 The Visual Studio 2010 development environment


Microsoft Visual Studio is an integrated development environment (IDE) from Microsoft. It can be used to develop console and graphical user interface applications along with Windows Forms applications, web sites, web applications, and web services, in both native code and managed code, for all platforms supported by Microsoft Windows, Windows Mobile, Windows CE, .NET Framework, .NET Compact Framework and Microsoft Silverlight. It supports code quality throughout the entire application lifecycle, from design to deployment, whether you are developing applications for SharePoint, the web, Windows, Windows Phone, or beyond.

Without Visual Studio, you would need to open a text editor, write all of the code, and then run a command-line compiler to create an executable application. The issue with the text editor and command-line compiler is that you would lose a lot of productivity through manual processes. Visual Studio automates many of the mundane tasks that are required to develop applications.

Visual Studio includes a suite of project types that you can choose from. Whenever you start a new project, Visual Studio automatically generates skeleton code that can compile and run immediately. Each project type has project items that you can add, and project items include skeleton code. Visual Studio also offers many premade controls, which include skeleton code, saving you from having to write your own code for repetitive tasks. Many of the more complex controls contain wizards that help you customize the control's behavior, generating code based on the wizard options you choose.

The Visual Studio editor optimizes your coding experience. Much of your code is colorized; you have IntelliSense, tips that pop up as you type; and keyboard shortcuts for performing a multitude of tasks. There are a few refactorings, features that help you quickly improve the organization of your code while you are coding. For example, the Rename refactoring allows you to change an identifier name where it is defined, which also changes every place in the program that references that identifier. Visual Studio 2010 introduces even more features, such as a call hierarchy, which lets you see the call paths in your code; snippets, which allow you to type an abbreviation that expands to a code template; and action lists for automatically generating new code.

A plethora of tools are available to aid you in rapidly creating quality software. You have the Toolbox, packed with controls, a Server Explorer for working with operating system services and databases, a Solution Explorer for working with your projects, testing utilities, and visual designers.

You can customize many parts of the Visual Studio environment, including colors, editor options, and layout. The options are so extensive that you need to know where to look to find them all. If the out-of-the-box development environment does not offer a feature you need, you can write your own macros to automate a series of tasks you find yourself repeating. For more sophisticated customization, Visual Studio exposes an application programming interface (API) for creating add-ins and extensions. Several third-party companies have chosen to integrate their own applications with Visual Studio. For example, Embarcadero's Delphi language and development environment is hosted in Visual Studio. The rich and customizable development environment in Visual Studio helps you work the way you want to.

1.3 Using the library


A library is simply a collection of prewritten routines that supports and extends the language in which it is written, by providing standard code units that the developer can incorporate into programs to carry out common operations. The operations implemented by routines in the various libraries greatly enhance productivity by saving the effort of writing and testing the code for such operations. A tree library is convenient to use, firstly because it stores data in an organized way and secondly because it is easy to implement and to follow. In this case, the library was created to serve the Binary Embedding for Hierarchical Taxonomies algorithm. The library is a tool for this algorithm, and it plays an important role in its implementation.

2 TREES, CLASSIFICATION AND REPRESENTATION

2.1 Trees
A tree is a widely used data structure that emulates a hierarchical tree structure with a set of linked nodes. Because trees are structurally more complex than lists, they have a more extensive nomenclature for referring to their various subparts. Generally speaking, the names are direct analogies drawn from two sources: the parts of arboreal trees and genealogical relationships.

The root of a tree data structure is the most important single point of the tree structure. By definition a (nonempty) tree has only one root (e.g., tournament winner, CEO, or founding father), and that root is the reference point for the entire tree, with all other points defined relative to the root. Computer scientists emphasize this aspect by drawing trees "upside down," thus accenting the all-important root at the top.

The junction of two branches is a node. Although this biological term is less well known than the others, it is one of the most important for data structures. In a tree structure, data is associated, or stored, with the node. Conceptually, the root is a node, as are all of its subordinates.

Each node in a tree other than the root has exactly one predecessor, and each node has at most a particular number of successors. If there is no limit on the number of successors that a node can have, the tree is called a general tree. If there is a maximum number N of successors for a node, then the tree is called an N-ary tree. In particular, a binary tree is a tree in which each node has 0, 1 or 2 successors. A node with no successors is called a leaf, and there will usually be many leaves in a tree. A free tree is a tree that is not rooted.

Each node in a tree has zero or more child nodes, which are below it in the tree. A node that has a child is called the child's parent node. A node has at most one parent.

The branches, or edges, of a tree connect two nodes together. Schematically, they are drawn as lines; conceptually, they represent the logical relationship between two nodes (e.g., winner of the game, supervisor-subordinate, parent-child, etc.). Branch is a better word than edge because branches are directional: one end is the superior and the other the subordinate. The further one gets from the root, the more numerous (and less significant) are the branches. The smallest (farthest from the root) branches are sometimes called twigs.

A series of branches connecting a parent to its child to the grandchild and so on is called a path. The number of branches on a single path is called its path length. The longest path from the root to a leaf dictates the height of the tree: for nonempty trees, the height is one more than the length of the longest path. A tree with just one node (and no path) has height 1, while the empty tree has height zero. For example, the World Series has height 4, but only three rounds in the tournament, or branches in the path. The level of a node is the length of the path from the root to that node. Note that this means that the root is at level 0, and the furthest leaves are at level h - 1, where h is the height of the tree. Beware of one side effect of this convention in some mathematical derivations: a node at a higher numeric level is visually located below those with lower levels, and vice versa.

An internal node or inner node is any node that has child nodes and is thus not a leaf. Similarly, an external node or outer node is any node that does not have child nodes and is thus a leaf. A subtree of a tree T is a tree consisting of a node in T and all of its descendants in T. The subtree corresponding to the root node is the entire tree; the subtree corresponding to any other node is called a proper subtree.
A forest is a set (usually an ordered set) of zero or more disjoint trees. Another way to phrase the definition of a tree would be to say that the nodes of a tree excluding the root form a forest. There is very little distinction between abstract forests and trees. If we delete the root of a tree, we have a forest; conversely, if we add just one node to any forest and regard the trees of the forest as subtrees of the new node, we get a tree. Therefore the words tree and forest are often used almost interchangeably during informal discussions about data structures.

The children of a node are usually ordered from left to right. If we wish explicitly to ignore the order of children, we shall refer to a tree as an unordered tree. The "left-to-right" ordering of siblings (children of the same node) can be extended to compare any two nodes that are not related by the ancestor-descendant relationship. The relevant rule is that if a and b are siblings, and a is to the left of b, then all the descendants of a are to the left of all the descendants of b. A simple rule, given a node n, for finding those nodes to its left and those to its right, is to draw the path from the root to n. All nodes branching off to the left of this path, and all descendants of such nodes, are to the left of n. All nodes and descendants of nodes branching off to the right are to the right of n.

There are several useful ways in which we can systematically order all nodes of a tree. The three most important orderings are called preorder, inorder and postorder; these orderings are defined recursively as follows. If a tree T is null, then the empty list is the preorder, inorder and postorder listing of T. If T consists of a single node, then that node by itself is the preorder, inorder, and postorder listing of T. Otherwise, let T have root n and subtrees T1, T2, ..., Tk, ordered from left to right. The preorder listing of T is n followed by the preorder listings of T1, ..., Tk; the inorder listing is the inorder listing of T1, followed by n, followed by the inorder listings of T2, ..., Tk; and the postorder listing is the postorder listings of T1, ..., Tk followed by n.
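
The recursive definition of the three orderings translates almost directly into code. The following is only a minimal sketch: the `Node` type with a vector of children is a hypothetical stand-in chosen for brevity, not the TreeNode class of the library described in this thesis, and for inorder it assumes the usual convention for general trees (leftmost subtree, then the root, then the remaining subtrees).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical node type for illustration only.
struct Node {
    std::string label;
    std::vector<Node> children;  // ordered left to right
};

// Preorder: the root first, then every subtree in preorder.
void preorder(const Node& n, std::vector<std::string>& out) {
    out.push_back(n.label);
    for (const Node& c : n.children) preorder(c, out);
}

// Inorder: the leftmost subtree, then the root, then the remaining subtrees.
void inorder(const Node& n, std::vector<std::string>& out) {
    if (!n.children.empty()) inorder(n.children.front(), out);
    out.push_back(n.label);
    for (std::size_t i = 1; i < n.children.size(); ++i) inorder(n.children[i], out);
}

// Postorder: every subtree in postorder, then the root.
void postorder(const Node& n, std::vector<std::string>& out) {
    for (const Node& c : n.children) postorder(c, out);
    out.push_back(n.label);
}
```

Each function appends the visited labels to `out`, so a caller can compare the three listings of the same tree side by side.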

2.2 Tree representations


There are many different ways to represent trees. Common representations store the nodes as records allocated on the heap, with pointers to their children, their parents, or both, or as items in an array, with the relationships between them determined by their positions in the array.

Drawing a tree follows some rules. The root is drawn at the top; below it are its children. An arc connects a node to each of its children. We then continue in the same manner: the children of each node are drawn below the node.
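
As an illustration of the array-based representation, one common scheme stores a binary tree level by level, so that parent and child positions follow from index arithmetic alone. This is a sketch under that assumption; the text above does not say which array layout is meant, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index arithmetic for a binary tree stored level by level in an array,
// with the root at index 0.
std::size_t leftChild(std::size_t i)  { return 2 * i + 1; }
std::size_t rightChild(std::size_t i) { return 2 * i + 2; }

// The root (index 0) has no parent, hence the precondition.
std::size_t parent(std::size_t i) {
    assert(i > 0);
    return (i - 1) / 2;
}

// True when index i refers to a node that is actually stored.
bool exists(const std::vector<int>& tree, std::size_t i) {
    return i < tree.size();
}
```

No pointers are stored at all in this layout, which is compact for trees whose levels are filled, but wastes space when the tree is sparse.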


Fig. 2.1

In general, each child of a node is the root of a tree within the big tree. For example, B is the root of a little tree (B, D, E), and so is C. These inner trees are called subtrees. The subtrees of a node are the trees whose roots are the children of the node.

A path in a tree is any linear subset of the tree. For example, A-B-E and C-F are paths. The length of a path can be counted as either the number of nodes or the number of edges on the path; counting nodes, A-B-E has length 3. There is a unique path from the root to any node. Simple as this property seems, it is extremely important: all algorithms for processing trees depend upon it. The depth or level of a node is the length of this path. The depth or height of a tree is the maximum depth of the nodes in the tree. A tree is ordered if there is significance to the order of the subtrees.
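
The depth of a node and the height of a tree, as defined above, can be computed with two short recursive functions. This sketch counts nodes on the path, matching the statement that A-B-E has length 3. The `Node` type is hypothetical, and the tree shape used when trying it out (A with children B and C, B with children D and E, C with child F) is an assumption based on the paths named in the text, since the figure is not reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical node type for illustration only.
struct Node {
    std::string label;
    std::vector<Node> children;
};

// Height counted in nodes: 1 for a single node.
int height(const Node& n) {
    int best = 0;
    for (const Node& c : n.children) best = std::max(best, height(c));
    return best + 1;
}

// Depth of the node labelled `target`, counted in nodes on the path from
// the root; returns 0 when no such node exists.
int depthOf(const Node& n, const std::string& target, int depth = 1) {
    if (n.label == target) return depth;
    for (const Node& c : n.children) {
        int d = depthOf(c, target, depth + 1);
        if (d != 0) return d;
    }
    return 0;
}
```

Because the path from the root to any node is unique, `depthOf` can stop at the first match it finds.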

Fig. 2.2

If this is a family tree, there may be no significance to left and right. In this case the tree is unordered, and we could redraw the tree exchanging subtrees without affecting its meaning. On the other hand, there may be some significance to left and right: maybe the left child is younger than the right, or (as is the case here) maybe the left child has the name that is earlier in the alphabet. Then the tree is ordered, and we are not free to move the subtrees around.


One way to implement a tree would be to have in each node, besides its data, a pointer to each child of the node. However, since the number of children per node can vary so greatly and is not known in advance, it might be infeasible to make the children direct links in the data structure, because there would be too much wasted space. The solution is simple: Keep the children of each node in a linked list of tree nodes. The following declaration is typical.
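    typedef struct tree_node *tree_ptr;

    struct tree_node {
        element_type element;    /* data stored in the node */
        tree_ptr first_child;    /* leftmost child, or NULL */
        tree_ptr next_sibling;   /* next child of the same parent, or NULL */
    };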

Fig. 2.3

Figure 2.3 shows how a tree might be represented in this implementation. Arrows that point downward are first_child pointers. Arrows that go left to right are next_sibling pointers. Null pointers are not drawn, because there are too many. In the tree of Figure 2.3, node E has both a pointer to a sibling (F) and a pointer to a child (I), while some nodes have neither.
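
A C++ counterpart of the declaration above might look as follows. This is a sketch for illustration, not the node class of the library presented later: the helper names `addChild` and `childrenOf` are hypothetical, and `std::unique_ptr` is used here only so that the example cleans up after itself.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Field names mirror first_child / next_sibling from the declaration above.
struct TreeNode {
    std::string element;
    std::unique_ptr<TreeNode> first_child;   // leftmost child, or null
    std::unique_ptr<TreeNode> next_sibling;  // next child of the same parent, or null
    explicit TreeNode(std::string e) : element(std::move(e)) {}
};

// Append a new child at the end of parent's child list and return it.
TreeNode* addChild(TreeNode& parent, const std::string& element) {
    std::unique_ptr<TreeNode>* slot = &parent.first_child;
    while (*slot) slot = &(*slot)->next_sibling;  // walk to the end of the sibling list
    *slot = std::make_unique<TreeNode>(element);
    return slot->get();
}

// Collect the elements of all children by following next_sibling links.
std::vector<std::string> childrenOf(const TreeNode& parent) {
    std::vector<std::string> out;
    for (const TreeNode* c = parent.first_child.get(); c; c = c->next_sibling.get())
        out.push_back(c->element);
    return out;
}
```

Every node carries exactly two links regardless of how many children it has, which is the space saving the text argues for; the price is that reaching the k-th child takes k steps along the sibling list.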

2.3 Binary trees


A binary tree is made of nodes, where each node contains a "left" pointer, a "right" pointer, and a data element. The "root" pointer points to the topmost node in the tree. The left and right pointers recursively point to smaller "subtrees" on either side. A null pointer represents a binary tree with no elements, the empty tree. The formal recursive definition is: a binary tree is either empty (represented by a null pointer), or is made of a single node whose left and right pointers each point to a binary tree.

The root node of a binary tree is the node with no parent. There is at most one root node in a rooted tree, and a leaf has no children. Siblings in a binary tree are nodes that share the same parent node. A node p is an ancestor of a node q if it lies on the path from q to the root; the node q is then termed a descendant of p. The in-degree of a node is the number of edges arriving at that node, and the out-degree of a node is the number of edges leaving that node.

Familiar examples of binary trees are the family tree (pedigree) with a person's father and mother as descendants, the history of a tennis tournament with each game being a node denoted by its winner and the two previous games of the combatants as its descendants, or an arithmetic expression with dyadic operators, with each operator denoting a branch node with its operands as subtrees.

Binary trees have several properties. In the formulas below, h is the height of the tree counted in edges, so that a tree consisting of a single node has h = 0.

- The number of nodes n in a perfect binary tree is $n = 2^{h+1} - 1$.
- The number of nodes n in a complete binary tree is at least $n = 2^h$ and at most $n = 2^{h+1} - 1$.
- The number of leaf nodes L in a perfect binary tree is $L = 2^h$.
- The number of nodes n in a perfect binary tree can also be found from the number of leaf nodes L: $n = 2L - 1$.
- The number of NULL links in a complete binary tree of n nodes is $n + 1$.
- The number of leaf nodes in a complete binary tree of n nodes is $\lceil n/2 \rceil$.
- For any non-empty binary tree with $n_0$ leaf nodes and $n_2$ nodes of degree 2, $n_0 = n_2 + 1$.
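
The counting properties listed above can be checked on a concrete tree. The sketch below builds a perfect binary tree and counts its nodes, leaves and null links; all names are hypothetical, and the height is assumed to be counted in edges, so a single node has height 0.

```cpp
#include <cassert>
#include <memory>

// Hypothetical node type for illustration; it carries no data.
struct Node {
    std::unique_ptr<Node> left, right;
};

// Build a perfect binary tree of height h, counted in edges (h = 0 is a single node).
std::unique_ptr<Node> makePerfect(int h) {
    auto n = std::make_unique<Node>();
    if (h > 0) {
        n->left = makePerfect(h - 1);
        n->right = makePerfect(h - 1);
    }
    return n;
}

int countNodes(const Node* n) {
    return n ? 1 + countNodes(n->left.get()) + countNodes(n->right.get()) : 0;
}

int countLeaves(const Node* n) {
    if (!n) return 0;
    if (!n->left && !n->right) return 1;
    return countLeaves(n->left.get()) + countLeaves(n->right.get());
}

// Every empty left or right pointer counts as one null link.
int countNullLinks(const Node* n) {
    return n ? countNullLinks(n->left.get()) + countNullLinks(n->right.get()) : 1;
}
```

For h = 3 the formulas predict 15 nodes, 8 leaves and 16 null links, which is what these functions return for `makePerfect(3)`.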


Operations with binary trees


Tree traversal

One of the most important operations on a binary tree is traversal. Tree traversal is the process of visiting each node in the tree exactly one time. Traversal may be interpreted as putting all nodes on one line, or linearizing a tree. The definition of traversal specifies only one condition, visiting each node only one time, but it does not specify the order in which the nodes are visited. Hence, there are as many tree traversals as there are permutations of nodes; for a tree with n nodes, there are n! different traversals. Most of them, however, are rather chaotic and do not indicate much regularity, so that implementing such traversals lacks generality: for each n, a separate set of traversal procedures must be implemented, and only a few of them can be used for a different number of data. For example, two possible traversals of the tree in Fig. 2.4 that may be of some use are the sequence 2, 10, 12, 20, 13, 25, 29, 31 and the sequence 29, 31, 20, 12, 2, 25, 10, 13. The first sequence lists even numbers and then odd numbers in ascending order.

Fig. 2.4

The second sequence lists all nodes level by level, right to left, starting from the lowest level and moving up to the root. The sequence 13, 31, 12, 2, 10, 29, 20, 25 does not indicate any regularity in the order of numbers or in the order of the traversed nodes. It is just random jumping from node to node that in all likelihood is of no use. Nevertheless, all three sequences are the results of legitimate traversals out of 8! = 40,320.

Breadth-first traversal visits each node starting from the top (or bottom) level and moving down (or up) level by level, visiting the nodes on each level from left to right (or from right to left). There are thus four possibilities, and one of them, a top-down, left-to-right breadth-first traversal of the tree in Fig. 2.4, results in the sequence 13, 10, 25, 2, 12, 20, 31, 29.

Implementation of this kind of traversal is straightforward when a queue is used. Consider a top-down, left-to-right breadth-first traversal. After a node is visited, its children, if any, are placed at the end of the queue, and the node at the beginning of the queue is visited next. Since the children of a node on level n are on level n+1, placing these children at the end of the queue ensures that they are visited after all nodes from level n have been visited. Thus the restriction that all nodes on level n must be visited before any node on level n+1 is satisfied.

Depth-first traversal proceeds as far as possible to the left (or right), then backs up until the first crossroad, goes one step to the right (or left), and again proceeds as far as possible to the left (or right). This process is repeated until all nodes are visited. This definition, however, does not specify exactly when nodes are visited: before proceeding down the tree or after backing up. There are several variations of depth-first traversal, built from three tasks:

V - visiting a node
L - traversing the left subtree
R - traversing the right subtree

An orderly traversal takes place if these tasks are performed in the same order for each node. The three tasks can themselves be ordered in 3! = 6 ways, so there are six possible ordered depth-first traversals:

VLR   VRL
LVR   RVL
LRV   RLV

If the number of different orders still seems like a lot, it can be reduced to three traversals where the move is always from left to right and attention is focused on the first column. The three traversals are given these standard names:

VLR - preorder tree traversal
LVR - inorder tree traversal
LRV - postorder tree traversal

Searching a binary tree does not modify the tree. It scans the tree in a predetermined way to access some or all of the keys in the tree, but the tree itself remains undisturbed after such an operation. Tree traversals can change the tree, but they may also leave it in the same condition; whether or not the tree is modified depends on the actions prescribed by visit(). There are certain operations that always make some systematic changes in the tree, such as adding nodes, deleting them, modifying elements, merging trees, and balancing trees to reduce their height.

To insert a new node, called n_node, a tree node with a dead end, called t_node, has to be reached, and the new node has to be attached to it. The t_node is found using the same technique that tree searching uses: the key of the n_node to be inserted is compared to the value of the node currently being examined during the tree scan, denoted c_node. If it is less than that value, the left child (if any) is tried; otherwise, the right child is tried. If the child of c_node to be tested is empty, the scan is discontinued and n_node becomes this child.

Three approaches to traversing binary trees are commonly distinguished: traversing with the help of a stack, traversing with the aid of threads, and traversing through tree transformation. The first approach does not change the tree during the process. The third approach changes it, but restores it to the same condition as before it started. Only the second approach needs some preparatory operations on the tree to become feasible: it requires threads. These threads may be created each time before the traversal procedure starts its task and removed each time it is finished. If the traversal is performed infrequently, this is a viable option. Another approach is to maintain the threads in all operations on the tree, for example when inserting a new element into the binary tree.
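
The insertion procedure, the queue-based breadth-first traversal and the three standard depth-first orders described above can be sketched together in a few lines. The `Node` type and function names are hypothetical and are not taken from the library. The keys used in the note after the code are those listed for the tree of Fig. 2.4; that inserting them in breadth-first order reproduces the shape of the figure is an assumption, since the figure itself is not reproduced here.

```cpp
#include <cassert>
#include <memory>
#include <queue>
#include <vector>

// Hypothetical binary search tree node for illustration.
struct Node {
    int key;
    std::unique_ptr<Node> left, right;
    explicit Node(int k) : key(k) {}
};

// Insertion: scan down from the root, going left when the new key is smaller
// and right otherwise, until an empty child slot (the "dead end") is found.
void insert(std::unique_ptr<Node>& root, int key) {
    std::unique_ptr<Node>* slot = &root;
    while (*slot) slot = key < (*slot)->key ? &(*slot)->left : &(*slot)->right;
    *slot = std::make_unique<Node>(key);
}

// Top-down, left-to-right breadth-first traversal using a queue.
std::vector<int> breadthFirst(const Node* root) {
    std::vector<int> out;
    std::queue<const Node*> q;
    if (root) q.push(root);
    while (!q.empty()) {
        const Node* n = q.front();
        q.pop();
        out.push_back(n->key);                  // visit the node at the front
        if (n->left) q.push(n->left.get());     // its children go to the back
        if (n->right) q.push(n->right.get());
    }
    return out;
}

void preorder(const Node* n, std::vector<int>& out) {   // VLR
    if (!n) return;
    out.push_back(n->key);
    preorder(n->left.get(), out);
    preorder(n->right.get(), out);
}

void inorder(const Node* n, std::vector<int>& out) {    // LVR
    if (!n) return;
    inorder(n->left.get(), out);
    out.push_back(n->key);
    inorder(n->right.get(), out);
}

void postorder(const Node* n, std::vector<int>& out) {  // LRV
    if (!n) return;
    postorder(n->left.get(), out);
    postorder(n->right.get(), out);
    out.push_back(n->key);
}
```

Inserting 13, 10, 25, 2, 12, 20, 31, 29 in that order and calling `breadthFirst` yields the same sequence again, in agreement with the breadth-first listing given in the text, while `inorder` lists the keys in ascending order.
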
Deletion

Deleting a node is another operation necessary to maintain a binary tree. The level of complexity in performing the operation depends on the position of the node to be deleted in the tree. It is by far more difficult to delete a node having two subtrees than to delete a leaf; the complexity of the deletion algorithm is proportional to the number of children the node has. There are three cases of deleting a node from a binary tree.

The node is a leaf; it has no children. This is the easiest case to deal with. The appropriate pointer of its parent is set to null and the node is disposed of by delete (Fig. 2.5).


Fig. 2.5

The node has one child. This case is not complicated. The parent's pointer to the node is reset to point to the node's child. In this way, the node's children are lifted up by one level and all great-great-...grandchildren lose one "great" from their kinship designations. For example, the node containing 20 (Fig. 2.6) is deleted by setting the right pointer of its parent containing 15 to point to 20's only child, which is 16.

Fig. 2.6

The node has two children. In this case, no one-step operation can be performed, since the parent's right or left pointer cannot point to both of the node's children at the same time.

One solution makes one tree out of the two subtrees of the node and then attaches it to the node's parent. This technique is called deletion by merging. By the nature of binary search trees, every value of the right subtree is greater than every value of the left subtree, so the best thing to do is to find in the left subtree the node with the greatest value and make it a parent of the right subtree. Symmetrically, the node with the lowest value can be found in the right subtree and made a parent of the left subtree. The desired node is the rightmost node of the left subtree. It can be located by moving along this subtree and taking right pointers until null is encountered. This means that this node will not have a right child, and there is no danger of violating the search-tree property in the original tree by setting that rightmost node's right pointer to the right subtree. The same could be done by setting the left pointer of the leftmost node of the right subtree to the left subtree.

Another solution, deletion by copying, was proposed by Thomas Hibbard and Donald Knuth. If the node has two children, the problem can be reduced to one of two simple cases: the node is a leaf, or the node has only one nonempty child. This can be done by replacing the key being deleted with its immediate predecessor (or successor). A key's predecessor is the key in the rightmost node in the left subtree (and analogously, its immediate successor is the key in the leftmost node in the right subtree). First, the predecessor has to be located. This is done, again, by moving one step to the left to reach the root of the node's left subtree and then moving as far to the right as possible. Next, the key of the located node replaces the key to be deleted. And that is where one of the two simple cases comes into play. If the rightmost node is a leaf, the first case applies; however, if it has one child, the second case is relevant. In this way, deletion by copying removes a key k1 by overwriting it with another key k2 and then removing the node that originally held k2. This algorithm does not increase the height of the tree, but it still causes a problem if it is applied many times along with insertion. The algorithm is asymmetric; it always deletes the node of the immediate predecessor of the key being removed, possibly reducing the height of the left subtree and leaving the right subtree unaffected.

Programmers also use a binary tree as a model to create a data structure that encodes the logic used to make complex decisions. Here is how this works. Let us say that a stem consists of a set of program instructions. At the end of the stem, the program evaluates a binary expression. Recall that a binary expression evaluates to either a Boolean true or false. Based on the evaluation, the program proceeds down one of two branches. Each branch has its own set of program instructions.
The basic concept of a binary tree isn't new to you because it uses the Boolean logic that you learned to implement using an if statement in your program. An if statement evaluates an expression that results in a Boolean value. Depending on the Boolean value, the if statement executes one of two sets of instructions.
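The deletion-by-copying procedure described earlier in this section can be sketched as follows. This is only an illustration under stated assumptions: `BstNode`, `bstInsert`, `deleteByCopying` and `contains` are hypothetical names introduced here and are not part of the thesis library, and the keys are assumed to be distinct integers.

```cpp
#include <cassert>

// Hypothetical node type for this sketch; not part of the thesis library.
struct BstNode {
    int key;
    BstNode* left;
    BstNode* right;
    explicit BstNode(int k) : key(k), left(nullptr), right(nullptr) {}
};

// Standard binary search tree insertion, used only to build example trees.
BstNode* bstInsert(BstNode* root, int key) {
    if (root == nullptr) return new BstNode(key);
    if (key < root->key) root->left = bstInsert(root->left, key);
    else if (key > root->key) root->right = bstInsert(root->right, key);
    return root;
}

// Removes `key` from the tree rooted at `root` and returns the new root.
BstNode* deleteByCopying(BstNode* root, int key) {
    if (root == nullptr) return nullptr;                 // key not present
    if (key < root->key) {
        root->left = deleteByCopying(root->left, key);
    } else if (key > root->key) {
        root->right = deleteByCopying(root->right, key);
    } else if (root->left != nullptr && root->right != nullptr) {
        // Two children: locate the predecessor, i.e. the rightmost
        // node of the left subtree.
        BstNode* pred = root->left;
        while (pred->right != nullptr) pred = pred->right;
        assert(pred->right == nullptr);                  // leaf or one (left) child
        root->key = pred->key;                           // k2 overwrites k1
        // Remove the node that originally held k2 (one of the simple cases).
        root->left = deleteByCopying(root->left, pred->key);
    } else {
        // Leaf or exactly one child: splice the node out.
        BstNode* child = (root->left != nullptr) ? root->left : root->right;
        delete root;
        return child;
    }
    return root;
}

// True if `key` is stored in the tree.
bool contains(const BstNode* root, int key) {
    while (root != nullptr) {
        if (key == root->key) return true;
        root = (key < root->key) ? root->left : root->right;
    }
    return false;
}
```

Using the successor instead of the predecessor would be the symmetric variant; alternating between the two is one common way to soften the asymmetry mentioned above.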
There is a one-to-one mapping between general ordered trees and binary trees, which in particular is used by Lisp to represent general ordered trees as binary trees. To convert a general ordered tree to a binary tree, we only need to represent the general tree in the left child-right sibling way. Viewed from a different perspective, the result of this representation is automatically a binary tree. Each node N in the ordered tree corresponds to a node N' in the binary tree; the left child of N' is the node corresponding to the first child of N, and the right child of N' is the node corresponding to the next sibling of N, that is, the next node in order among the children of the parent of N. This binary tree representation of a general ordered tree is sometimes also referred to as a left child-right sibling binary tree (LCRS tree), a doubly chained tree, or a filial-heir chain. One way of thinking about this is that each node's children form a linked list, chained together through their right fields, and the node only holds a pointer to the head of this list, through its left field. For example, in the tree on the left, A has the six children {B,C,D,E,F,G}. It can be converted into the binary tree on the right.

Fig. 2.7
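The conversion into left child-right sibling form can be sketched in a few lines. The types `GeneralNode` and `LcrsNode` and the function `toLcrs` are hypothetical names chosen for this illustration; children are assumed to be non-null.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical general (n-ary) tree node, used only for this sketch.
struct GeneralNode {
    char label;
    std::vector<GeneralNode*> children;   // ordered list of (non-null) children
    explicit GeneralNode(char l) : label(l) {}
};

// Binary node in left child-right sibling form.
struct LcrsNode {
    char label;
    LcrsNode* left;    // first child in the general tree
    LcrsNode* right;   // next sibling in the general tree
    explicit LcrsNode(char l) : label(l), left(nullptr), right(nullptr) {}
};

// Builds the LCRS binary tree that corresponds to the general tree `n`.
LcrsNode* toLcrs(const GeneralNode* n) {
    if (n == nullptr) return nullptr;
    LcrsNode* result = new LcrsNode(n->label);
    LcrsNode* prev = nullptr;
    for (std::size_t i = 0; i < n->children.size(); ++i) {
        assert(n->children[i] != nullptr);
        LcrsNode* child = toLcrs(n->children[i]);
        if (prev == nullptr) result->left = child;   // first child hangs off `left`
        else prev->right = child;                    // later children chain through `right`
        prev = child;
    }
    return result;
}
```

In the converted tree, walking `right` pointers from a node's `left` child enumerates exactly the children that node had in the general tree.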

2.4 Types of binary trees

There are several types of binary trees:
1. Rooted Binary Tree is a tree with a root node in which every node has at most two children.
2. Full Binary Tree is a tree in which every node other than the leaves has two children.
3. Perfect Binary Tree is a full binary tree in which all leaves are at the same depth (the same level).
4. Complete Binary Tree is a binary tree in which every level, except possibly the last, is completely filled, and all nodes are as far left as possible.
5. Infinite Complete Binary Tree is a tree with ℵ0 levels, where for each level d the number of existing nodes at level d is equal to 2^d. The cardinal number of the set of all nodes is ℵ0. The cardinal number of the set of all paths is 2^ℵ0.
6. Balanced Binary Tree is commonly defined as a binary tree in which the heights of the two subtrees of every node never differ by more than 1, although more generally it is a binary tree in which no leaf is much farther away from the root than any other leaf. Balanced trees are important in information retrieval applications.
7. Rooted Complete Binary Tree can be identified with a free magma.
8. Degenerate Tree is a tree in which each parent node has only one associated child node. This means that, in terms of performance, the tree behaves like a linked list.
9. Tango Tree is a tree optimized for fast searches.
10. Strictly Binary Tree is a tree that is fully expanded, that is, every non-leaf node has exactly two children.
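Several of the definitions in this list translate directly into short recursive checks. The sketch below assumes a hypothetical `BinNode` type and measures height in nodes (an empty tree has height 0); it recomputes heights repeatedly, so it favours readability over efficiency.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>

// Hypothetical binary node for this sketch; only the shape matters here.
struct BinNode {
    BinNode* left;
    BinNode* right;
    BinNode(BinNode* l = nullptr, BinNode* r = nullptr) : left(l), right(r) {}
};

// Height in nodes: 0 for an empty tree, 1 for a single leaf.
int height(const BinNode* n) {
    return n == nullptr ? 0 : 1 + std::max(height(n->left), height(n->right));
}

// Full: every node has either zero or two children.
bool isFull(const BinNode* n) {
    if (n == nullptr) return true;
    if ((n->left == nullptr) != (n->right == nullptr)) return false;
    return isFull(n->left) && isFull(n->right);
}

// Perfect: both subtrees of every node have the same height,
// which forces all leaves onto the same level.
bool isPerfect(const BinNode* n) {
    if (n == nullptr) return true;
    return height(n->left) == height(n->right)
        && isPerfect(n->left) && isPerfect(n->right);
}

// Balanced: subtree heights differ by at most 1 at every node.
bool isBalanced(const BinNode* n) {
    if (n == nullptr) return true;
    return std::abs(height(n->left) - height(n->right)) <= 1
        && isBalanced(n->left) && isBalanced(n->right);
}

// Degenerate: no node has two children, so the tree is a chain.
bool isDegenerate(const BinNode* n) {
    if (n == nullptr) return true;
    if (n->left != nullptr && n->right != nullptr) return false;
    return isDegenerate(n->left != nullptr ? n->left : n->right);
}
```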


THE CLASS HIERARCHY

3.1 Inheritance, classes and subclasses

First, it is important to know what class hierarchies in C++ are. In any object-oriented language, classes serve as templates for individual objects. Each object is an instance of a particular class, which can serve as a pattern for many different objects. One of the defining characteristics of the object-oriented paradigm is that classes form hierarchies. Any class can be designated as a subclass of some other class, which is called its superclass. Most class hierarchies are tree-structured, even though C++ permits more complicated structures. A class represents a specialization of its superclass. If you create an object that is an instance of a class, that object is also an instance of all other classes above it in the superclass chain. When you define a new class in C++ as a subclass, that class automatically inherits the behavior of its superclass.

A superclass allows a generic interface to include specialized functionality through the use of virtual functions. The superclass mechanism is used extensively in object-oriented programming because of the reusability it provides: common features are encapsulated in modular objects. Subclasses that wish to implement special behavior can do so via virtual methods, without having to duplicate (reimplement) the superclass's behavior. Languages may support both abstract and concrete superclasses.

A subclass is a class that inherits some properties from its superclass. One can usually think of the subclass as being "a kind of" its superclass, as in "a square is a kind of rectangle". In this way, a subclass is a more specific version of its superclass: while all rectangles have four sides, the square has the more restricted feature that all of its sides have the same length. The subclass-superclass relationship is often confused with that of classes and instances. An "instance of cat" refers to one particular cat. The Manx cat, by contrast, is still a class, since there are many instances of Manx cats. And if a particular cat (an instance of the cat class) happens to have its tail bitten off by a fox, that does not change the cat class; only that particular cat has changed.

Subclasses and superclasses are often referred to as derived and base classes, respectively, terms coined by C++ creator Bjarne Stroustrup, who found these terms more intuitive than the traditional nomenclature. Derivation is the definition of a new class by extending an existing class. The new class is called the derived class and the existing class from which it is derived is called the base class. The base class at the top of a hierarchy does not inherit from any other class, while other classes can inherit from it. In C++ inheritance, the derived class inherits the features of the base class. The derived class can also add its own features (data, functions and so on), and it can override some of the functions of the base class if they are declared as virtual in the base class. A derived class can extend the base class in several ways: new instance attributes can be added, new methods can be defined, and existing methods can be overridden. If a virtual method is defined in a derived class with the same name and signature as a method in the base class, the method in the derived class overrides the one in the base class. An instance of a derived class can be used anywhere in a program where an instance of the base class may be used.

C++ inheritance is very similar to a parent-child relationship. When a class is inherited, all its member functions and data members are inherited, although not all of them will be accessible to the member functions of the derived class. There are some exceptions, however: constructors, destructors, the assignment operator and friend declarations are not inherited. Because expressions have more than one form, a C++ class that represents expressions can be represented most easily by a class hierarchy in which each of the expression types is a separate subclass, as shown in the following diagram:

Fig. 3.1


Even though the class hierarchy is organized in terms of the different types of nodes, clients of the expression package will almost always work with pointers to nodes instead. The pointer type is therefore given the name expressionT. The first step in creating a C++ subclass is to indicate the superclass on the header line, using the following syntax:

    class subclass : public superclass {
        body of class definition
    };

In contrast to Java, a method of a C++ superclass cannot be overridden automatically. To permit overriding, the superclass must mark the prototype for that method with the keyword virtual; repeating the keyword in the subclass is optional. An abstract class is a class that doesn't actually represent any objects but instead serves only as a common superclass for concrete classes that do generate objects. In C++, methods of an abstract class that are always implemented by the concrete subclasses are indicated by including = 0 before the semicolon on the prototype line.

The generalization relationship indicates that one of the two related classes (the subclass) is considered to be a specialized form of the other (the superclass), and the superclass is considered a generalization of the subclass. In practice, this means that any instance of the subclass is also an instance of the superclass. An exemplary tree of generalizations of this form is found in binomial nomenclature: human beings are a subclass of simian, which are a subclass of mammal, and so on. The relationship is most easily understood by the phrase "an A is a B" (a human is a mammal, a mammal is an animal).

Inheritance is a way to compartmentalize and reuse code by creating collections of attributes and behaviors, called objects, which can be based on previously created objects. In classical inheritance, where objects are defined by classes, classes can inherit from other classes. The new classes, known as subclasses (or derived classes), inherit the attributes and behavior (i.e. the previously coded algorithms) of the pre-existing classes, which are referred to as superclasses (or ancestor classes). The inheritance relationships between classes give rise to a hierarchy. In prototype-based programming, objects can be defined directly from other objects without the need to define any classes, in which case this feature is called differential inheritance. Inheritance does not entail behavioral subtyping either. It is entirely possible to derive a class whose objects behave incorrectly when used in a context where the parent class is expected; see the Liskov substitution principle.
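The ideas of this section (an abstract superclass, virtual methods, and the "a square is a kind of rectangle" relationship) can be put together in a minimal sketch. The classes `Shape`, `Rectangle` and `Square` and the function `describe` are hypothetical examples and do not belong to the tree library.

```cpp
#include <cassert>
#include <string>

// Abstract superclass: "= 0" marks a pure virtual method that every
// concrete subclass has to implement.
class Shape {
public:
    virtual ~Shape() {}
    virtual double area() const = 0;
    virtual std::string name() const { return "shape"; }
};

// Concrete subclass of Shape.
class Rectangle : public Shape {
public:
    Rectangle(double w, double h) : width(w), height(h) {
        assert(w >= 0.0 && h >= 0.0);
    }
    virtual double area() const { return width * height; }
    virtual std::string name() const { return "rectangle"; }
protected:
    double width, height;
};

// A square is a kind of rectangle: it inherits area() unchanged
// and overrides only name().
class Square : public Rectangle {
public:
    explicit Square(double side) : Rectangle(side, side) {}
    virtual std::string name() const { return "square"; }
};

// Works with any concrete subclass through a reference to the superclass.
std::string describe(const Shape& s) {
    return s.name() + (s.area() > 10.0 ? " (large)" : " (small)");
}
```

Because `name()` is virtual, `describe` picks the most derived version at run time even though it only sees a `Shape` reference.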

3.2 The Tree class hierarchy


For a better understanding of the tree class hierarchy I have represented the classes and their relationships in a class diagram.

Fig. 3.2

A class diagram is a type of static structure diagram that describes the structure of a system by showing the system's classes, their attributes, their operations (or methods) and the relationships between the classes. It is a prime example of the structure diagram type and provides an initial set of notation elements that all other structure diagrams use. It is worth remembering that a class diagram is a static view of a system. Developers refer to class diagrams time and again while implementing the system.

In a class diagram, the classes are arranged in groups that share common characteristics. A class diagram resembles a flowchart in which classes are portrayed as boxes, each box having three rectangles inside. The top rectangle contains the name of the class; the middle rectangle contains the attributes of the class; the lower rectangle contains the methods, also called operations, of the class. Lines, which may have arrows at one or both ends, connect the boxes. These lines define the relationships, also called associations, between the classes. A UML class diagram is similar to a family tree. A class diagram consists of a group of classes and interfaces reflecting important entities. The classes and interfaces in the diagram represent the members of a family tree, and the relationships between the classes are analogous to the relationships between members of a family tree. Classes in a class diagram are interconnected in a hierarchical fashion, like a set of parent classes with related child classes under them.

Inheritance, a very important concept in object-oriented design, refers to the ability of one class (the child class) to inherit the functionality of another class (the superclass) and then add new functionality of its own. To model inheritance on a class diagram, a solid line is drawn from the child class (the class inheriting the behavior) with a closed, unfilled arrowhead (or triangle) pointing to the superclass. Inheritance models "is a" and "is like" relationships, enabling you to easily reuse existing data and code. When A inherits from B, we say that A is the subclass of B and that B is the superclass of A. Furthermore, we say that we have pure inheritance when A inherits all of the attributes and methods of B.

In our case we can see that the Tree class is the base class. RootedTree is a subclass of the Tree class and inherits from it. It can use all the methods of the parent class and can also define new methods. In the same way, BinaryTree inherits from the RootedTree class and is a subclass of it. It can use all the methods of the RootedTree class and also has its own methods. The TreeNode class is a base class that doesn't have any children or parents, and for this reason it has no inheritance relationship with the other classes.


3.3 The Tree class


3.3.1. Class Diagram

Fig. 3.3

3.3.2 General description

The Tree class is the base class (superclass); all the other tree classes are subclasses because they are derived from it. This class declares the iterators and the tree node id. It is inherited by the RootedTree class and is declared like this:

    class Tree
    {
    public:
        Tree(void);
        Tree(int);
        Tree(Iterator &);

3.4 The TreeNode class

3.4.1 The class diagram

Fig. 3.4

3.4.2 General description

The TreeNode class contains the declaration of the nodes and their id. It is a base class that doesn't have any subclasses. It contains three methods, which are defined below:

    TreeNode::TreeNode(int id, TreeNode * pa, TreeNode * ll, TreeNode * rr, TreeNode * pr, TreeNode * nn)
    {
        nodeId = id;
        parent = pa;
        left = ll;
        right = rr;
        prev = pr;
        next = nn;
    }

    TreeNode::~TreeNode(void)
    {
    }

    void TreeNode::display()
    {
        // %p is the portable conversion for pointers; %x expects an unsigned int
        printf("nodeId = %d myNode = %p botLeft = %p rightUp = %p\n",
               nodeId, (void *)this, (void *)left, (void *)right);
    }

3.5 The RootedTree class

3.5.1 The class diagram

Fig. 3.5

3.5.2 General description

The RootedTree class is a subclass of the Tree class; in other words, it derives from the base class Tree. A derived class inherits all the attributes of its base class. That is, the derived class contains all the attributes contained in the base class and supports all the operations provided by the base class. In this case, besides the three methods that are declared in the class itself, the RootedTree class can use all the methods of the Tree class. A derived class can also have methods of its own, and it can in turn become a base class for other classes derived from it, so that the inheritance chain can be deliberately extended. The methods of this class are defined in the code below:
    RootedTree::RootedTree(string file_spec)
    {
        ifstream ifs;
        ifs.open(file_spec);
        if (!ifs) {
            cout << "Error: file could not be opened" << endl;
            return;
        }

        string line;
        char * tok = NULL;
        char * context = NULL;   // strtok_s stores its position here between calls
        int nodeId;
        hash_map <int, void *> idToNodeMap;
        hash_map <int, void *> :: const_iterator myIter;
        TreeNode * firstNode = NULL;
        TreeNode * prevNode = NULL;
        TreeNode * currNode = NULL;

        while (getline(ifs, line)) {
            cout << "[ " << line << " ]" << endl;

            char * myLine = _strdup(line.c_str());
            tok = strtok_s(myLine, "-", &context);
            if (tok == NULL) {           // empty line: nothing to parse
                free(myLine);
                continue;
            }
            nodeId = atoi(tok);
            if (firstNode == NULL) {
                // the first id in the file becomes the root
                theRoot = new TreeNode(nodeId, NULL, NULL, NULL);
                firstNode = theRoot;
                idToNodeMap[nodeId] = theRoot;
            } else {
                // every later parent id must already have been seen as a child
                myIter = idToNodeMap.find(nodeId);
                if (myIter == idToNodeMap.end()) {
                    cout << "error in input file " << endl;
                    exit(1);
                } else {
                    firstNode = (TreeNode *)myIter->second;
                }
            }
            prevNode = firstNode;
            tok = strtok_s(NULL, ",", &context);
            if (tok != NULL) {
                // first id after the dash: linked through the left pointer
                nodeId = atoi(tok);
                TreeNode * currNode = new TreeNode(nodeId, NULL, NULL, firstNode);
                idToNodeMap[nodeId] = currNode;
                prevNode->left = currNode;
                prevNode = currNode;
            }
            while (tok = strtok_s(NULL, ",", &context)) {
                // remaining ids: chained through the right pointer
                nodeId = atoi(tok);
                TreeNode * currNode = new TreeNode(nodeId, NULL, NULL, firstNode);
                idToNodeMap[nodeId] = currNode;
                prevNode->right = currNode;
                prevNode = currNode;
            }
            free(myLine);
        }
        cout << " ** tree file loaded" << endl;
    }

    RootedTree::~RootedTree(void)
    {
    }

    RootedTree * RootedTree::binaryEmbed()
    {
        return 0;
    }
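Judging from the delimiters used in the constructor ("-" first, then ","), each line of the input file appears to have the form parent-child1,child2,... This format is an inference from the code, not something the text states. Under that assumption, the tokenizing step could also be written with standard C++ streams instead of the Microsoft-specific strtok_s; `ParsedLine` and `parseTreeLine` are hypothetical names used only in this sketch.

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// One parsed line: a parent id followed by its child ids
// (assumed line format: "7-3,4,9").
struct ParsedLine {
    int parentId;
    std::vector<int> childIds;
};

// Splits `line` into parent and children.
// Returns false if there is no parent id before the '-'.
bool parseTreeLine(const std::string& line, ParsedLine& out) {
    std::string::size_type dash = line.find('-');
    std::string head = line.substr(0, dash);        // whole line if no dash
    if (head.empty()) return false;
    out.parentId = std::atoi(head.c_str());
    out.childIds.clear();
    if (dash == std::string::npos) return true;     // a node without children
    std::istringstream rest(line.substr(dash + 1));
    std::string tok;
    while (std::getline(rest, tok, ',')) {
        if (!tok.empty()) out.childIds.push_back(std::atoi(tok.c_str()));
    }
    return true;
}
```

Separating parsing from node construction in this way keeps the file format in one place, so the linking of nodes could be tested without reading a file.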

3.6 The BinaryTree class

3.6.1 The class diagram

Fig. 3.6

3.6.2 General description

The BinaryTree class is derived from the RootedTree class, which therefore acts as its base class. It has all the methods of the RootedTree class and adds two new methods, a constructor and a destructor, whose bodies are currently empty. The definitions of these methods look as follows:

    BinaryTree::BinaryTree(void)
    {
    }

    BinaryTree::~BinaryTree(void)
    {
    }


ITERATORS

4.1 Iterators

In C++, an iterator is any object that, pointing to some element in a range of elements (such as an array or a container), has the ability to iterate through the elements of that range using a set of operators (at least the increment (++) and dereference (*) operators). An iterator provides a means for visiting, one by one, all the objects in a container. Iterators are an alternative to using the visitor. The basic idea is that for every concrete container class we also implement a related concrete iterator derived from an abstract Iterator class. Probably the best definition of an iterator is this: provide a way to access the elements of an aggregate object sequentially without exposing its underlying representation.

The most obvious form of iterator is a pointer: a pointer can point to elements in an array and can iterate through them using the increment operator (++). But other forms of iterators exist. For example, each container type (such as a vector) has a specific iterator type designed to iterate through its elements in an efficient way. Iterators are not unique to C++. An iterator is something that allows two parties, generally the consumer of some data structure ("client code") and the implementer of the data structure ("library code"), to communicate without concern for each other's internal details. This principle of intentional ignorance is what lets a collection of elements (in any language) expose those elements to the outside world without revealing the details of the collection's internal implementation, i.e. whether it is a hash table, linked list, tree, or some other sort of data structure.

Notice that while a pointer is a form of iterator, not all iterators have the same functionality a pointer has. To distinguish between the requirements an iterator must meet for a specific algorithm, five different iterator categories exist:


Fig. 4.1

In this graph, each iterator category implements the functionalities of all categories to its right. Input and output iterators are the most limited types of iterators, specialized in performing only sequential input or output operations. Forward iterators have all the functionality of input and output iterators, although they are limited to one direction in which to iterate through a range. Bidirectional iterators can be iterated through in both directions. All C++03 standard containers support at least bidirectional iterators (C++11 adds containers, such as forward_list, that offer only forward iterators). Random access iterators implement all the functionalities of bidirectional iterators and, in addition, can access ranges non-sequentially: offsets can be applied directly to these iterators without iterating through all the elements in between. This gives these iterators the same functionality as standard pointers (pointers are iterators of this category). The characteristics of each category of iterators are:

Fig. 4.2


Using the knowledge of an iterator's category, one can provide optimized implementations of an algorithm. The advance() operation is an example. It increments (or, for negative n, decrements) an iterator by n positions.

    template <class Iterator, class Distance>
    inline void advance (Iterator& i, Distance n);

Obviously, there are many ways to do this. For a C++ array one would simply perform pointer arithmetic, i.e., add n to the C++ pointer:

    i += n;

For a list, the iterator must step through the sequence one element at a time:

    if (n >= 0) while (n--) ++i;
    else while (n++) --i;

The iterator category, which is an abstraction that represents a set of requirements on an iterator, is information related to an iterator. It is useful for providing optimized versions of an operation like advance(). There are two further types that may vary depending on the iterator type:

1. The Distance Type: An operation like advance() obviously needs an argument that indicates how far to advance the iterator:

    template <class Iterator, class Distance>
    inline void advance (Iterator& i, Distance n);

The type of this distance argument must be able to represent the distance between any two iterators. Hence the distance type depends on the iterator type. For C++ pointers the distance type is the C++ type ptrdiff_t, which can represent the difference between any two C++ pointers. ptrdiff_t is also the distance type of the other iterators provided by the STL and the Standard C++ Library, although in principle the distance type is not limited to ptrdiff_t.
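One common way to select the appropriate variant at compile time is tag dispatching on the iterator category. The sketch below illustrates the idea; `myAdvance` and `advanceImpl` are hypothetical names, and actual standard library implementations of std::advance may differ in detail.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <vector>

// Input and forward iterators: step one element at a time, forwards only.
template <class Iterator, class Distance>
void advanceImpl(Iterator& i, Distance n, std::input_iterator_tag) {
    assert(n >= 0);
    while (n--) ++i;
}

// Bidirectional iterators may also step backwards.
template <class Iterator, class Distance>
void advanceImpl(Iterator& i, Distance n, std::bidirectional_iterator_tag) {
    if (n >= 0) {
        while (n--) ++i;
    } else {
        while (n++) --i;
    }
}

// Random access iterators jump in constant time.
template <class Iterator, class Distance>
void advanceImpl(Iterator& i, Distance n, std::random_access_iterator_tag) {
    i += n;
}

// Picks the overload from the iterator's category at compile time.
template <class Iterator, class Distance>
void myAdvance(Iterator& i, Distance n) {
    typedef typename std::iterator_traits<Iterator>::iterator_category Category;
    advanceImpl(i, n, Category());
}
```

Because the category tags form an inheritance chain, a forward iterator falls back to the input-iterator overload, while list iterators and pointers select the bidirectional and random access overloads respectively.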


2. The Value Type: An iterator can be dereferenced. It then returns a reference to a value stored in a container. The type of this referenced value also depends on the respective iterator. For example, if the iterator refers to a container holding integers, the value type will be int. More generally, if the iterator refers to a container that stores elements of an arbitrary type T, the value type will be T.

Each iterator thus has two related types, its value type and its distance type. Both are sometimes needed to implement algorithms. In the STL and the Standard C++ Library, algorithms are separated from containers, i.e., an algorithm takes an iterator and uses it to access the container. No information about the container itself is available to the algorithm. This clear separation of containers and algorithms is the basic idea of generic programming, which is the key design idea behind the STL. Iterator safety is defined separately for the different types of standard containers; in some cases the iterator is very permissive in allowing the container to change while iterating. There are many varieties of iterators, each with slightly different behavior, but not every type of container supports every type of iterator. Users can create their own iterator types by deriving subclasses from the standard std::iterator class template, and this is the most convenient way in our case too.

There are several reasons to use iterators:
Subscripts are not always possible. Subscripts cannot be used on most of the containers (e.g. list and map), so iterators must be used in many cases.
Flexibility. It is easy to change the underlying container type. For example, you might decide later that the number of insertions and deletions is so high that a list would be more efficient than a vector.
Member functions. Many of the member functions of vector use iterators, for example assign, insert, or erase.
Algorithms. The <algorithm> functions use iterators.
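In standard C++, the value type and the distance type of an iterator are available through std::iterator_traits (as value_type and difference_type). The following sketch shows two small generic algorithms that rely on them; `sumRange` and `countSteps` are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <vector>

// Sums a range; the accumulator has the iterator's value type.
template <class Iterator>
typename std::iterator_traits<Iterator>::value_type
sumRange(Iterator first, Iterator last) {
    typedef typename std::iterator_traits<Iterator>::value_type Value;
    Value total = Value();                 // value-initialized: 0 for numbers
    for (; first != last; ++first) total += *first;
    return total;
}

// Counts the steps from first to last; the result has the iterator's
// distance type (ptrdiff_t for pointers and the standard containers).
template <class Iterator>
typename std::iterator_traits<Iterator>::difference_type
countSteps(Iterator first, Iterator last) {
    typename std::iterator_traits<Iterator>::difference_type n = 0;
    for (; first != last; ++first) ++n;
    return n;
}
```

Neither function knows which container it is working on; the iterator alone supplies the types, which is exactly the separation of containers and algorithms described above.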

Iterators that have greater requirements, and therefore more powerful access to elements, may be used in place of iterators with fewer requirements. In my case, I decided to represent the iterators that I used in a class diagram:


Fig. 4.3

As we can see, the Iterator class is the base class. All the other classes (PostOrderIter, SiblingIter, BreathFirstIter, PreOrderIter) are derived from the Iterator class. The essential difference between a container with the structure of a tree and the STL containers is that the latter are "linear". While the STL containers thus have essentially only one way in which one can iterate over their elements, this is not true for trees. The tree library provides (at present) four different iteration schemes.

4.2 The Base Iterator Class

4.2.1 The class diagram

Fig 4.4


4.2.2 General description

The base Iterator class is the base class for four other classes. It contains two fields and six methods. The increment and decrement operators return the iterator by reference (Iterator &). Here is the code for each method:

    Iterator::Iterator(void)
    {
    }

    Iterator::Iterator(TreeNode * node)
    {
        theNode = node;
    }

    Iterator::~Iterator(void)
    {
    }

    Iterator & Iterator::operator++()
    {
        return *this;
    }

    Iterator & Iterator::operator--()
    {
        return *this;
    }

    void Iterator::skipChildren()
    {
        skipCurrChildren = true;
    }

    void Iterator::skipChildren(bool skip)
    {
        skipCurrChildren = skip;
    }


4.3 The PreOrderIter Class

4.3.1 The Class Diagram

Fig 4.5

4.3.2 General Description

The PreOrderIter class is derived from the Iterator class. It has all the methods of the Iterator class and defines three new methods. This class is not a base class for any other class.
    PreOrderIter::PreOrderIter(void)
    {
    }

    PreOrderIter::PreOrderIter(TreeNode * node) : Iterator(node)
    {
        skipCurrChildren = false;
    }

    PreOrderIter::PreOrderIter(const Iterator & iter) : Iterator(iter.theNode)
    {
    }

    PreOrderIter & PreOrderIter::operator++()
    {
        assert(this->theNode != 0);
        if (!this->skipCurrChildren && this->theNode->left != 0) {
            // descend to the first child
            this->theNode = this->theNode->left;
        } else {
            this->skipCurrChildren = false;
            // no child to visit: climb until a next sibling exists
            while (this->theNode->next == 0) {
                this->theNode = this->theNode->parent;
                if (this->theNode == 0)
                    return *this;
            }
            this->theNode = this->theNode->next;
        }
        return *this;
    }

    PreOrderIter & PreOrderIter::operator--()
    {
        // Assumes prev is the previous sibling and right the last child,
        // as in PostOrderIter::operator--.
        assert(this->theNode != 0);
        if (this->theNode->prev != 0) {
            // the pre-order predecessor is the last node of the previous sibling's subtree
            this->theNode = this->theNode->prev;
            while (this->theNode->right != 0)
                this->theNode = this->theNode->right;
        } else {
            this->theNode = this->theNode->parent;
        }
        return *this;
    }

    bool PreOrderIter::operator==(PreOrderIter & it)
    {
        return it.theNode == this->theNode;
    }

    bool PreOrderIter::operator!=(PreOrderIter & it)
    {
        return it.theNode != this->theNode;
    }
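The stepping rule of the pre-order increment can be tried out in isolation. The sketch below uses a hypothetical `Node` type reduced to the links the rule needs (parent, first child, last child, next sibling) and leaves out the skip-children flag; it is an illustration, not the library's TreeNode.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for TreeNode, reduced to the links used here.
struct Node {
    int id;
    Node* parent;
    Node* left;    // first child
    Node* right;   // last child
    Node* next;    // next sibling
    explicit Node(int i)
        : id(i), parent(nullptr), left(nullptr), right(nullptr), next(nullptr) {}
};

// Appends `child` as the last child of `parent`.
void appendChild(Node* parent, Node* child) {
    child->parent = parent;
    if (parent->left == nullptr) parent->left = child;
    else parent->right->next = child;
    parent->right = child;
}

// Same stepping rule as the pre-order increment, without the skip flag.
Node* preOrderNext(Node* n) {
    assert(n != nullptr);
    if (n->left != nullptr) return n->left;    // descend to the first child
    while (n->next == nullptr) {               // otherwise climb until a sibling exists
        n = n->parent;
        if (n == nullptr) return nullptr;      // climbed past the root: done
    }
    return n->next;
}

// Collects the ids of the whole tree in pre-order.
std::vector<int> preOrderIds(Node* root) {
    std::vector<int> ids;
    for (Node* n = root; n != nullptr; n = preOrderNext(n)) ids.push_back(n->id);
    return ids;
}
```

For a root 1 with children 2 and 3, where 2 has children 4 and 5, the visiting order is 1, 2, 4, 5, 3: every parent is reported before its children.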


4.4 The PostOrderIter Class

4.4.1 The Class Diagram

Fig 4.6

4.4.2 General Description

The PostOrderIter class is also derived from the Iterator class. It is not a base class for any other class, and it defines three new methods besides the ones inherited from the Iterator class.
    PostOrderIter::PostOrderIter(void)
    {
    }

    PostOrderIter::PostOrderIter(TreeNode * node) : Iterator(node)
    {
    }

    PostOrderIter::PostOrderIter(const Iterator & iter) : Iterator(iter.theNode)
    {
    }

    PostOrderIter & PostOrderIter::operator++()
    {
        assert(this->theNode != 0);
        if (this->theNode->next == 0) {
            // last sibling visited: the parent comes next
            this->theNode = this->theNode->parent;
            this->skipCurrChildren = false;
        } else {
            this->theNode = this->theNode->next;
            if (this->skipCurrChildren) {
                this->skipCurrChildren = false;
            } else {
                // dive to the deepest first child of the next sibling
                while (this->theNode->left)
                    this->theNode = this->theNode->left;
            }
        }
        return *this;
    }

    PostOrderIter & PostOrderIter::operator--()
    {
        assert(this->theNode != 0);
        if (this->skipCurrChildren || this->theNode->right == 0) {
            this->skipCurrChildren = false;
            while (this->theNode->prev == 0)
                this->theNode = this->theNode->parent;
            this->theNode = this->theNode->prev;
        } else {
            this->theNode = this->theNode->right;
        }
        return *this;
    }

    bool PostOrderIter::operator==(PostOrderIter & it)
    {
        return it.theNode == this->theNode;
    }

    bool PostOrderIter::operator!=(PostOrderIter & it)
    {
        return it.theNode != this->theNode;
    }
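The post-order increment can be illustrated in the same way. A post-order traversal starts at the deepest first child rather than at the root, so the sketch adds a `postOrderBegin` helper; `Node`, `appendChild`, `postOrderBegin`, `postOrderNext` and `postOrderIds` are hypothetical names, and the skip-children flag is again left out.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for TreeNode, reduced to the links used here.
struct Node {
    int id;
    Node* parent;
    Node* left;    // first child
    Node* right;   // last child
    Node* next;    // next sibling
    explicit Node(int i)
        : id(i), parent(nullptr), left(nullptr), right(nullptr), next(nullptr) {}
};

// Appends `child` as the last child of `parent`.
void appendChild(Node* parent, Node* child) {
    child->parent = parent;
    if (parent->left == nullptr) parent->left = child;
    else parent->right->next = child;
    parent->right = child;
}

// The first node in post-order is the deepest first child.
Node* postOrderBegin(Node* root) {
    assert(root != nullptr);
    while (root->left != nullptr) root = root->left;
    return root;
}

// Same stepping rule as the post-order increment, without the skip flag.
Node* postOrderNext(Node* n) {
    assert(n != nullptr);
    if (n->next == nullptr) return n->parent;    // all siblings done: visit the parent
    n = n->next;
    while (n->left != nullptr) n = n->left;      // dive into the next sibling's subtree
    return n;
}

// Collects the ids of the whole tree in post-order.
std::vector<int> postOrderIds(Node* root) {
    std::vector<int> ids;
    for (Node* n = postOrderBegin(root); n != nullptr; n = postOrderNext(n))
        ids.push_back(n->id);
    return ids;
}
```

For the same example tree (root 1 with children 2 and 3, where 2 has children 4 and 5) the order is 4, 5, 2, 3, 1: every parent is reported after all of its children.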

4.5 The SiblingIter Class

4.5.1 The Class Diagram

Fig. 4.7

4.5.2 General Description

The SiblingIter class is derived from the Iterator class and defines six new methods.
    SiblingIter::SiblingIter(void)
    {
    }

    SiblingIter::SiblingIter(TreeNode * node) : Iterator(node)
    {
        setParent();
    }

    SiblingIter::SiblingIter(const Iterator & iter) : Iterator(iter.theNode)
    {
        setParent();
    }

    SiblingIter & SiblingIter::operator++()
    {
        if (this->theNode)
            this->theNode = this->theNode->next;
        return *this;
    }

    SiblingIter & SiblingIter::operator--()
    {
        if (this->theNode)
            this->theNode = this->theNode->prev;
        else {
            // stepping back from the end position: go to the parent's last child
            assert(theParent);
            this->theNode = theParent->right;
        }
        return *this;
    }

    bool SiblingIter::operator==(SiblingIter & it)
    {
        return it.theNode == this->theNode;
    }

    bool SiblingIter::operator!=(SiblingIter & it)
    {
        return it.theNode != this->theNode;
    }

    void SiblingIter::setParent()
    {
        theParent = 0;
        if (this->theNode == 0)
            return;
        if (this->theNode->parent != 0)
            theParent = this->theNode->parent;
    }


SERIALIZATION

5.1 The GraphML File Format

5.1.1 Functional Description

GraphML is an XML-based file format for graphs. The GraphML file format results from the joint effort of the graph drawing community to define a common format for exchanging graph structure data. It uses an XML-based syntax and supports the entire range of possible graph structure constellations including directed, undirected, mixed graphs, hypergraphs, and application-specific attributes. GraphML Primer is a non-normative document intended to provide an easily readable description of the GraphML facilities, and is oriented towards quickly understanding how to create GraphML documents. This primer describes the language features through examples which are complemented by references to normative texts.

5.1.2 Virtual Presentation

The purpose of a GraphML document is to define a graph. Let us start by considering the graph shown in the figure below. It contains 11 nodes and 12 edges.

Fig. 5.1


The graph is contained in the file simple.graphml:

    <?xml version="1.0" encoding="UTF-8"?>
    <graphml xmlns="http://graphml.graphdrawing.org/xmlns"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
         http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
      <graph id="G" edgedefault="undirected">
        <node id="n0"/>
        <node id="n1"/>
        <node id="n2"/>
        <node id="n3"/>
        <node id="n4"/>
        <node id="n5"/>
        <node id="n6"/>
        <node id="n7"/>
        <node id="n8"/>
        <node id="n9"/>
        <node id="n10"/>
        <edge source="n0" target="n2"/>
        <edge source="n1" target="n2"/>
        <edge source="n2" target="n3"/>
        <edge source="n3" target="n5"/>
        <edge source="n3" target="n4"/>
        <edge source="n4" target="n6"/>
        <edge source="n6" target="n5"/>
        <edge source="n5" target="n7"/>
        <edge source="n6" target="n8"/>
        <edge source="n8" target="n7"/>
        <edge source="n8" target="n9"/>
        <edge source="n8" target="n10"/>
      </graph>
    </graphml>

The GraphML document consists of a graphml element and a variety of subelements: graph, node, edge. The first line of the document is an XML processing instruction which defines that the document adheres to the XML 1.0 standard and that the encoding of the document is UTF-8, the standard encoding for XML documents. Of course, other encodings can be chosen for GraphML documents. The second line contains the root element of a GraphML document: the graphml element. The graphml element, like all other GraphML elements, belongs to the namespace http://graphml.graphdrawing.org/xmlns. For this reason we define this namespace as the default namespace in the document by adding the XML attribute xmlns="http://graphml.graphdrawing.org/xmlns" to it. The two other XML attributes are needed to specify the XML Schema for this document. In our example we use the standard schema for GraphML documents located on the graphdrawing.org server. The first attribute, xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance", defines xsi as the XML Schema namespace. The second attribute, xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd", defines the XML Schema location for all elements in the GraphML namespace. The XML Schema reference is not required, but it provides a means to validate the document and is therefore strongly recommended.

A graph is, not surprisingly, denoted by a graph element. Nested inside a graph element are the declarations of nodes and edges. A node is declared with a node element, and an edge with an edge element. In GraphML there is no order defined for the appearance of node and edge elements.

    <graph id="G" edgedefault="directed">
      <node id="n0"/>
      <node id="n1"/>
      ...
      <node id="n10"/>
      <edge source="n0" target="n2"/>
      <edge source="n1" target="n2"/>
      ...
      <edge source="n8" target="n10"/>
    </graph>

Graphs in GraphML are mixed; in other words, they can contain directed and undirected edges at the same time. If no direction is specified when an edge is declared, the default direction is applied to the edge. The default direction is declared as the XML-Attribute edgedefault of the graph element. The two possible values for this XML-Attribute are directed and undirected. Note that the default direction must be specified. Optionally, an identifier for the graph can be specified with the XML-Attribute id. The identifier is used when it is necessary to reference the graph.

Nodes in the graph are declared by the node element. Each node has an identifier, which must be unique within the entire document, i.e., in a document there must be no two nodes with the same identifier. The identifier of a node is defined by the XML-Attribute id.

Edges in the graph are declared by the edge element. Each edge must define its two endpoints with the XML-Attributes source and target. The value of source, resp. target, must be the identifier of a node in the same document. Edges with only one endpoint, also called loops, selfloops, or reflexive edges, are defined by having the same value for source and target. The optional XML-Attribute directed declares whether the edge is directed or undirected. The value true declares a directed edge, the value false an undirected edge. If the direction is not explicitly defined, the default direction is applied to this edge as defined in the enclosing graph. Optionally, an identifier for the edge can be specified with the XML-Attribute id. The id XML-Attribute is used when it is necessary to reference the edge.
... <edge id="e1" directed="true" source="n0" target="n2"/> ...
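The direction rule described above (an edge's own directed attribute takes precedence, otherwise the graph's edgedefault applies) can be sketched in C++ as follows. This is only an illustration: the EdgeDecl structure, the Direction enumeration and the function names are hypothetical and are not part of the tree library or of any GraphML parser.

```cpp
#include <cassert>
#include <string>

// Hypothetical tri-state for the optional "directed" XML-Attribute of an edge.
enum class Direction { Unspecified, Directed, Undirected };

// Hypothetical in-memory form of an <edge> element.
struct EdgeDecl {
    std::string source;   // id of the source node
    std::string target;   // id of the target node
    Direction directed;   // Unspecified when the attribute is absent
};

// Returns true if the edge is directed. When the edge does not state a
// direction, the graph's edgedefault ("directed" or "undirected") is used.
bool isDirected(const EdgeDecl& edge, const std::string& edgeDefault) {
    assert(edgeDefault == "directed" || edgeDefault == "undirected");
    if (edge.directed == Direction::Unspecified) {
        return edgeDefault == "directed";
    }
    return edge.directed == Direction::Directed;
}

// A loop (selfloop) has the same node as source and target.
bool isLoop(const EdgeDecl& edge) {
    return edge.source == edge.target;
}
```

With edgedefault="undirected", an edge declared without the directed attribute is treated as undirected, while an edge carrying directed="true" stays directed regardless of the default.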

With the help of the extension GraphML-Attributes one can specify additional information of simple type for the elements of the graph. Simple type means that the information is restricted to scalar values, e.g. numerical values and strings. If you want to add structured content to graph elements, you should use the key/data extension mechanism of GraphML. GraphML-Attributes themselves are specialized data/key extensions. GraphML-Attributes must not be confused with XML-Attributes, which are a different concept.

A GraphML-Attribute is defined by a key element which specifies the identifier, name, type and domain of the attribute. The identifier is specified by the XML-Attribute id and is used to refer to the GraphML-Attribute inside the document. The name of the GraphML-Attribute is defined by the XML-Attribute attr.name and must be unique among all GraphML-Attributes declared in the document. The purpose of the name is that applications can identify the meaning of the attribute. Note that the name of the GraphML-Attribute is not used inside the document; the identifier is used for this purpose. The type of the GraphML-Attribute can be either boolean, int, long, float, double, or string. These types are defined like the corresponding types in the Java(TM) programming language. The domain of the GraphML-Attribute specifies for which graph elements the GraphML-Attribute is declared. Possible values include graph, node, edge, and all. It is possible to define a default value for a GraphML-Attribute. The text content of the default element defines this default value.
...
<key id="d0" for="node" attr.name="color" attr.type="string">
  <default>yellow</default>
</key>
...

The value of a GraphML-Attribute for a graph element is defined by a data element nested inside the element for the graph element. The data element has an XML-Attribute key, which refers to the identifier of the GraphML-Attribute. The value of the GraphML-Attribute is the text content of the data element. This value must be of the type declared in the corresponding key definition.

There can be graph elements for which a GraphML-Attribute is defined but no value is declared by a corresponding data element. If a default value is defined for this GraphML-Attribute, then this default value is applied to the graph element. In the above example no value is defined for the node with identifier n1 and the GraphML-Attribute with name color. Therefore this GraphML-Attribute has the default value, yellow, for this node. If no default value is specified, as for the GraphML-Attribute weight in the above example, the value of the GraphML-Attribute is undefined for the graph element. In the above example the value of the GraphML-Attribute weight is undefined for the edge with identifier e3.

To make it possible to implement optimized parsers for GraphML documents, meta-data can be attached as XML-Attributes to some GraphML elements. All XML-Attributes denoting meta-data are prefixed with parse. There are two kinds of meta-data: information about the number of elements and information about how specific data is encoded in the document.

For the first kind, information about the number of elements, the following XML-Attributes for the graph element are defined: the XML-Attribute parse.nodes denotes the number of nodes in the graph, the XML-Attribute parse.edges the number of edges. The XML-Attribute parse.maxindegree denotes the maximum indegree of the nodes in the graph and the XML-Attribute parse.maxoutdegree the maximum outdegree. For the node element the XML-Attribute parse.indegree denotes the indegree of the node and the XML-Attribute parse.outdegree the outdegree.

For the second kind, information about element encoding, the following XML-Attributes for the graph element are defined: if the XML-Attribute parse.nodeids has the value canonical, all nodes have identifiers following the pattern nX, where X denotes the number of occurrences of the node element before the current element. Otherwise the value of the XML-Attribute is free. The same holds for edges, for which the corresponding XML-Attribute parse.edgeids is defined, with the only difference that the identifiers of the edges follow the pattern eX. The XML-Attribute parse.order denotes the order in which node and edge elements occur in the document. For the value nodesfirst, no node element is allowed to occur after the first occurrence of an edge element. For the value adjacencylist, the declaration of a node is followed by the declaration of its adjacent edges. For the value free, no order is imposed.

The following example demonstrates the parse info meta-data on our running example:
<?xml version="1.0" encoding="UTF-8"?>
<!-- This file was written by the JAVA GraphML Library.-->
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
     http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
  <graph id="G" edgedefault="directed"
         parse.nodes="11" parse.edges="12"
         parse.maxindegree="2" parse.maxoutdegree="3"
         parse.nodeids="canonical" parse.edgeids="free"
         parse.order="nodesfirst">
    <node id="n0" parse.indegree="0" parse.outdegree="1"/>
    <node id="n1" parse.indegree="0" parse.outdegree="1"/>
    <node id="n2" parse.indegree="2" parse.outdegree="1"/>
    <node id="n3" parse.indegree="1" parse.outdegree="2"/>
    <node id="n4" parse.indegree="1" parse.outdegree="1"/>
    <node id="n5" parse.indegree="2" parse.outdegree="1"/>
    <node id="n6" parse.indegree="1" parse.outdegree="2"/>
    <node id="n7" parse.indegree="2" parse.outdegree="0"/>
    <node id="n8" parse.indegree="1" parse.outdegree="3"/>
    <node id="n9" parse.indegree="1" parse.outdegree="0"/>
    <node id="n10" parse.indegree="1" parse.outdegree="0"/>
    <edge id="edge0001" source="n0" target="n2"/>
    <edge id="edge0002" source="n1" target="n2"/>
    <edge id="edge0003" source="n2" target="n3"/>
    <edge id="edge0004" source="n3" target="n5"/>
    <edge id="edge0005" source="n3" target="n4"/>
    <edge id="edge0006" source="n4" target="n6"/>
    <edge id="edge0007" source="n6" target="n5"/>
    <edge id="edge0008" source="n5" target="n7"/>
    <edge id="edge0009" source="n6" target="n8"/>
    <edge id="edge0010" source="n8" target="n7"/>
    <edge id="edge0011" source="n8" target="n9"/>
    <edge id="edge0012" source="n8" target="n10"/>
  </graph>
</graphml>
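The counting meta-data in the listing above can be derived mechanically from the node and edge lists. The following C++ sketch shows one way to do so for a directed graph. The ParseInfo structure and the computeParseInfo function are hypothetical names introduced for this illustration only; they are not part of GraphML or of the tree library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container for the values of the parse.* XML-Attributes.
struct ParseInfo {
    std::size_t nodes = 0;          // parse.nodes
    std::size_t edges = 0;          // parse.edges
    std::size_t maxInDegree = 0;    // parse.maxindegree
    std::size_t maxOutDegree = 0;   // parse.maxoutdegree
    std::map<std::string, std::size_t> inDegree;   // parse.indegree per node
    std::map<std::string, std::size_t> outDegree;  // parse.outdegree per node
};

// Computes the counting meta-data for a directed graph given as a list of
// node ids and a list of (source, target) pairs.
ParseInfo computeParseInfo(
        const std::vector<std::string>& nodeIds,
        const std::vector<std::pair<std::string, std::string>>& edges) {
    ParseInfo info;
    info.nodes = nodeIds.size();
    info.edges = edges.size();
    for (const std::string& id : nodeIds) {
        info.inDegree[id] = 0;
        info.outDegree[id] = 0;
    }
    for (const auto& edge : edges) {
        // Both endpoints must be identifiers of declared nodes.
        assert(info.outDegree.count(edge.first) != 0);
        assert(info.inDegree.count(edge.second) != 0);
        ++info.outDegree[edge.first];
        ++info.inDegree[edge.second];
    }
    for (const std::string& id : nodeIds) {
        info.maxInDegree = std::max(info.maxInDegree, info.inDegree[id]);
        info.maxOutDegree = std::max(info.maxOutDegree, info.outDegree[id]);
    }
    return info;
}
```

Applied to the eleven nodes and twelve edges of the running example, this yields a maximum indegree of 2 and a maximum outdegree of 3, in agreement with the attributes of the graph element above.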

Work on GraphML was initiated in a workshop during the 2000 Graph Drawing Symposium in Williamsburg, and a proposal for the structural layer was presented at the 2001 Graph Drawing Symposium in Vienna. Since then, extensions have been provided that support basic attribute data types and the embedding of information for light-weight parsers. The next major steps will be extensions for abstract graph layout information and templates to transform such information into a variety of graphics formats. Software to help add GraphML support to several popular tools and libraries is under development.


5.2

The Serialization Method

5.2.1

Functional description

Serialization is the process of converting a data structure or object into a format that can be stored (for example, in a file or memory buffer, or transmitted across a network connection link) and "resurrected" later in the same or another computer environment. When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. For many complex objects, such as those that make extensive use of references, this process is not straightforward. It lets you take an object or group of objects, put them on a disk or send them through a wire or wireless transport mechanism, then later, perhaps on another computer, reverse the process: resurrect the original object(s). The basic mechanisms are to flatten object(s) into a one-dimensional stream of bits, and to turn that stream of bits back into the original object(s).
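As a minimal sketch of the flattening and restoring steps described above, the following C++ fragment serializes a small record into a byte sequence and reconstructs it again. The Record type, the helper names and the byte layout (32-bit big-endian integers followed by the raw characters of the label) are assumptions made for this illustration; they do not describe the format used by the tree library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical object to be serialized.
struct Record {
    std::uint32_t id;
    std::string label;
};

// Appends a 32-bit value in big-endian order, independent of the host CPU.
static void putU32(std::vector<std::uint8_t>& out, std::uint32_t value) {
    for (int shift = 24; shift >= 0; shift -= 8) {
        out.push_back(static_cast<std::uint8_t>((value >> shift) & 0xFFu));
    }
}

// Reads a 32-bit big-endian value and advances the read position.
static std::uint32_t getU32(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    assert(pos + 4 <= in.size());
    std::uint32_t value = 0;
    for (int i = 0; i < 4; ++i) {
        value = (value << 8) | in[pos++];
    }
    return value;
}

// Flattens the record into a one-dimensional stream of bytes.
std::vector<std::uint8_t> serialize(const Record& record) {
    std::vector<std::uint8_t> out;
    putU32(out, record.id);
    putU32(out, static_cast<std::uint32_t>(record.label.size()));
    out.insert(out.end(), record.label.begin(), record.label.end());
    return out;
}

// Turns the stream of bytes back into an equivalent record.
Record deserialize(const std::vector<std::uint8_t>& in) {
    std::size_t pos = 0;
    Record record;
    record.id = getU32(in, pos);
    const std::uint32_t length = getU32(in, pos);
    assert(pos + length <= in.size());
    record.label.assign(in.begin() + pos, in.begin() + pos + length);
    return record;
}
```

Because every integer is written byte by byte in a fixed order, the resulting stream does not depend on the memory layout of the machine that produced it.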

Serialization does not write class variables because they are not part of the state of the object. It also does not transmit the object's class object (e.g., its method dictionary) because the program deserializing the stream must load that class. Each serializable or externalizable class has a description of its serialization fields and methods. The process of serializing an object is also called deflating or marshalling an object. The opposite operation, extracting a data structure from a series of bytes, is deserialization (also called inflating or unmarshalling).

Serialization provides:
- a method of persisting objects which is more convenient than writing their properties to a text file on disk and re-assembling them by reading this back in;
- a method of remote procedure calls, e.g., as in SOAP;
- a method for distributing objects, especially in software componentry such as COM, CORBA, etc.;
- a method for detecting changes in time-varying data.

For some of these features to be useful, architecture independence must be maintained. For example, for maximal use of distribution, a computer running on a different hardware architecture should be able to reliably reconstruct a serialized data stream, regardless of endianness. This means that the simpler and faster procedure of directly copying the memory layout of the data structure cannot work reliably for all architectures. Serializing the data structure in an architecture-independent format means that we do not suffer from the problems of byte ordering, memory layout, or simply different ways of representing data structures in different programming languages.

Inherent to any serialization scheme is that, because the encoding of the data is by definition serial, extracting one part of the serialized data structure requires that the entire object be read from start to end, and reconstructed. In many applications this linearity is an asset, because it enables simple, common I/O interfaces to be utilized to hold and pass on the state of an object. In applications where higher performance is an issue, it can make sense to expend more effort to deal with a more complex, non-linear storage organization.

Even on a single machine, primitive pointer objects are too fragile to save, because the objects to which they point may be reloaded to a different location in memory. To deal with this, the serialization process includes a step called unswizzling or pointer unswizzling, and the deserialization process includes a step called pointer swizzling.

Since both serializing and deserializing can be driven from common code (for example, the Serialize function in Microsoft Foundation Classes), it is possible for the common code to do both at the same time, and thus 1) detect differences between the objects being serialized and their prior copies, and 2) provide the input for the next such detection. It is not necessary to actually build the prior copy, since differences can be detected "on the fly". This is a way to understand the technique called differential execution. It is useful in the programming of user interfaces whose contents are time-varying: graphical objects can be created, removed, altered, or made to handle input events without necessarily having to write separate code to do those things.

Serialization, however, breaks the opacity of an abstract data type by potentially exposing private implementation details. To discourage competitors from making compatible products, publishers of proprietary software often keep the details of their programs' serialization formats a trade secret. Some deliberately obfuscate or even encrypt the serialized data. Yet, interoperability requires that applications be able to understand each other's serialization formats. Therefore, remote method call architectures such as CORBA define their serialization formats in detail.

Serialization is a mechanism by which you can save the state of an object by converting it to a byte stream. There are two types: binary serialization and XML serialization. The serializable interface is an empty interface: it does not contain any methods, so there are no methods to implement. Whenever an object is to be sent over the network, it needs to be serialized. Likewise, if the state of an object is to be saved, the object needs to be serialized.

The method used for the serialization of a Tree object in graphml format is Tree::saveGraphml(). The method takes as input a variable of type string containing the name of the file in which the tree is saved. After appending the graphml extension to the name of the file, the file is opened for output. A graphml file is an XML file. After writing the initial lines, a <graph> element is opened, containing the ID of the graph. For each node (vertex) of the graph, a <node> element is created, containing at least the ID of the node. Usually, elements containing representation information for the node are added as well. For each edge (arc) of the tree, an <edge> element is created, containing at least the IDs of the source and target nodes. Once the node and edge lists are exhausted, the <graph> and <graphml> elements are closed.

5.2.2
The method code

void Tree::saveGraphml(string fileName)
{
    string treeFileSpec = fileName.append(".graphml");
    ofstream mlFile;
    TreeNode * currNode;
    mlFile.open(treeFileSpec, ios::out);
    if (!mlFile.is_open())
    {
        return;
    }
    else
    {
        // header
        mlFile << "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" << endl;
        mlFile << "<graphml xmlns=\"http://graphml.graphdrawing.org/xmlns\"" ;
        mlFile << " xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"" ;
        mlFile << " xmlns:y=\"http://www.yworks.com/xml/graphml\"" ;
        mlFile << " xmlns:yed=\"http://www.yworks.com/xml/yed/3\"" ;
        mlFile << " xsi:schemaLocation=\"http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd\">" << endl ;
        mlFile << "<key for=\"node\" id=\"d1\" yfiles.type=\"nodegraphics\"/>" ;
        mlFile << "<graph id=\"" << this->theId << "\" edgedefault=\"directed\">" << endl ;

        // nodes and arcs, visited in pre-order
        PreOrderIter prIt = this->startPre();
        prIt.skipChildren(false);
        while (prIt != this->endPre())
        {
            currNode = prIt.theNode;
            mlFile << "<node id=\"" << currNode->nodeId << "\">" << endl;
            mlFile << " <data key=\"d1\">" << endl;
            mlFile << " <y:ShapeNode>" << endl;
            mlFile << " <y:Geometry height=\"30.0\" width=\"30.0\" />" << endl;
            mlFile << " <y:Fill color=\"#FFCC00\" transparent=\"false\"/>" << endl;
            mlFile << " <y:BorderStyle color=\"#000000\" type=\"line\" width=\"1.0\"/>" << endl;
            mlFile << " <y:NodeLabel alignment=\"center\" autoSizePolicy=\"content\" ";
            mlFile << " fontFamily=\"Dialog\" fontSize=\"13\" fontStyle=\"plain\" ";
            mlFile << " hasBackgroundColor=\"false\" hasLineColor=\"false\" height=\"20\" modelName=\"internal\" ";
            mlFile << " modelPosition=\"c\" textColor=\"#000000\" visible=\"true\" width=\"11.23\" x=\"9\" y=\"5\"";
            // nodes with an ID below 1000 are labelled with their ID
            if (currNode->nodeId < 1000)
            {
                mlFile << ">" << currNode->nodeId << "</y:NodeLabel>" << endl;
            }
            else
            {
                mlFile << ">" << "</y:NodeLabel>" << endl;
            }
            // nodes with an ID below 1000 are drawn as rectangles, the others as diamonds
            if (currNode->nodeId < 1000)
            {
                mlFile << " <y:Shape type=\"rectangle\"/>" << endl;
            }
            else
            {
                mlFile << " <y:Shape type=\"diamond\"/>" << endl;
            }
            mlFile << " </y:ShapeNode>" << endl;
            mlFile << " </data> " << endl ;
            mlFile << "</node>" << endl;
            // every node except the top-level ones gets an edge from its parent
            if (currNode->parent != this->theHead)
                mlFile << "<edge id=\"e" << currNode->nodeId << "\" source=\"" << currNode->parent->nodeId << "\" target=\"" << currNode->nodeId << "\"/>" << endl;
            ++prIt;
        }

        // footer
        mlFile << "</graph>" << endl;
        mlFile << "</graphml>" << endl;
        mlFile.flush();
        mlFile.close();
    }
}
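The structure of the output produced by the method above is easier to see in a reduced form. The sketch below writes only the structural part of the document (no yEd node graphics) to an arbitrary output stream. SimpleNode, writeGraphml and toGraphml are hypothetical names introduced for this illustration; the sketch assumes that nodes are supplied in pre-order and that a negative parent ID marks a top-level node, which is not how the library itself represents them.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a tree node; parentId < 0 marks a top-level node.
struct SimpleNode {
    int id;
    int parentId;
};

// Writes the nodes as a GraphML document. As in the method above, the edge
// from the parent is emitted right after the node it leads to.
void writeGraphml(std::ostream& os, const std::string& graphId,
                  const std::vector<SimpleNode>& nodes) {
    os << "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    os << "<graphml xmlns=\"http://graphml.graphdrawing.org/xmlns\">\n";
    os << "<graph id=\"" << graphId << "\" edgedefault=\"directed\">\n";
    for (const SimpleNode& node : nodes) {
        assert(node.id >= 0);
        os << "<node id=\"" << node.id << "\"/>\n";
        if (node.parentId >= 0) {
            os << "<edge id=\"e" << node.id << "\" source=\"" << node.parentId
               << "\" target=\"" << node.id << "\"/>\n";
        }
    }
    os << "</graph>\n";
    os << "</graphml>\n";
}

// Convenience wrapper returning the document as a string.
std::string toGraphml(const std::string& graphId,
                      const std::vector<SimpleNode>& nodes) {
    std::ostringstream buffer;
    writeGraphml(buffer, graphId, nodes);
    return buffer.str();
}
```

Writing to a std::ostream rather than directly to a file makes the routine easy to check against an in-memory buffer before it is pointed at an std::ofstream.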


5.3

The Text File Used for Input


The text file used for input contains on each line the ID of a node, followed by a dash and a comma-separated list of the IDs of its children. In our case, the input text file is called one_tree.txt and has the following content:
1-2,15
2-3,4,5,13
5-6,7,8,10
8-9
10-11
15-16,17,18,25,26
18-19,20,21,24
21-22,23
26-17,28,31,32
28-29,30
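A single line of this format can be split into the node ID and the list of child IDs as sketched below. NodeLine and parseNodeLine are hypothetical names used only for this illustration; the parsing code of the application itself is not reproduced here and may differ.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result of parsing one input line such as "2-3,4,5,13".
struct NodeLine {
    int id;
    std::vector<int> children;
};

// Splits a line at the dash, then splits the remainder at the commas.
NodeLine parseNodeLine(const std::string& line) {
    const std::string::size_type dash = line.find('-');
    assert(dash != std::string::npos);  // every line must contain a dash
    NodeLine result;
    result.id = std::stoi(line.substr(0, dash));
    std::istringstream rest(line.substr(dash + 1));
    std::string item;
    while (std::getline(rest, item, ',')) {
        if (!item.empty()) {
            result.children.push_back(std::stoi(item));
        }
    }
    return result;
}
```

For the second line of one_tree.txt, the function returns the node ID 2 together with the children 3, 4, 5 and 13.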


AN APPLICATION USING THE TREE LIBRARY

6.1

The Binary Embedding Application


The binary embedding of hierarchical taxonomies is an application that can be used in many domains, such as systematic biology, medicine, market research and artificial intelligence. A hierarchical taxonomy is a tree structure of classifications for a given set of objects. At the top of this structure is a single classification, the root node, that applies to all objects. Nodes below this root are more specific classifications that apply to subsets of the total set of classified objects. The progress of reasoning proceeds from the general to the more specific. In scientific taxonomies, a conflative term is always a polyseme.

The application opens a file structured in a defined way, parses it, and creates an equivalent representation of it in memory. The taxonomy stored in memory can then be embedded using a raw embedding algorithm, resulting in a binary tree. A binary embedding has the following definition: Let $T = (V, A, v_0)$ be a rooted tree (hierarchical taxonomy). A binary embedding of $T$ is an application $\varphi : V \to B^n$ such that for any pair $(v_1, v_2) \in V \times V$, $(v_1, v_2) \Leftrightarrow (\varphi(v_1), \varphi(v_2))$. This binary tree keeps the information about the hierarchy of the nodes, and it can be exported to a GraphML file, to be visualized and/or edited later in any graph editing software that supports the GraphML file type. Hierarchical taxonomies have become an important tool in the organization of knowledge in many domains: the US Patent Office class codes, the Library of Congress catalog, and even the ACM Computing Classification System are hierarchical in structure. Taxonomies structured as hierarchies make it easier to navigate and access the data, as well as to maintain and enrich it.


6.2

Classes and methods used

The Embedding Algorithm contains three main classes that are represented in the class diagram below:

Fig. 6.1

First class: CAboutDlg contains information about the dialog used for the About box of the application.
class CAboutDlg : public CDialogEx
{
public:
    CAboutDlg();

    // Dialog Data
    enum { IDD = IDD_ABOUTBOX };

protected:
    virtual void DoDataExchange(CDataExchange* pDX);

    // Implementation
protected:
    DECLARE_MESSAGE_MAP()
};

Second class: CembAlgApp contains information to define the class behaviors for the application.
class CembAlgApp : public CWinApp
{
public:
    CembAlgApp();

public:
    virtual BOOL InitInstance();

    string origTreeFileSpec;
    string embedTreeFileSpec;
    RootedTree * currTree;
    RootedTree * origTree;
    RootedTree * embedTree;

    DECLARE_MESSAGE_MAP()
};

extern CembAlgApp theApp;

The third class: CembAlgDlg contains methods that implement the processing of the input file, the saving of the current tree in graphml format and the raw embedding. It is defined in the following code:
class CembAlgDlg : public CDialogEx
{
    // Construction
public:
    CembAlgDlg(CWnd* pParent = NULL); // standard constructor

    // Dialog Data
    enum { IDD = IDD_EMBALG_DIALOG };

protected:
    virtual void DoDataExchange(CDataExchange* pDX);

protected:
    HICON m_hIcon;

    virtual BOOL OnInitDialog();
    afx_msg void OnSysCommand(UINT nID, LPARAM lParam);
    afx_msg void OnPaint();
    afx_msg HCURSOR OnQueryDragIcon();
    DECLARE_MESSAGE_MAP()

public:
    afx_msg void OnBnClickedFileOpen();
    afx_msg void OnBnClickedExit();
    afx_msg void OnBnClickedSaveBinary();
    afx_msg void OnBnClickedClearLog();
    afx_msg void OnBnClickedParseFile();
    afx_msg void OnBnClickedEmbedTree();
    CListBox m_event_log;
};


CONCLUSIONS AND FURTHER DEVELOPMENTS


Trees are remarkably useful and powerful data structures with many applications, even though they may seem complex at first. In this paper I covered some examples of how to use trees. The tree library I have created can be used in the Binary Embedding for Hierarchical Taxonomies application, and in many other applications. The purpose of this library is to organize data in a well-structured form, namely the n-ary tree. To implement the library I have used different classes and iterators. There are four main classes, the most important being the Rooted Tree class. The library provides four different iteration schemes. I have also implemented several operations, such as initialisation, tree traversal and node insertion.

Further developments include, but are not limited to:
- Implementing a breadth-first iterator
- Implementing operations on trees, like appendChild, insertSubtree, replace, moveAfter
- Implementing invariant computations like getDepth, getMaxDepth, getNumberOfSiblings
- Implementing allocation and deallocation routines for better memory management


