Management Systems
Tim Henderson and Steve Johnson
Case Western Reserve University
December 17, 2009
Abstract
Version control repositories contain vast amounts of data about the history of a
project. To date, people have not tried to extract much useful information from these
repositories, largely because data formats vary and access time is very slow. This
paper presents the foundations for a system which aims to vastly improve read times
for specific kinds of repository data and make queries against this repository easier to
write. In short, we want to treat the repository like a database as much as possible,
including reasonable access time and a powerful query language.
This paper focuses on storing and querying path data in the repository, i.e. parent-
child relationships between nodes. Specifically, it describes a system for storing a par-
allel, efficient representation of the connectivity graph, as well as an efficient algorithm
for finding all nodes on any path between two nodes.
We were not able to rigorously test the real-world performance of our algorithms
due to time constraints and changing designs, but we have opened up many areas for
future research.
1 Introduction
Since the invention of computers, code bases have grown progressively larger and their
development has involved more people working in parallel. Early Source Code Management
(SCM) systems were created to deal with this growth, with varying degrees of success. Today,
many code bases consist of hundreds of thousands of lines of code, with thousands of
changesets in the repository.
As software projects become larger, it becomes more difficult to answer questions about
the source code, such as where bugs are located or whether duplicate code is being written.
SCM systems have become nearly ubiquitous in software development environments, and the
information stored in them can help answer such questions.
The data in these repositories contains a wealth of information about the source
code. Currently, that information is used only to perform transformations on text files, not
to provide information about those files. By treating a repository as a database and providing
efficient, intuitive ways to query it in a reasonable amount of time, we can discover new
ways to measure progress, find bugs, and identify general qualities in a software project.
Before any information can be read from a repository, the search space must be defined.
Because the repository is a directed acyclic graph, one cannot simply specify a linear chain
of commits: many of those commits lie on branches and merges made by different
committers who contribute equally. A more sophisticated and powerful way to define the
search space is to specify two end points and find all nodes that are on any path between
them.
This paper focuses exclusively on defining the search space, or “subgraph selection.” To
do this in a reasonable amount of time, the entire repository must be indexed. This index
must be updated incrementally for each new commit and can be optimized significantly due
to the nature of version control repositories.
Section 2 outlines some special characteristics of repository graphs, which helps to deter-
mine what is possible, what is efficient, and what is correct. Section 3 describes an efficient
way to create and maintain an index of the parent-child relationships in a repository graph.
Section 4 describes an algorithm for subgraph selection on this index. Section 5 describes the
exact file structure of the graph representation introduced in section 3. Section 7 outlines
various approaches we tried and abandoned. Finally, section 8 gives some conclusions of our
current research.
2 Assumptions
Since a repository is a special case of a graph, we can make the following assumptions based
on the nature of version control.
2. The commit time of a node is always less than that of its children
3. A directed graph with all arrows parent → child or child → parent is acyclic
2.2 Commit Time
Since a new commit is based on a previous commit, that previous commit must have been
made earlier in time. Therefore, the new commit gets a later time stamp than its parent.
This means that a sorting of commits by real-world commit time is also a topological
sort. However, we do not use commit time as a topological sort because individual users’
clocks may differ.
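A topological order that does not depend on clocks can be computed from the parent-child edges alone. The following is a minimal sketch using Kahn's algorithm [3]; the function name `topological_positions` and the input format (a dictionary mapping each commit ID to its list of parent IDs) are assumptions made for illustration, not part of the system described here.

```python
from collections import deque


def topological_positions(parents):
    """Return {commit: position} such that every parent precedes its children.

    `parents` maps each commit ID to a list of its parent IDs
    (hypothetical input format; every parent must also be a key).
    """
    children = {commit: [] for commit in parents}
    indegree = {commit: len(ps) for commit, ps in parents.items()}
    for commit, ps in parents.items():
        for p in ps:
            children[p].append(commit)

    # Start from commits with no parents (the root of the repository).
    ready = deque(sorted(c for c, d in indegree.items() if d == 0))
    position = {}
    while ready:
        commit = ready.popleft()
        position[commit] = len(position)
        for child in children[commit]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(position) != len(parents):
        raise ValueError("graph contains a cycle")
    return position


# A small diamond: a is the root, b and c branch from it, d merges them.
history = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
positions = topological_positions(history)
```

Any order produced this way satisfies the property required of commit time (parents before children) without trusting individual users' clocks.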
2.5 In-Edges
In all existing SCMs, a commit is a set of changes to a previous state of the
repository. If those changes were made directly by the user, then the commit has one in-edge,
from the previous commit. If those changes are the result of a merge from another
branch, then the commit has two in-edges, one for each branch.
Because every commit has at most two in-edges, E ≤ 2V. This assumption allows us to state
that O(E) = O(V) for all possible repositories, although E will be higher on average than V.
3 Graph Representation
Most SCMs store repository information in a way that is optimized for reading and writing
as little information as possible. In order to optimize queries and to put an abstraction layer
over the SCM implementation, we create a separate graph representation of the repository.
Since a repository is a sparse DAG with a low upper bound on incoming edges, the most
efficient representation is adjacency lists.
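As an illustration of the adjacency-list idea, the sketch below keeps one list of parents and one list of children per commit and is updated one commit at a time. The class name `RepoGraph` and its methods are hypothetical, and the sketch assumes that parents are always added before their children.

```python
from dataclasses import dataclass, field


@dataclass
class RepoGraph:
    """Adjacency-list view of a repository (illustrative names only)."""
    parents: dict = field(default_factory=dict)
    children: dict = field(default_factory=dict)

    def add_commit(self, commit, parent_ids=()):
        # A commit has one in-edge (ordinary commit) or two (merge).
        if len(parent_ids) > 2:
            raise ValueError("a commit has at most two in-edges")
        self.parents[commit] = list(parent_ids)
        self.children[commit] = []
        for p in parent_ids:
            self.children[p].append(commit)  # parent must already exist

    def edge_count(self):
        return sum(len(ps) for ps in self.parents.values())


graph = RepoGraph()
graph.add_commit("a")
graph.add_commit("b", ["a"])
graph.add_commit("c", ["a"])
graph.add_commit("d", ["b", "c"])
```

Because each commit stores at most two parents, the total storage is proportional to the number of commits.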
3. Branch: collapsed node terminates at parent, new collapsed nodes start at each child
4. Merge: collapsed nodes terminate at parents, new collapsed node starts at child
Examples of each case are shown on the next page. Any two of these cases can coincide.
b = E − V + h    (1)
where h is the number of leaf nodes. This is because each branch-merge pair creates an
extra edge, and every leaf represents an unmerged branch. Now we can calculate the number
of merges as:
m = b − h + 1    (2)
  = E − V + 1    (3)
because each leaf is an unmerged branch. The extra 1 accounts for the root node. In
reality, h (the number of leaf nodes) is negligible since most repositories have only a handful
of leaf nodes, each one representing a branch of development.
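Equations (1) through (3) can be checked on a small example. The sketch below counts branches, merges, and leaves directly and compares them with the edge and vertex counts; it assumes a single root and that every branch node has exactly two children. The function name `graph_counts` and the example history are made up for illustration.

```python
def graph_counts(parents):
    """Return (V, E, h, b, m) for a graph given as {commit: [parent, ...]}."""
    V = len(parents)
    E = sum(len(ps) for ps in parents.values())
    child_count = {commit: 0 for commit in parents}
    for ps in parents.values():
        for p in ps:
            child_count[p] += 1
    h = sum(1 for n in child_count.values() if n == 0)   # leaves
    b = sum(1 for n in child_count.values() if n == 2)   # branches
    m = sum(1 for ps in parents.values() if len(ps) == 2)  # merges
    return V, E, h, b, m


# 1 branches into 2 and 3; 6 merges the two lines and branches again into 7, 8.
history = {1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [4, 5], 7: [6], 8: [6]}
V, E, h, b, m = graph_counts(history)
```

Here V = 8, E = 8, and h = 2, so equation (1) gives b = 2 and equation (3) gives m = 1, which agrees with the direct counts.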
At most, the reduced graph will contain a node for every branch, merge, leaf, and root.
Some nodes might be both a branch and a merge, but it is impossible to determine how
many exist using only edge and vertex counts.
The expression E − V is important for our analysis, so it is relevant to note that E − V
is never more than 0.5V . The proof is as follows.
As shown in the beginning of the section, each branch-merge pair creates one extra edge.
This means that there can be at most one extra edge for half of the nodes, those that
represent the merge of a branch-merge pair. Therefore, the following holds:
[Figure: (e) An example graph demonstrating cases 4 and 5. (f) Reduced graph calculated from (e).]
We ran a script over the history of a prominent open source project, Mercurial, to
calculate the number of branches and merges. The repository “hg-stable” contains 1081
branches out of 10862 commits, or roughly 10%. There are an equal number of merges.
Therefore, roughly 80% of all commits can be skipped during a graph traversal, which is
a significant speed improvement.
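To make the saving concrete, the sketch below groups commits into maximal linear chains, so that a traversal only has to step between chains. It is a batch illustration under our own assumptions (commits are supplied in topological order, and the names `collapse`, `parents`, and `order` are hypothetical), not the incremental procedure used to maintain the reduced graph.

```python
def collapse(parents, order):
    """Group commits into collapsed nodes (maximal linear chains).

    parents: {commit: [parent, ...]}; order: commits in topological order.
    """
    child_count = {commit: 0 for commit in parents}
    for ps in parents.values():
        for p in ps:
            child_count[p] += 1

    chain_of = {}  # commit -> index of the chain that contains it
    chains = []    # each chain is a list of commits, head first
    for commit in order:
        ps = parents[commit]
        if len(ps) == 1 and child_count[ps[0]] == 1:
            # Ordinary commit whose parent is not a branch: extend the chain.
            index = chain_of[ps[0]]
            chains[index].append(commit)
        else:
            # Root, merge, or child of a branch: start a new collapsed node.
            index = len(chains)
            chains.append([commit])
        chain_of[commit] = index
    return chains


history = {1: [], 2: [1], 3: [2], 4: [3], 5: [4], 6: [4],
           7: [5], 8: [6], 9: [7, 8], 10: [9], 11: [10]}
chains = collapse(history, sorted(history))
```

In this example eleven commits reduce to four collapsed nodes: `[1, 2, 3, 4]`, `[5, 7]`, `[6, 8]`, and `[9, 10, 11]`.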
3.5.3 Splitting a Collapsed Node
If a commit A is added which inherits from a commit in the middle of a collapsed node C,
then the collapsed node must be split to maintain correctness. To accomplish this task, we
can find the new node’s parent P in C, copy the node IDs after P to a new collapsed node
D, remove the commits after P from C, and update C to point to A and D as its children.
This case should arise often for collaborative projects. This might happen locally if a user
writes a series of commits, then branches from an earlier revision to try a different strategy.
The behavior of this case is shown in figure 4.
[Figure 4: Splitting. (a) Before. (b) After.]
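The splitting steps can be sketched as follows. The `CollapsedNode` class here is a deliberately minimal stand-in (only a commit list and a child list), and the function name `split` is hypothetical; a full implementation would also have to update parent pointers and the on-disk structure.

```python
from dataclasses import dataclass, field


@dataclass
class CollapsedNode:
    commits: list                                  # commit IDs, head first
    children: list = field(default_factory=list)   # child collapsed nodes


def split(node, parent_commit, new_commit):
    """Split `node` (C) after `parent_commit` (P) and attach `new_commit` (A)."""
    cut = node.commits.index(parent_commit) + 1
    if cut == len(node.commits):
        raise ValueError("parent is the tail of the node; no split is needed")
    tail = CollapsedNode(node.commits[cut:], node.children)  # D: commits after P
    added = CollapsedNode([new_commit])                      # node holding A
    node.commits = node.commits[:cut]                        # C keeps head..P
    node.children = [added, tail]                            # C -> A and D
    return added, tail


# A chain of five commits; commit 6 is then based on commit 3.
c = CollapsedNode([1, 2, 3, 4, 5])
a, d = split(c, 3, 6)
```

After the call, `c` holds commits 1 to 3 and points to two children: the new node containing commit 6 and the node `d` containing commits 4 and 5.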
4 Path Selection
The most basic part of a SourceQL query is subgraph selection, similar to the XPath part
of an XQuery for expression. Subgraph selection is done using a modified version of XPath
syntax. XPath has been used elsewhere as a query language for directed graphs [1].
The most difficult XPath queries are expressions of the form
A//B, where A and B are nodes. A subgraph selection finds the
subgraph which is the flow network between A and B, i.e. all nodes
which are on a path between A and B.
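function select_path(V, start_node, end_node):
    visited = array(V.length)
    V_2 = empty_set()
    visit_node(start_node, end_node, visited, V_2)
    return V_2

# The header and opening checks of visit_node are reconstructed from the
# recursive call below and from the pruning rule described for figure 5.
function visit_node(node, end_node, visited, subgraph):
    visited[node] = True
    if node == end_node:
        subgraph.add(node)
        return True
    if node.sort_position > end_node.sort_position:
        return False
    on_path = False
    for child in node.children:
        if child in subgraph:
            on_path = True
        else:
            if not visited[child]:
                r = visit_node(child, end_node, visited, subgraph)
                on_path = (on_path OR r)
    if on_path:
        subgraph.add(node)
        return True
    else:
        return False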
An example of this algorithm is shown in figure 5. Node labels represent topological sort
position. Bold edges were visited by the search, and dotted edges were not visited. In this
example, the edge 19-20 was not visited because 19 is greater than B’s sort position of 18, so
none of its children were traversed. The final output of this algorithm will include all nodes
except 11, 15, 19, and 20.
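The search can also be written as a short runnable sketch. The version below is our own reading of the algorithm, not a verbatim transcription: the `Node` class, the use of sets in place of the visited array, and the small example graph are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality keeps nodes hashable
class Node:
    sort_position: int
    children: list = field(default_factory=list)


def select_path(start, end):
    """Return the sort positions of all nodes on any path from start to end."""
    visited, subgraph = set(), set()

    def visit(node):
        visited.add(node)
        if node is end:
            subgraph.add(node)
            return True
        # Nodes sorted after the end node cannot reach it, so stop here.
        if node.sort_position > end.sort_position:
            return False
        on_path = False
        for child in node.children:
            if child in subgraph:
                on_path = True
            elif child not in visited:
                on_path = visit(child) or on_path
        if on_path:
            subgraph.add(node)
        return on_path

    visit(start)
    return sorted(n.sort_position for n in subgraph)


# Edges: 0->1, 0->2, 1->3, 2->3, 2->4, 3->5
n = [Node(i) for i in range(6)]
n[0].children = [n[1], n[2]]
n[1].children = [n[3]]
n[2].children = [n[3], n[4]]
n[3].children = [n[5]]
selected = select_path(n[0], n[3])
```

Selecting from node 0 to node 3 yields nodes 0, 1, 2, and 3. Node 4 is reached but rejected because its sort position exceeds that of the end node, and node 5 is never visited. For very deep histories the recursion would need to be replaced with an explicit stack.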
5 File Structure
In order to efficiently access the information stored in the reduced graph representation, we
have designed a domain-specific file structure for it. The challenge associated
with the collapsed graph is that, since it represents the full graph, we must be able to efficiently
function select_path_optimized(start_node, end_node):
    visited = array(|V|)
    V_2 = empty_set()
    ...

function visit_node(collapsed_node, end_node, visited, subgraph):
    ...
    on_path = False
    for child in collapsed_node.children:
        if child in subgraph:
            on_path = True
        else:
            if not visited[child]:
                r = visit_node(child, end_node, visited, subgraph)
                on_path = (on_path OR r)
    if on_path:
        subgraph.add(collapsed_node)
        return True
    else:
        return False
and all nodes before this node. From the other collapsed nodes we will select all the nodes that
they represent. Since our storage structure is a sorted list, the asymptotic time for the
select-all is linear in the number of nodes being selected.
6 Algebraic Operators
Thus far we have defined an implementation of a Subgraph Select Operator. This operator
takes two nodes and outputs all nodes that lie on a path between those two nodes. This is the
most basic operator in SourceQL. In the following section, we will define an algorithm which
outputs a subgraph as a list sorted by the nodes’ given topological sort order. Using the
pre-sorted data, it is possible to define three set operators on top of subgraph select that run
in linear time: subgraph union, subgraph intersection, and subgraph difference.
6.2 Subgraph Union
Subgraph Union, A ∪ B, where both A and B represent subgraphs, is defined as the union
of the sets A.V and B.V, where A.V means the vertices of A. Since we can ensure
that both A.V and B.V are sorted, we can simply use a merge to construct the union. The
resulting algorithm runs in linear time, as do the algorithms for intersection and
difference.
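A merge-based sketch of the three operators is given below. It assumes each subgraph is supplied as a list of topological sort positions in increasing order with no duplicates; the function names are illustrative only.

```python
def union(a, b):
    """Merge two sorted lists, keeping one copy of shared elements."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif a[i] > b[j]:
            out.append(b[j])
            j += 1
        else:
            out.append(a[i])
            i += 1
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def intersection(a, b):
    """Keep only elements present in both sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            out.append(a[i])
            i += 1
            j += 1
    return out


def difference(a, b):
    """Keep elements of sorted list a that do not occur in sorted list b."""
    out, i, j = [], 0, 0
    while i < len(a):
        if j == len(b) or a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            i += 1
            j += 1
    return out


a_v = [1, 2, 4, 7]
b_v = [2, 3, 7, 9]
```

Each function advances through both inputs at most once, so the running time is linear in |A.V| + |B.V|, and the output is again sorted and can be fed to another operator.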
7.1.1 Encoding
let |A| be the length of the array
[Figure 9: An example tree and its encoding into an array; an X in the tree represents a
nonexistent node.]
If we consider each left and right subtree as a separate tree, we can apply the same
calculation to find their locations in the array. If p is the position of the parent node, then
the left subarray is defined as [0, p − 1] and the right is defined as [p + 1, |A| − 1], where |A| is
the length of the parent array. See figure 9 for an example. This encoding has the property
that all of the left children come before the root of the subarray and all of the right children
come after the root. Because the array is partitioned recursively in this manner, it will always be
in sorted order if we discard the nulls during a sequential traversal. Therefore, this encoding
satisfies the required property.
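The recursive partitioning can be sketched as follows. We assume the root of each subarray is placed at its middle position and that the array has room for a complete tree of the given height; the function `encode`, the nested-tuple tree format, and the example tree are illustrative choices rather than the exact on-disk layout.

```python
def encode(tree, height):
    """Place a binary search tree into an array of size 2**height - 1.

    tree is None or a tuple (key, left_subtree, right_subtree).
    Missing nodes are left as None in the array.
    """
    size = 2 ** height - 1
    array = [None] * size

    def place(node, lo, hi):
        if node is None:
            return
        if lo > hi:
            raise ValueError("tree is deeper than the given height")
        key, left, right = node
        p = (lo + hi) // 2          # root sits in the middle of its subarray
        array[p] = key
        place(left, lo, p - 1)      # left subtree goes before the root
        place(right, p + 1, hi)     # right subtree goes after the root

    place(tree, 0, size - 1)
    return array


def leaf(key):
    return (key, None, None)


tree = (6,
        (4, (2, leaf(1), leaf(3)), leaf(5)),
        (9, (8, leaf(7), None), leaf(10)))
encoded = encode(tree, 4)
in_order = [key for key in encoded if key is not None]
```

Reading `encoded` from left to right and skipping the empty slots visits the keys in increasing order, which is the property the encoding is meant to provide.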
use the AVL tree over the sorted list when our sorted list has O(1) insert time? The only
reason the sorted list performs better than the AVL tree is that the inputs are monotonically
increasing, i.e. they are already sorted. If we relax this constraint and do not assume that
the inputs will be monotonically increasing, the AVL tree will perform better than the sorted
list, since insertion into the list will be O(N) while insertion into the AVL tree remains O(lg N).
Finally, due to our novel encoding of the AVL tree, our select of all nodes between two nodes
A and B still runs in linear time.
8 Conclusions
8.1 Work to Date
We began our research by realizing that version control repositories contain a lot of data, and
that it might be useful to be able to query that data as if it were a database. We recognized
that it is not currently easy to do that due to limitations in current SCMs, so we researched
ways to make the information contained in repositories faster and more intuitive to access.
We began by defining classes of queries we thought might be useful. We tried to come up
with as many use cases as possible, and eventually began to see patterns in the requirements
of all the queries. The most basic thing that these queries require is the ability to define
a search space accurately, which we have termed subgraph selection. Through our research,
we have developed algorithms which can perform subgraph selection in time linear in
the number of commits.
In the process of developing the algorithms, we proved several useful general properties of
repository graphs, including a tight bound on the number of edges. We also proved properties
about our own data structures. We exploited the properties we found and as a result our
algorithms all run in linear time or better.
To complete our definition of subgraph selection, we presented three algebraic operators
which allow the user to select any subgraph in a repository using only individual commit
IDs.
A.1 Properties
parents
List of one or two parents, which are also collapsed nodes
children
List of one or two children, which are also collapsed nodes
head
The node with the lowest topological sort position
tail
The node with the highest topological sort position
size
The number of commits between head and tail, inclusive
sort position
Topological sort position of the head
A.2 Methods
nodes()
Returns a list of nodes contained by the collapsed node in topological sort order
get_nodes_from(node n)
Return all contained nodes between n and tail
get_nodes_to(node n)
Return all contained nodes between head and n, inclusive
get_nodes_between(node a, node b)
Return all contained nodes between a and b, inclusive
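The interface above could be realized along the following lines. This is a sketch under our own assumptions: commits are identified by their integer topological sort positions, they are held in a sorted Python list, and range lookups use binary search from the standard `bisect` module. We also treat `get_nodes_from` as inclusive of n, which the description above does not state.

```python
from bisect import bisect_left, bisect_right


class CollapsedNode:
    """A run of commits stored as sorted topological sort positions."""

    def __init__(self, positions):
        self._nodes = sorted(positions)
        self.parents = []    # one or two parent collapsed nodes
        self.children = []   # one or two child collapsed nodes

    @property
    def head(self):
        return self._nodes[0]

    @property
    def tail(self):
        return self._nodes[-1]

    @property
    def size(self):
        return len(self._nodes)

    @property
    def sort_position(self):
        return self.head

    def nodes(self):
        return list(self._nodes)

    def get_nodes_from(self, n):
        # Assumed inclusive of n.
        return self._nodes[bisect_left(self._nodes, n):]

    def get_nodes_to(self, n):
        return self._nodes[:bisect_right(self._nodes, n)]

    def get_nodes_between(self, a, b):
        lo = bisect_left(self._nodes, a)
        hi = bisect_right(self._nodes, b)
        return self._nodes[lo:hi]


node = CollapsedNode([3, 4, 7, 9])
```

Locating each boundary takes logarithmic time, and copying the slice is linear in the number of nodes returned.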
References
[1] S. Cassidy. Generalising XPath for Directed Graphs.
[3] A. B. Kahn. Topological sorting of large networks. Communications of the ACM 5 (11),
pages 558–562, 1962.