
SourceQL: A Query Language for Source Code Management Systems
Tim Henderson and Steve Johnson
Case Western Reserve University
December 17, 2009

Abstract
Version control repositories contain vast amounts of data about the history of a
project. To date, people have not tried to extract much useful information from these
repositories, largely because data formats vary and access time is very slow. This
paper presents the foundations for a system which aims to vastly improve read times
for specific kinds of repository data and make queries against this repository easier to
write. In short, we want to treat the repository like a database as much as possible,
including reasonable access time and a powerful query language.
This paper focuses on storing and querying path data in the repository, i.e. parent-
child relationships between nodes. Specifically, it describes a system for storing a par-
allel, efficient representation of the connectivity graph, as well as an efficient algorithm
for finding all nodes on any path between two nodes.
We were not able to rigorously test the real-world performance of our algorithms
due to time constraints and changing designs, but we have opened up many areas for
future research.

1 Introduction
Since the invention of computers, code bases have become progressively larger and their
development has involved more people working in parallel. Early SCMs were created to
deal with this growth, with varying degrees of success. Today, many code bases consist of
hundreds of thousands of lines of code, with thousands of changesets in the repository.
As software projects become larger, it becomes more difficult to find out things about
the source code, such as where bugs are located, or whether duplicate code is being writ-
ten. Source Code Management (SCM) systems have become nearly ubiquitous in software
development environments, and the information stored in them can be useful in answering
certain questions.
These repositories hold a wealth of information about the source code. Currently, that
information is only used to perform transformations on text files, not to provide information
about those files. By treating a repository as a database and providing efficient and
intuitive ways to query it in a reasonable amount of time, we can discover new
ways to measure progress, find bugs, and identify general qualities in a software project.
Before any information can be read from a repository, the search space must be defined.
Since the repository is a directed acyclic graph, one cannot simply specify a linear chain
of commits, since many of those commits are based on branches and merges by different
committers who contribute equally. A more sophisticated and powerful way to define the
search space is to specify two end points and find all nodes that are on any path between
them.
This paper focuses exclusively on defining the search space, or “subgraph selection.” To
do this in a reasonable amount of time, the entire repository must be indexed. This index
must be updated incrementally for each new commit and can be optimized significantly due
to the nature of version control repositories.
Section 2 outlines some special characteristics of repository graphs, which helps to deter-
mine what is possible, what is efficient, and what is correct. Section 3 describes an efficient
way to create and maintain an index of the parent-child relationships in a repository graph.
Section 4 describes an algorithm for subgraph selection on this index. Section 5 describes the
exact file structure of the graph representation introduced in section 3. Section 6 defines
set operators on the output of subgraph selection. Section 7 outlines various approaches we
tried and abandoned. Finally, section 8 gives some conclusions of our current research.

2 Assumptions
Since a repository is a special case of a graph, we can make the following assumptions based
on the nature of version control.

1. New commits are based on existing data

2. The commit time of a node is always less than that of its children

3. A directed graph with all arrows parent → child or child → parent is acyclic

4. Edges and nodes never change once created

5. No node has more than two in-edges

2.1 Commits Based on Existing Data


A commit must be based on data that exists at the time of the commit, meaning that new
information cannot be added later. This restriction is enforced by the interface of each SCM.

2.2 Commit Time
Since a new commit is based on a previous commit, that previous commit must have been
made earlier in time. Therefore, the new commit gets a later time stamp than its parent.
This means that a sorting of commits by real-world commit time is also a topological
sort. However, we do not use commit time as a topological sort because individual users’
clocks may differ.

2.3 Directed Graph is Acyclic


We can construct a directed graph following either commit parents or children. Such a graph
will have no cycles.
This assumption follows from (2.2). If there were a cycle, then there would be a commit
with a child with an earlier time stamp, violating the topological sort.

2.4 No Changes to Edges or Nodes


This assumption follows from (2.1). Since a commit is based on existing data, if that data
changes, then the commit will no longer refer to valid data.
This assumption does not hold under certain special cases in some SCMs. Most modern
SCMs allow changes to the most recent commit. A few allow the user to convert branches
into a series of linear commits in order to make the history log more readable [2, ch. 6.4].
When this happens, it is impossible for a stateless observer to determine what changes were
made without scanning the entire repository from scratch.
Changes such as these cannot be made after a repository has been merged with another
copy of the repository in a distributed SCM, as this would make further merges impossible.
However, it might present a problem in centralized SCMs if the administrator decides to
make changes at the server level.
In any case, dealing with these edge cases is out of the scope of this paper.

2.5 In-Edges
According to all existing SCMs, a commit is a set of changes to a previous state of the
repository. If those changes are the result of the user, then a commit will have one in-edge
representing the previous commit. If those changes are the result of a merge from another
branch, then the commit will have two in-edges, one for each branch.
This assumption allows us to state that O(E) = O(V ) for all possible repositories, al-
though E will be higher on average than V .

3 Graph Representation
Most SCMs store repository information in a way that is optimized for reading and writing
as little information as possible. In order to optimize queries and to put an abstraction layer

over the SCM implementation, we create a separate graph representation of the repository.
Since a repository is a sparse DAG with a low upper bound on incoming edges, the most
efficient representation is adjacency lists.

3.1 Initialization and Update


Almost all SCMs offer “hooks” which allow custom scripts to be run when SCM commands
(e.g. commit, merge) are executed. SourceQL can update its graph representation using
these hooks.
Some SCMs, including Mercurial[5], provide information to certain hooks. Mercurial
provides the first ‘new’ commit in a set of incoming commits, so a list of all new commits
can be obtained by traversing those children which are not already in the graph.
If we have a list of newly-created commits, then all new commits can simply be added
to the vertex list, and then edges can be added to various adjacency lists. This can be done
without regard to order.
If the SCM does not provide this information, then it must be determined through other
means. The easiest way to do this is to start at the last node (a leaf node) and traverse the
parents which are not already in the graph.
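
The parent traversal described above can be sketched as follows. This is only an illustration under stated assumptions: `collect_new_commits`, the `parents_of` mapping, and the `known` set are hypothetical names, not part of SourceQL or of any SCM's hook interface.

```python
def collect_new_commits(head, parents_of, known):
    """Walk parent links from `head`, stopping at commits already in the index."""
    new_commits = []
    seen = set()
    stack = [head]
    while stack:
        commit = stack.pop()
        if commit in known or commit in seen:
            continue  # already indexed, or already found during this walk
        seen.add(commit)
        new_commits.append(commit)
        stack.extend(parents_of.get(commit, ()))
    return new_commits


# Hypothetical history: a <- b, b <- c, b <- d, and e merges c and d.
parents_of = {"a": [], "b": ["a"], "c": ["b"], "d": ["b"], "e": ["c", "d"]}
known = {"a", "b"}  # commits already present in the graph representation
new = collect_new_commits("e", parents_of, known)
```

The walk never descends past a commit that is already indexed, so its cost depends on the number of new commits rather than on the size of the repository.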

3.2 Topological Sort


Maintaining a topological sort order of the repository is useful for a number of reasons. One
is that it allows us to refer to each commit as a 32-bit unsigned integer rather than relying
on whatever the SCM uses, which is usually either a revision number or a SHA1 hash. It
also allows us to make some optimizations to our searching algorithms.
Commit time provides a topological sort order in theory, but in practice, it is subject to
individual machines’ internal clocks, which might be wrong for various reasons. It is better
for us to maintain our own sort order. Fortunately, we can do that in constant time for a
single commit, and in linear time for a set of commits. One algorithm to do this is described
in Kahn[3].
The internals of SourceQL represent all nodes as their topological sort position to save
space and maintain a layer of abstraction over the SCM.
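
As a rough illustration of how sort positions could be assigned to a batch of commits, here is a minimal sketch of Kahn's algorithm [3]. The function name and the `children_of` adjacency mapping are assumptions made for this example; every node, including leaves, is assumed to appear as a key.

```python
from collections import deque


def assign_sort_positions(children_of):
    """Kahn's algorithm: number nodes so every parent precedes its children."""
    in_degree = {node: 0 for node in children_of}
    for kids in children_of.values():
        for kid in kids:
            in_degree[kid] += 1

    # Start from nodes with no parents (the root of the repository).
    ready = deque(node for node, degree in in_degree.items() if degree == 0)
    position = {}
    while ready:
        node = ready.popleft()
        position[node] = len(position)  # next unused integer
        for kid in children_of[node]:
            in_degree[kid] -= 1
            if in_degree[kid] == 0:  # all parents have been numbered
                ready.append(kid)

    if len(position) != len(children_of):
        raise ValueError("graph contains a cycle")
    return position


children_of = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
position = assign_sort_positions(children_of)
```

Each node and edge is handled once, which matches the linear bound quoted above for a set of commits.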

3.3 Reduced Graphs


Since repository graphs have long, linear chains of nodes, we can make some significant
optimizations by compressing chains of nodes into single nodes called collapsed nodes.
Some SCMs already do this[4], but they have not documented their approaches.
We define a collapsed node as a node and all of its children until a node division. There
are 4 places where collapsed nodes are divided:
1. Root of a repository
2. Leaf (head) of a repository

3. Branch: collapsed node terminates at parent, new collapsed nodes start at each child

4. Merge: collapsed nodes terminate at parents, new collapsed node starts at child

Examples of each case are shown in figure 1. Any two of these cases can coincide.

3.4 Size of the Reduced Graph


In the common case, a vertex has exactly one incoming edge, except the root node. This
is true even when a branch is created. In a graph with no merges, the number of branches
is simply the number of leaf nodes. On the other hand, a merge node has two in-edges,
creating an “extra” edge. Using this information, we can calculate the number of branches
as:

b = E − V + h                                   (1)
where h is the number of leaf nodes. This is because each branch-merge pair creates an
extra edge, and every leaf represents an unmerged branch. Now we can calculate the number
of merges as:

m = b − h + 1                                   (2)
  = E − V + 1                                   (3)
because each leaf is an unmerged branch. The extra 1 accounts for the root node. In
reality, h (the number of leaf nodes) is negligible since most repositories have only a handful
of leaf nodes, each one representing a branch of development.
At most, the reduced graph will contain a node for every branch, merge, leaf, and root.
Some nodes might be both a branch and a merge, but it is impossible to determine how
many exist using only edge and vertex counts.
The expression E − V is important for our analysis, so it is relevant to note that E − V
is never more than 0.5V . The proof is as follows.
As shown in the beginning of the section, each branch-merge pair creates one extra edge.
This means that there can be at most one extra edge for half of the nodes, those that
represent the merge of a branch-merge pair. Therefore, the following holds:

max(E) = V + 0.5V = 1.5V (4)


max(E − V ) = 1.5V − V = 0.5V (5)
So the maximum number of branches is 0.5V + h and the maximum number of merges
is 0.5V + 1. The reduced graph will have at most b + m + h + 1 nodes. This is not an
improvement over the asymptotic size of the graph, but it is guaranteed to reduce the size
of any repository with chains of nodes with single in-edges. For most repositories, that is the
majority of nodes.
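
Equations (1) to (3) can be checked on small graphs with a few lines of code. The sketch below assumes a single-root graph given as a hypothetical `children_of` adjacency mapping; the function name is illustrative only.

```python
def branch_and_merge_counts(children_of):
    """Apply equations (1)-(3) to a single-root graph given as adjacency lists."""
    V = len(children_of)
    E = sum(len(kids) for kids in children_of.values())
    h = sum(1 for kids in children_of.values() if not kids)  # leaf nodes
    b = E - V + h   # equation (1)
    m = b - h + 1   # equation (2), which simplifies to E - V + 1
    return b, m


# One branch that is merged back: a -> b, a -> c, b -> d, c -> d.
diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
# One branch that is never merged: a -> b, a -> c.
fork = {"a": ["b", "c"], "b": [], "c": []}

diamond_counts = branch_and_merge_counts(diamond)
fork_counts = branch_and_merge_counts(fork)
```

For the merged branch the counts are one branch and one merge; for the unmerged fork they are one branch and no merges.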

[Figure 1: Examples of each node division case. Panels: (a) Case 1, (b) Case 2, (c) Case 3,
(d) Case 4, (e) an example graph demonstrating cases 4 and 5, (f) reduced graph calculated
from (e).]

We ran a script over the history of a prominent open source product, Mercurial, to
calculate the number of branches and merges. The repository “hg-stable” contains 1081
branches out of 10862 commits, or roughly 10%. There are an equal number of merges.
Therefore, roughly 80% of all commits will be skipped over during a tree traversal, which is
a significant speed improvement.

3.5 Maintaining the Reduced Graph


Maintaining the reduced graph is more complicated than maintaining the full graph because
collapsed node divisions must be detected, and because collapsed nodes may be split as the
result of a branch. In this section, we present an algorithm for updating the reduced graph
with a batch of new nodes.
To simplify the algorithm, we will only talk about adding single commits at a time. When a
new commit is added to the reduced graph, it can do one of three things:

1. Continue an existing collapsed node

2. End an existing collapsed node and start one or more new collapsed nodes

3. Split a collapsed node and create a new one

[Figure 2: Appending to a collapsed node. Panels: (a) Before, (b) After.]

3.5.1 Adding to a Collapsed Node


If a new commit has only one parent, and that parent has only that node as its child, then
we can simply look up the parent's containing collapsed node in the index (see Section 5),
append the new commit to the collapsed node's list of commits, update its tail and size
attributes (see Appendix A), and add a new entry for the new commit in the index.
The behavior of this case is shown in figure 2.

[Figure 3: Branching. Panels: (a) Before, (b) After.]

3.5.2 Detecting Branches and Merges
If a new node’s parent has more than one child or the node has more than one parent, then
a new collapsed node is created. The parent collapsed node or nodes are updated to point
to the correct child node.
The behavior of this case is shown in figure 3.

3.5.3 Splitting a Collapsed Node
If a commit A is added which inherits from a commit in the middle of a collapsed node C,
then the collapsed node must be split to maintain correctness. To accomplish this task, we
can find the new node's parent P in C, copy the node IDs after P to a new collapsed node
D, remove the commits after P from C, and update C to point to A and D as its children.
This case should arise often for collaborative projects. This might happen locally if a user
writes a series of commits, then branches from an earlier revision to try a different strategy.
The behavior of this case is shown in figure 4.

[Figure 4: Splitting. Panels: (a) Before, (b) After.]
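
The three update cases of section 3.5 can be sketched together. This is a simplified in-memory illustration, not the on-disk implementation: the class layout, `add_commit`, `_split_after`, and the `child_count` bookkeeping are assumptions introduced here, and commits are identified by their topological sort positions.

```python
class CollapsedNode:
    def __init__(self, commits, parents=()):
        self.commits = list(commits)   # sort positions, ascending
        self.parents = list(parents)   # parent collapsed nodes
        self.children = []             # child collapsed nodes


class ReducedGraph:
    def __init__(self):
        self.index = {}        # commit -> containing CollapsedNode
        self.child_count = {}  # commit -> number of children in the full graph

    def _split_after(self, cnode, parent):
        """Case 3: move the commits after `parent` into a new collapsed node D."""
        cut = cnode.commits.index(parent) + 1
        moved = cnode.commits[cut:]
        del cnode.commits[cut:]
        d = CollapsedNode(moved, [cnode])
        d.children = cnode.children          # D inherits the old children
        for child in d.children:
            child.parents = [d if p is cnode else p for p in child.parents]
        cnode.children = [d]
        for commit in moved:
            self.index[commit] = d

    def add_commit(self, commit, parents):
        self.child_count[commit] = 0
        for p in parents:
            self.child_count[p] += 1
        parent_nodes = []
        for p in parents:
            cnode = self.index[p]
            if p != cnode.commits[-1]:       # parent sits mid-chain: split
                self._split_after(cnode, p)
            parent_nodes.append(self.index[p])
        if len(parents) == 1 and self.child_count[parents[0]] == 1:
            # Case 1: the only child of a single parent extends its chain.
            parent_nodes[0].commits.append(commit)
            self.index[commit] = parent_nodes[0]
        else:
            # Case 2 (root, branch, or merge): start a new collapsed node.
            node = CollapsedNode([commit], parent_nodes)
            for pn in parent_nodes:
                pn.children.append(node)
            self.index[commit] = node


g = ReducedGraph()
for commit, parents in [(1, []), (2, [1]), (3, [2]), (4, [3]), (5, [4]),
                        (6, [3]),      # branches from the middle of the chain
                        (7, [5, 6])]:  # merges the two branches
    g.add_commit(commit, parents)
```

After this run the chain 1 to 5 has been split into [1, 2, 3] and [4, 5], commit 6 sits in its own collapsed node, and the merge commit 7 starts another.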

4 Path Selection
The most basic part of a SourceQL query is subgraph selection, similar to the XPath part
of an XQuery for expression. Subgraph selection is done using a modified version of XPath
syntax. XPath has been used elsewhere as a query language for directed graphs [1].
The most difficult XPath queries are expressions of the form A//B, where A and B are nodes. A
subgraph selection finds the subgraph which is the flow network between A and B, i.e. all
nodes which are on a path between A and B.

4.1 Basic Search Algorithm


Correct subgraph selection can only be performed with a full graph searching approach,
though some optimizations can be made based on the characteristics of version control
repositories. For the purposes of this paper, we will use a depth-first approach. We simply
traverse all children of A. If any of a node's children is on a path to the specified B, then
that node belongs in the returned subgraph.
We can make one optimization based on the topological sort. If a node's topological sort
position is greater than that of B, then we can terminate that branch immediately since it
cannot possibly be on a path to B. The full algorithm can be seen in figure 6.
This algorithm will not work for large repositories if run recursively on the stack, but the
solution to this problem is an implementation detail. This algorithm also does not remove
edges that point to vertices that are not on a path between A and B, but edges should be
checked later so as not to waste space storing a separate set of edges as well as vertices.

[Figure 5: A run of the search algorithm. Nodes shown, labeled by sort position: A (10), 11,
12, 13, 15, 16, 17, B (18), 19, 20.]

function select_path(V, start_node, end_node):
    visited = array(V.length)    // one flag per node, initially False
    V_2 = empty_set()

    visit_node(start_node, end_node, visited, V_2)

    return V_2

function visit_node(node, end_node, visited, subgraph):
    visited[node] = True
    if node == end_node:
        subgraph.add(node)
        return True
    if node.children.length == 0:
        return False
    // a node sorted after end_node cannot be on a path to it
    if node.sort_position > end_node.sort_position:
        return False

    on_path = False
    for child in node.children:
        if child in subgraph:
            on_path = True
        else if not visited[child]:
            r = visit_node(child, end_node, visited, subgraph)
            on_path = (on_path OR r)
    if on_path:
        subgraph.add(node)
        return True
    else:
        return False

Figure 6: Algorithm 1: Subgraph selection on a full repository graph

An example of this algorithm is shown in figure 5. Node labels represent topological sort
position. Bold edges were visited by the search, and dotted edges were not visited. In this
example, the edge 19-20 was not visited because 19 is greater than B’s sort position of 18, so
none of its children were traversed. The final output of this algorithm will include all nodes
except 11, 15, 19, and 20.
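
Since the recursive form may exhaust the stack on large repositories, a non-recursive variant can be sketched. Note that this is a two-pass reformulation rather than Algorithm 1 itself: it first collects the nodes reachable from A using the same sort-position pruning, then scans them from highest to lowest sort position so that children are decided before their parents. The `children_of` and `position` mappings and the example edges are hypothetical (the edges of figure 5 are not reproduced here), and the sort in the second pass adds a logarithmic factor that a bucket scan over integer positions could avoid.

```python
def select_path(children_of, position, start, end):
    """Return every node lying on some path from `start` to `end`."""
    # Pass 1: nodes reachable from start, pruned by topological sort position.
    reachable = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in reachable or position[node] > position[end]:
            continue
        reachable.add(node)
        stack.extend(children_of[node])

    # Pass 2: children have higher positions, so they are decided first.
    subgraph = set()
    for node in sorted(reachable, key=position.__getitem__, reverse=True):
        if node == end or any(kid in subgraph for kid in children_of[node]):
            subgraph.add(node)
    return subgraph


# Hypothetical graph; each node is named by its own sort position.
children_of = {
    10: [11, 12], 11: [15], 12: [13, 16], 13: [17], 15: [],
    16: [18, 19], 17: [18], 18: [], 19: [20], 20: [],
}
position = {node: node for node in children_of}
result = select_path(children_of, position, 10, 18)
```

In this example node 19 is pruned by its sort position, while 11 and 15 are visited but discarded because neither leads to node 18.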

4.2 Reduced Graph Search Algorithm


With the reduced graph, we can write an optimized version of the DFS subgraph selection
algorithm. Rather than looking at each node individually, we look at the graph of collapsed
nodes. Instead of checking that a node is equal to B, we determine if B is contained in each
collapsed node. If it is, then we treat that collapsed node as B itself. Instead of looking at
individual node topological sort positions, we just look at the sort position of the first node
in the collapsed node, which is indexed as shown in Appendix A.
The full algorithm is shown in figure 7.
There is only one real difference between this algorithm and the full graph version: the
call to expand_results(). This function turns the reduced graph node results into the full
set of nodes, accounting for the fact that the start and end nodes might appear anywhere in
a collapsed node, or even in the same collapsed node.
The running time of these subgraph selection algorithms is O(V ), since in the worst case
it is a full graph traversal visiting each node once.

4.3 Excluding Nodes


There will undoubtedly be cases where a query should be run on nodes on any path connecting
A and B, except for nodes on a path that goes through a set of nodes S. This kind of
power is useful for excluding entire branches of development. It is difficult to describe the
full utility of this type of query without describing the entire query language, but we will
briefly describe two strategies for handling it.
Since we have an index mapping commits to the collapsed nodes that contain them, we
could look up the collapsed node for each item in S, put them in a new set T, and treat each
collapsed node in T as a leaf node in the algorithm described in section 4.2.
If S is large, then looking up each collapsed node will be slow. In this case, instead of
using the collapsed node index, we could see if the topological sort position of each item in
S is in the range of whatever collapsed node is being explored at the current step. This will
cause more processing at each collapsed node, but will eliminate the initial index lookup.

5 File Structure
In order to efficiently access the information stored in the reduced graph representation, we
have designed a domain-specific file structure for the information. The first challenge
associated with the collapsed graph is that, since it represents the full graph, we must be
able to efficiently find a normal node inside the full graph. The second challenge is that,
since each collapsed node represents a path, we must be able to select all nodes which are
descendants of the searched node on that path. Finally, for joining the results selected from
each collapsed node to the other collapsed nodes, it is important to keep the output nodes in
sorted order so we can efficiently join the results via a sort-merge join.

function select_path_optimized(start_node, end_node):
    visited = array(|V|)    // one flag per collapsed node, initially False
    V_2 = empty_set()

    visit_node_optimized(start_node.collapsed_node, end_node, visited, V_2)

    return expand_results(start_node, end_node, V_2)

function expand_results(start_node, end_node, V_2):
    V_3 = empty_set()
    if V_2.size == 0:    // end_node was not reached from start_node
        return V_3
    first_cnode = start_node.collapsed_node
    last_cnode = end_node.collapsed_node
    if first_cnode == last_cnode:
        V_3.add(first_cnode.get_nodes_between(start_node, end_node))
    else:
        V_3.add(first_cnode.get_nodes_from(start_node))
        V_3.add(last_cnode.get_nodes_to(end_node))
    // the first and last collapsed nodes are only partially included
    V_2.remove(first_cnode)
    V_2.remove(last_cnode)

    for cnode in V_2:
        V_3.add(cnode.nodes())
    return V_3

function visit_node_optimized(collapsed_node, end_node, visited, subgraph):
    visited[collapsed_node] = True
    if collapsed_node == end_node.collapsed_node:
        subgraph.add(collapsed_node)
        return True
    if collapsed_node.children.length == 0:
        return False
    if collapsed_node.sort_position > end_node.sort_position:
        return False

    on_path = False
    for child in collapsed_node.children:
        if child in subgraph:
            on_path = True
        else if not visited[child]:
            r = visit_node_optimized(child, end_node, visited, subgraph)
            on_path = (on_path OR r)
    if on_path:
        subgraph.add(collapsed_node)
        return True
    else:
        return False

Figure 7: Algorithm 2: Subgraph selection on a reduced graph


5.1 Relational Schema


GraphToReducedGraph : <time:uint32, node_id:string(40), reduced_node_id:uint32>
time can be any topological sort id.
node_id is the id from the original representation.
reduced_node_id is the id for a node from the reduced graph.
ReducedGraphStructure : <parent:uint32, child:uint32>
ReducedGraphStorage : <reduced_node_id:uint32, nodetimes:bytes(60), length:uint32>
GraphToReducedGraph is kept sorted on the time attribute. An additional index is kept on the
attribute node_id, which allows the information to be efficiently accessed from either
direction (a common occurrence). The relation ReducedGraphStructure is kept in a B+ Tree
index which is indexed on parent. Finally, ReducedGraphStorage is indexed on
reduced_node_id and kept in a B+ Tree index as well.
Some explanation is in order for the attribute nodetimes. Its type is set to bytes(60),
indicating it is a blob field 60 bytes long. It is actually a sorted list of the nodes in
the full graph that are represented by the collapsed node, stored as up to 15 four-byte time
values. When there are more than 15 nodes on a particular path, a secondary file is made to
store the overflow nodes. Since the time attribute for nodes is monotonically increasing, and
newer nodes always have a greater time, insertion into the sorted list is simply appending
onto the list, i.e. the list is sorted for free.
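
To make the ReducedGraphStorage layout concrete, the following sketch packs one record with Python's `struct` module. The byte order, the zero padding of unused slots, and the reading of `length` as the number of times stored in the blob are all assumptions for illustration; the paper does not specify them, and overflow handling is left out.

```python
import struct

SLOTS = 15                         # 60 bytes / 4 bytes per uint32 time
RECORD = struct.Struct("<I60sI")   # reduced_node_id, nodetimes, length


def pack_record(reduced_node_id, times):
    """Serialize one ReducedGraphStorage tuple (assumed little-endian)."""
    if len(times) > SLOTS:
        raise ValueError("overflow nodes belong in the secondary file")
    blob = struct.pack(f"<{len(times)}I", *times).ljust(SLOTS * 4, b"\x00")
    return RECORD.pack(reduced_node_id, blob, len(times))


def unpack_record(raw):
    """Recover (reduced_node_id, times) from a packed record."""
    reduced_node_id, blob, length = RECORD.unpack(raw)
    times = list(struct.unpack_from(f"<{length}I", blob))
    return reduced_node_id, times


raw = pack_record(7, [3, 4, 9])
decoded = unpack_record(raw)
```

Because times only ever grow, appending a node amounts to writing the next free four-byte slot and incrementing `length`.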

5.2 Performance Characteristics


The file structure presented is optimized for the specific use case of doing subgraph
selection against the reduced graph. We are storing the reduced graph in an adjacency list
representation. This representation was chosen because version control graphs are always
sparse, with |E| never exceeding 3/2 |V| (see section 3.4). The first operation we define on
our representation is checking whether a node is represented by a collapsed node. With our
representation this can be done in O(lg N), where N is the number of nodes represented by the
current collapsed node.
The second operation defined is inserting a node into the reduced graph. At the collapsed
node level this will be a constant time operation since it is simply a list append. The harder
part is deciding which collapsed node to insert into, or whether to make a new one. This
has already been discussed in section 3.5, and it relies on the operations just discussed.
The final operation is selecting all nodes between two nodes, A and B. To find the
collapsed nodes involved in this operation we will apply the algorithm in figure 7. There
will be a leaf collapsed node and a root collapsed node. From the root collapsed node we want
to select the node A and all nodes after this node. From the leaf collapsed node we will
select node B and all nodes before this node. From the other collapsed nodes we will select
all nodes that they represent. Since our storage structure is a sorted list, the asymptotic
time for the select-all is linear with respect to the number of nodes being selected.

6 Algebraic Operators
Thus far we have defined an implementation of a Subgraph Select Operator. This operator
takes two nodes and outputs all nodes that lie between those two nodes. This is the most
basic operator in SourceQL. In the following section, we will define an algorithm which
outputs a subgraph as a list sorted by the nodes’ given topological sort order. Using the
pre-sorted data, it is possible to define three set operators on top of subgraph select that
run in linear time: subgraph union, subgraph intersection, and subgraph difference.

6.1 Guaranteeing Sorted Order of the Output of the Subgraph Select Operator
Since each node is labeled with a topological sort order position, we can traverse any subgraph
of the full graph and output a sorted list of nodes in linear time. The algorithm is given in
figure 8.
function output_sorted(V):
    working_set = set()
    sorted_nodes = list()
    for node in V:
        if node.parents.length == 0:
            working_set.add(node)
    while working_set.size > 0:
        // get the node in working_set with the lowest sort position
        this_node = min(working_set)
        working_set.remove(this_node)
        sorted_nodes.append(this_node)
        for child in this_node.children:
            if not sorted_nodes.contains(child):
                working_set.add(child)
    return sorted_nodes

Figure 8: Output nodes in the given topological sort order.
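
One way the min() step of figure 8 might be realized is with a binary heap, sketched below. This is a variation under stated assumptions rather than the authors' implementation: nodes are represented directly by their sort positions, `roots` and `children_of` are hypothetical inputs, and `children_of` is assumed to list only edges inside the selected subgraph. With a heap each extraction costs logarithmic time, so the sketch as written runs in O(V lg V).

```python
import heapq


def output_sorted(roots, children_of):
    """Return subgraph nodes in ascending topological sort position."""
    heap = list(roots)
    heapq.heapify(heap)
    queued = set(heap)          # nodes that have ever entered the heap
    ordered = []
    while heap:
        node = heapq.heappop(heap)   # lowest sort position still waiting
        ordered.append(node)
        for kid in children_of.get(node, ()):
            if kid not in queued:
                queued.add(kid)
                heapq.heappush(heap, kid)
    return ordered


# Hypothetical subgraph; each node is named by its own sort position.
children_of = {10: [12], 12: [13, 16], 13: [17], 16: [18], 17: [18], 18: []}
ordered = output_sorted([10], children_of)
```

The output is sorted because a child always has a higher position than the parent that introduced it, so nothing smaller than an already emitted node can enter the heap later.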

6.2 Subgraph Union
Subgraph Union, A ∪ B where both A and B represent subgraphs, is defined as the union
between the sets A.V and B.V, where A.V means the vertices of A. Since we can ensure
that both A.V and B.V are sorted, we can simply use a merge to construct the union. The
resulting algorithm runs in linear time, as do intersection and difference.

6.3 Subgraph Intersection


Subgraph Intersection, A ∩ B, is defined as A.V ∩ B.V using the same notation as defined
above. Intersection can be implemented by tracking a pointer in each sorted list, incrementing
the pointer at the smaller value at each iteration, and adding the value to the resulting set
if the values match.

6.4 Subgraph Difference


Subgraph Difference, A − B, is defined as A.V − B.V. Difference can be implemented the
same way as intersection, but with the opposite behavior when a match is found. Consider
a difference A − B. Like intersection, at each iteration the pointer at the smaller value is
incremented. If the values at the two pointers match, the value is dropped and both pointers
advance. Otherwise, when the pointer in A holds the smaller value, that value is added to the
resulting set.
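
The three operators can be sketched as two-pointer scans over sorted lists of sort positions. The function names and the plain-list representation are assumptions for illustration; each function makes a single pass over its inputs.

```python
def union(a, b):
    """Merge two sorted lists, emitting each value once."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def intersection(a, b):
    """Keep only the values present in both sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def difference(a, b):
    """Keep the values of `a` that do not appear in `b`."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:      # match: drop the value
            i += 1
            j += 1
        elif a[i] < b[j]:     # a[i] cannot appear later in b
            out.append(a[i])
            i += 1
        else:
            j += 1
    out.extend(a[i:])
    return out


a = [10, 12, 13, 16]
b = [12, 16, 17, 18]
```

All three outputs stay sorted, so the result of one operator can be fed directly into another.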

7 Dead Ends and False Starts


Throughout the course of the semester we explored many different ways to do things.
This section will catalog some of our failed attempts to define algorithms and structures for
subgraph selection.

7.1 AVL Tree Encoding and Performance


Initially we wanted to store the nodes represented by a reduced graph in a self-balancing
binary tree. However, upon first consideration of the problem we neglected the constraint
that nodes would only arrive in monotonically increasing order with respect to time. Thus
we developed a novel encoding for a binary tree onto an array to perform the same as a
sorted array on selecting all nodes between two nodes A and B.
The usual way to encode a binary tree to an array uses the relationships leftchild = 2i + 1
and rightchild = 2i + 2. This encoding is insufficient. To be efficient, the encoding must
satisfy the property of the array always being in sorted order (when nulls are neglected).

7.1.1 Encoding
Let |A| be the length of the array. |A| must be of the form |A| = 2^n − 1, and the root
node's position is (|A| − 1)/2.
If we consider each left and right subtree as a separate tree, we can apply the same
calculation to find their locations in the array. If p is the position of the parent node then
the left subarray is defined as [0, p − 1] and the right is defined as [p + 1, |A| − 1], where
|A| is the length of the parent array. See figure 9 for an example. This encoding has the
property of all of the left children being before the root of the subarray and all of the right
children being after the root. By recursively partitioning the array in this manner it will
always be in sorted order if we discard the nulls during a sequential traversal. Therefore,
this encoding satisfies the required property.

[Figure 9: An example tree and the array that encodes it. An X represents a non-existent
node.]
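
The recursive partitioning can be sketched in a few lines. The tuple-based tree, the `leaf` helper, and the example tree are hypothetical and chosen only for illustration; rebalancing is not shown.

```python
def leaf(key):
    """Hypothetical helper: a tree node with no children."""
    return (key, None, None)


def encode(tree, array, lo, hi):
    """Write tree = (key, left, right) into array[lo..hi], root at the midpoint."""
    if tree is None:
        return                      # the slots in this range stay None
    if lo > hi:
        raise ValueError("array too small for this tree")
    key, left, right = tree
    p = (lo + hi) // 2              # for the whole array this is (|A| - 1) / 2
    array[p] = key
    encode(left, array, lo, p - 1)   # left subarray  [lo, p - 1]
    encode(right, array, p + 1, hi)  # right subarray [p + 1, hi]


# A binary search tree of height 4, so |A| = 2^4 - 1 = 15.
tree = (6,
        (4, (2, leaf(1), leaf(3)), leaf(5)),
        (9, (8, leaf(7), None), leaf(10)))
array = [None] * 15
encode(tree, array, 0, 14)
in_order = [key for key in array if key is not None]
```

Reading the array from left to right and skipping the empty slots yields the keys in sorted order, which is the property the encoding is meant to guarantee.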

7.1.2 Storing Time in the AVL Tree


As mentioned previously, time represents a topological sort of the graph. Once a time is
assigned to a node, it is assigned permanently. The advantage of storing the references to
the node based on the topological sort, rather than the node id the VCS uses, is that time
represents a relationship between nodes. The node ids that VCSs use are often randomly
assigned identifiers, or ones that have a different meaning (such as the hash of the history
of the project up to that point). These types of ids do not lend themselves to doing a linear
time select of nodes between two nodes A and B.

7.1.3 Performance of the AVL Tree


Since an AVL tree is a height-balanced tree, it will always have O(lg N) lookups. Less obvious
is that insertion into an AVL tree will be O(lg N) with at most O(1) rotations. So why would
we use the AVL tree over the sorted list when our sorted list has O(1) insert time? The only
reason the sorted list performs better than the AVL tree is that the inputs are monotonically
increasing, i.e. they are already sorted. If we relax our constraint and do not assume that
the inputs will be monotonically increasing, the AVL tree will perform better than the sorted
list, since insert into the list will be O(N) while insert into the AVL tree remains O(lg N).
Finally, due to our novel encoding of the AVL tree, our select of all nodes between two nodes
A and B still performs in linear time.

7.2 Graph Patterns


When we first began trying to define what types of queries we would like to run on
a repository, we explored the idea of using GraphQL-style graph patterns. We soon found
that graph patterns are mostly useful for finding nodes with graph-based characteristics
like branches or merges, rather than data-based characteristics. We also found that XPath
syntax was more intuitive for specifying paths, since a repository is a directed acyclic graph
and lends itself to a hierarchical representation.

7.3 Transitive Closure


For a while, we entertained the idea of storing the transitive closure for the entire graph,
making subgraph selections take much less time. However, this would have given our index
a storage cost that is quadratic in the number of commits in the worst case, and was
therefore unusable.

8 Conclusions
8.1 Work to Date
We began our research by realizing that version control repositories contain a lot of data, and
that it might be useful to be able to query that data as if it were a database. We recognized
that it is not currently easy to do that due to limitations in current SCMs, so we researched
ways to make the information contained in repositories faster and more intuitive to access.
We began by defining classes of queries we thought might be useful. We tried to come up
with as many use cases as possible, and eventually began to see patterns in the requirements
of all the queries. The most basic thing that these queries require is the ability to define
a search space accurately, which we have termed subgraph selection. Through our research,
we have developed algorithms which can perform subgraph selection in linear time based on
the number of commits.
In the process of developing the algorithms, we proved several useful general properties of
repository graphs, including a tight bound on the number of edges. We also proved properties
about our own data structures. We exploited the properties we found and as a result our
algorithms all run in linear time or better.

To complete our definition of subgraph selection, we presented 3 algebraic operators
which allow the user to select any subgraph in a repository using only individual commit
IDs.

8.2 Future Work


We did not have sufficient time to implement and test our algorithms, so that is our next
step. We did test informally, noting that traversing a 10,000-commit repository in Mercu-
rial takes about two minutes while traversing our index of the same repository took about
three seconds, but we need to perform the same tests on hundreds of repositories. Fortu-
nately, there are more than enough open source projects available for us to run tests on, and
automation is not difficult.
The rest of our future work primarily involves the language, which does not yet have
a formal spec. Fortunately, it is largely based on XQuery, so we just need to find the
appropriate abstractions and extensions. In addition, we need to define a way to get data
to the user. The current version of the system merely reads and writes to the index using
its own data structures.
Beyond simple parent-child relationship data and committer/date/file metadata, we need
to give users a way to access the changes contained within each file in a changeset. Much can
be done with this data when paired with static analysis tools. Being able to search individual
changeset diffs would also allow us to implement history-wide search, which would make it
easy for developers to follow code refactoring.

8.3 Major Challenges


We have identified several major challenges for future work. The first is text search within a
changeset as described above. The second is defining an expressive and compact query lan-
guage which doesn’t require too much extra knowledge to be useful. The third is developing
an SCM-agnostic architecture which is easy to port across SCMs, since the choice of SCM in
a given project is largely arbitrary. The fourth is the raw speed barrier inherent in working
with a lot of compressed non-sequential data on disk.

A Reduced Graph API


The following methods and properties are members of the CollapsedNode class.

A.1 Properties
parents
    List of one or two parents, which are also collapsed nodes
children
    List of one or two children, which are also collapsed nodes
head
    The node with the lowest topological sort position
tail
    The node with the highest topological sort position
size
    The number of commits between head and tail, inclusive
sort_position
    Topological sort position of the head

A.2 Methods
nodes()
    Returns a list of nodes contained by the collapsed node in topological sort order
get_nodes_from(node n)
    Return all contained nodes between n and tail
get_nodes_to(node n)
    Return all contained nodes between head and n, inclusive
get_nodes_between(node a, node b)
    Return all contained nodes between a and b, inclusive
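
A minimal in-memory sketch of this API is given below, assuming that nodes are represented by their topological sort positions and kept in an ascending Python list. The `contains` method is a hypothetical helper that is not part of the API above, and treating get_nodes_from as inclusive of n is an assumption.

```python
from bisect import bisect_left, bisect_right


class CollapsedNode:
    def __init__(self, commits):
        self.commits = list(commits)  # ascending topological sort positions
        self.parents = []
        self.children = []

    @property
    def head(self):
        return self.commits[0]

    @property
    def tail(self):
        return self.commits[-1]

    @property
    def size(self):
        return len(self.commits)

    @property
    def sort_position(self):
        return self.head

    def contains(self, n):
        """Hypothetical helper: binary search for n in the sorted list."""
        i = bisect_left(self.commits, n)
        return i < len(self.commits) and self.commits[i] == n

    def nodes(self):
        return list(self.commits)

    def get_nodes_from(self, n):
        # Assumed inclusive of n.
        return self.commits[bisect_left(self.commits, n):]

    def get_nodes_to(self, n):
        return self.commits[:bisect_right(self.commits, n)]

    def get_nodes_between(self, a, b):
        lo = bisect_left(self.commits, a)
        hi = bisect_right(self.commits, b)
        return self.commits[lo:hi]


cnode = CollapsedNode([4, 5, 8, 9])
```

Each range method performs a binary search followed by a slice, so its cost is logarithmic in the size of the collapsed node plus linear in the number of nodes returned.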

References
[1] S. Cassidy. Generalising XPath for Directed Graphs.

[2] S. Chacon. Pro Git. http://progit.org/book/, 2009.

[3] A. B. Kahn. Topological sorting of large networks. Communications of the ACM 5 (11),
pages 558–562, 1962.

[4] M. Mackall. Towards a Better SCM: Revlog and Mercurial.
http://mercurial.selenic.com/wiki/Presentations?action=AttachFile&do=get&target=ols-mercurial-paper.pdf,
2006.

[5] B. O’Sullivan. Mercurial: The Definitive Guide. O’Reilly Media, 2009.
