
RDFJoin: A Scalable Data Model for Persistence and Efficient Querying of RDF Datasets

James P. McGlothlin
The University of Texas at Dallas, Richardson, TX

Latifur R. Khan
The University of Texas at Dallas, Richardson, TX

jpm083000@utdallas.edu

ABSTRACT
Recent research has emphasized the design of database schemata and data models constructed to allow scalable and efficient querying of large quantities of semantic web data. The objective is to create a solution that is viable for very large RDF datasets, thus allowing RDF to be used as a solution for development of global databases. The bottleneck in the querying of large RDF datasets is performing joins and unions. Because an RDF dataset is a collection of simple three-column tuples, most queries involve many such joins. This paper introduces RDFJoin, a new data model specifically designed to eliminate or greatly reduce the cost of these joins. RDFJoin utilizes bit vectors to efficiently store an entire collection in a single column of a tuple. These bit vectors also allow the use of high speed bit masking operations to join these collections. Using this approach, we create tables that can be accessed and queried more efficiently. RDFJoin provides tables that implement persistent sextuple indexing. Additionally, we introduce join tables that store the results of join executions. We assert that our novel solution reduces the overhead of RDF queries and dramatically improves performance. RDFJoin is a persistent solution using relational database storage. Our experiments demonstrate that RDFJoin consistently outperforms current state-of-the-art technology for querying stored data and even compares favorably to main memory solutions. RDFJoin is truly a scalable and persistent data model for RDF data storage that improves the performance of queries.

lkhan@utdallas.edu

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Database Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permissions from the publisher, ACM. VLDB 09, Copyright 2009 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

1. INTRODUCTION

The World Wide Web Consortium [1] defines the RDF (Resource Description Framework) data format as the standard mechanism for describing and sharing data across the web. All RDF datasets can be viewed as a collection of triples, where each triple consists of a subject, a property and an object. As the semantic web grows in popularity and enters the mainstream of computer technology, RDF datasets are becoming larger and more complex. There is increasing need for and interest in viable solutions to improve the performance and scalability of queries against this data.

It is simply not feasible to access and query large amounts of RDF data directly from the RDF documents. Therefore, we store the RDF data in a column-store relational database, a storage mechanism that has performed favorably in numerous prior research papers [3] [4] [5] [9] [20] [26] [27]. The primary focus of our solution is to reduce the need for and cost of joins and unions in query implementations. This is vital to supporting scalable query access. It is easy to see that in a database of 100 million RDF tuples, the cost of a query that requires several nested loop joins will be astronomical [5]. RDFJoin capitalizes on previous cutting edge research including vertical partitioning [5] and sextuple indexing [8]. In this paper, we document the improvements we have added with our design. We propose and detail a new persistent data model using bit vectors that is highly efficient and closely resembles the main memory solution provided by the Hexastore project. Our data model also includes join tables that further optimize the performance of join queries. We provide query analysis and experimental results that show that RDFJoin does in fact achieve the goals of efficiency and scalability. We tested seven queries with datasets of more than 44 million RDF triples from the Lehigh University Benchmark (LUBM) [2]. RDFJoin consistently outperformed current state-of-the-art solutions. In all these experiments, RDFJoin increased the performance gain as the size of the dataset increased, demonstrating the scalability of this solution. The remainder of this paper is organized as follows: In Section 2, we document the prior research upon which we based our design, and we specify the enhancements we have made. In Section 3, we specify the exact structure of our database solution. In Section 4, we describe implementation details including the process used to convert RDF documents into this data format. In Section 5, we evaluate several queries from prior research and benchmarks. We specify how these queries can be implemented efficiently with our solution. Section 6 describes the results of our experiments. Section 7 presents areas for future work. Finally, in Section 8, we present our conclusions.

2. RELATED WORK

RDFJoin expands upon the work done in prior research papers. We use column store databases, index tables, vertical partitioning and bit vector representations. We have integrated the advantages of each of these approaches, while providing our own enhancements to the technology.

2.1 Column Store Databases

For this project, we don't attempt to provide our own mechanism for persistent storage. Instead, we store the data in relational databases, and use Java Database Connectivity (JDBC) [28] to communicate with the database. We have chosen to utilize column-oriented databases, commonly called column stores, to capitalize on the advances and findings of considerable prior research [3] [4] [5] [9] [20] [26] [27]. Prior research [27] shows that the column store approach is especially suitable for data that needs to be optimized for read-only access. In [5] and [3], Abadi et al. show that column store databases provide a good implementation for RDF storage. Many of the features of column store databases provide significant benefit for the RDFJoin data model. While we do support insertion of new RDF triples, we do not allow direct updates or deletions of triples in our database. Therefore, we need a database that is optimized for read access. As all of our tables are ordered by numeric identification numbers, this provides opportunity for column-based optimizations. Also, we store large bit vectors in several of our tables. Column stores provide a good framework for compression of specific columns in a table [4]. Furthermore, we often need to access only one member of a tuple, and column stores are specifically designed to eliminate the cost of projections and the need to access entire tuples.
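Since bit vectors are not a native relational type, one concrete possibility for the stored column format is to serialize the vector to a byte array held in a BLOB column. The sketch below illustrates this general idea with java.util.BitSet; the class and method names are ours for illustration, not the actual RDFJoin storage code.

```java
import java.util.BitSet;

public class BitVectorStorage {
    // Serialize a bit vector for storage in a BLOB column.
    public static byte[] toBytes(BitSet bits) {
        return bits.toByteArray();
    }

    // Restore a bit vector read back from the database.
    public static BitSet fromBytes(byte[] bytes) {
        return BitSet.valueOf(bytes);
    }

    public static void main(String[] args) {
        BitSet objects = new BitSet();
        objects.set(2);   // e.g. objectid 2 appears with this property/subject
        objects.set(13);
        byte[] stored = toBytes(objects);
        // The round trip is lossless, so joins on the restored vector
        // behave exactly as on the original.
        System.out.println(objects.equals(fromBytes(stored)));
    }
}
```

With JDBC, the resulting byte array can be bound to a BLOB column with PreparedStatement.setBytes and read back with ResultSet.getBytes.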

2.2 Vertical Partitioning

In Scalable Semantic Web Data Management Using Vertical Partitioning [5], Abadi et al. propose structuring RDF databases by creating an ordered table for each property. RDFJoin draws heavily from this research. As suggested by Abadi et al., we utilize column-store databases to persist the RDF data. We provide a PSTable that is ordered first by property and then by subject. This table closely matches the vertical partitioning scheme Abadi et al. propose. We also chose vertical partitioning as the baseline against which to compare our results. One of the primary benefits of vertical partitioning is the support for rapid subject-subject joins. This benefit is achieved by sorting the tables by subject so that such joins can be performed as merge joins. Our project seeks to increase the performance not only of subject-subject joins but also of subject-object and object-object joins. As Abadi et al. state in their paper, with vertical partitioning these joins are not merge joins, as the columns are not sorted. However, by keeping three separate triples tables and normalizing the identification numbers (and thus standardizing the ordering), RDFJoin allows subject-object and object-object joins to be implemented as merge joins as well. Additionally, we execute subject-subject, subject-object and object-object joins and store the results as binary vectors in our join tables. This can be viewed as an implementation solution for the materialized path expressions proposed in the vertical partitioning paper. In addition to these improvements, we store much of our data as binary vectors and implement joins and conditions as binary set operations. This implementation provides significant performance improvement over storing each triple as a unique tuple. In summary, we utilize the concepts of column store databases, vertical partitioning by property, and ordered merge joins from Abadi et al. We enhance this technology with merge subject-object and object-object joins, and with join tables.

2.3 Hexastore

In Hexastore: Sextuple Indexing for Semantic Web Data Management [8], Weiss et al. propose supporting six indexes for RDF databases. These indexes are PSO, POS, SPO, SOP, OPS, and OSP (P is for property, O for object and S for subject). These represent every possible ordering of the RDF triples by individual columns. As their paper notes, the values for O in PSO and SPO are the same. So in reality, even though six tables are created, only five copies of the data are created, because the third columns are duplicated. RDFJoin relies greatly on the Hexastore research. In fact, we reproduce all six of these indexes using three tables. The three tables are PS-O, SO-P and PO-S. They are indexed on both of the first two columns, so they provide all six indexes while ensuring that only one copy of the third column is stored. Our project provides several new features built on top of Hexastore. Hexastore is a main memory solution; we propose persistent database storage for these tables. We store all the third column tuples in a bit vector, and provide hash indexing based on the first two columns. This reduces storage space and memory usage and improves the performance of both joins and lookups. Additionally, our join tables and use of bit vectors provide further increases in performance. In summary, we utilize the concepts of sextuple indexing from Weiss et al. We enhance this technology by providing persistence, join tables, and bit vector structuring.

2.4 BitMat

In BitMat: A Main-memory Bit Matrix of RDF Triples for Conjunctive Triple Pattern Queries [7], Atre et al. propose utilizing a large bit matrix to represent all RDF triples in memory. Each subject and object URI is converted to an integer that represents the corresponding bit index in the bit matrix. The purpose of this representation is to increase the speed of lookups and joins. Run length encoding is used to control the memory requirements for the bit matrix. Our project utilizes bit vectors in much the same way that BitMat uses the bit matrix. We also convert each URI to an integer and use this id number to locate the matching bit. However, BitMat assigns the id numbers to objects and subjects separately. This does produce a slightly smaller memory requirement, but it greatly increases the cost of subject-object joins. This is because the subjectid and objectid for the same URI need not match, so a lookup or translation is required. We normalize the integer assignment for all subjects and objects, guaranteeing that the ids are unique and that a URI will have the same bit location and identification number even if it is the subject in one triple and the object in another. In addition to the need to convert the id numbers, subject-object and object-object joins are cumbersome in the BitMat approach. Objects are represented by columns in the bit matrix. It is easy to retrieve subjects from the bit matrix and perform bit masking to perform subject-subject joins. This is because the subjects are rows in the BitMat and thus in consecutive memory. Since the objects are columns, each bit in a column will be in a different byte in memory, and none of it is contiguous. Thus, it is neither efficient nor simple to perform joins involving objects. RDFJoin stores each bit vector separately. So, an object bit vector can be directly accessed and is in contiguous memory. Our solution also provides persistence, whereas the BitMat solution is confined to main memory. By breaking the data up into individual bit vectors and storing them in tables, we reduce the memory requirements. Our solution only requires that we load those bit vectors relevant to the query, not the entire bit matrix. RDFJoin also eliminates the need to store empty bit vectors. This is because we store the bit vectors in relational database tables, and we can avoid an empty vector by simply not including that tuple. Furthermore, we provide join tables whose bit vectors are already folded, to use the terminology of the BitMat paper. In summary, we utilize the concepts of bit indexing from Atre et al. We enhance this technology by providing persistence, enhanced support for subject-object joins, and join tables.
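To make the contrast concrete: because RDFJoin draws subject and object ids from one shared sequence, a subject-object join reduces to a single bitwise and over two vectors, with no id translation. A minimal sketch using java.util.BitSet, with ids taken from the example dataset in Section 3 (James McGlothlin=6, Latifur Khan=8):

```java
import java.util.BitSet;

public class SharedIdJoin {
    // Subject-object join: a bitwise and of a subject vector and an object
    // vector, valid only because both use the shared SOID space.
    public static BitSet join(BitSet subjects, BitSet objects) {
        BitSet result = (BitSet) subjects.clone();
        result.and(objects);
        return result;
    }

    public static void main(String[] args) {
        BitSet worksForSubjects = new BitSet(); // subjects of worksFor: {6, 8}
        worksForSubjects.set(6);
        worksForSubjects.set(8);
        BitSet advisedByObjects = new BitSet(); // objects of advisedBy: {8}
        advisedByObjects.set(8);
        System.out.println(join(worksForSubjects, advisedByObjects)); // {8}
    }
}
```

Under BitMat's separate id spaces, the same join would first require translating every object id into the subject id space before any masking could occur.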

3. THE RDFJOIN DATA MODEL

The core of this paper is the design of a relational, column-oriented database schema that will provide high performance query support for very large RDF datasets. In this section, we define the structure of each table in our database and provide examples showing each populated table. Our tables make heavy use of bit vectors. Many of our columns are defined as bit vectors. These vectors represent a collection of subjects, objects or properties. The id numbers defined in Section 3.1 are used as indexes into these vectors. Bit vectors are not actually a type within relational databases. At the data model level, we do not wish to define restrictive specifics for how this data is stored. Section 4.1 provides details for our implementation of these vectors. Throughout this section we provide illustrations of the tables populated with example data. All of these tables represent the simple example dataset of 15 RDF triples shown in Figure 1. Note that normally these triples would include complete URIs, but we have simplified them for clarity and ease of reading.

<UTD, fullName, The University of Texas at Dallas>
<UTD, locationCity, Richardson>
<UTD, locationState, TX>
<ComputerScience, subOrganizationOf, UTD>
<James McGlothlin, worksFor, ComputerScience>
<James McGlothlin, position, GraduateStudent>
<Latifur Khan, worksFor, ComputerScience>
<Latifur Khan, position, Professor>
<James McGlothlin, position, ResearchAssistant>
<James McGlothlin, advisedBy, Latifur Khan>
<James McGlothlin, takesCourse, CSC7301>
<Latifur Khan, teacherOf, CSC7301>
<CSC7301, fullName, Data Mining>
<James McGlothlin, authorOf, RDFJoin>
<Latifur Khan, authorOf, RDFJoin>

Figure 1: Example RDF triples dataset (used for Tables 1-8)

3.1 URI Conversion Tables

We define two tables, SOIDTable (Subject-Object ID Table) and PropertyIDTable. There is nothing novel about these tables; they closely match the dictionary encoding of the vertical partitioning and Hexastore projects and the auxiliary mapping tables of the BitMat project. We show them here for completeness and to illustrate how we determine the index for the bit vectors. Throughout the remainder of the paper, we reference the subjectid, the propertyid and the objectid, indicating the id numbers from SOIDTable and PropertyIDTable. Subjectid and objectid refer to the SOID from the SOIDTable corresponding to a specific subject or object from a triple. Propertyid refers to the value in the PropertyID column of the PropertyIDTable that corresponds to a particular property. The SOIDTable has two columns: SOUri and SOID. SOUri is a string (VARCHAR) representing a subject or object URI or literal from an RDF triple. The SOID is a sequence. Each SOUri is assigned a unique SOID. Both SOUri and SOID are unique in the table. The SOIDs form a complete sequence with no gaps. In other words, if the max SOID is 15, then there is a tuple in the SOIDTable for every SOID from 1 to 15. Table 1 shows the SOIDTable for the example dataset.

Table 1: SOIDTable
SOUri                              SOID
UTD                                1
The University of Texas At Dallas  2
Richardson                         3
TX                                 4
ComputerScience                    5
James McGlothlin                   6
Graduate Student                   7
Latifur Khan                       8
Professor                          9
Research Assistant                 10
CSC7301                            11
Data Mining                        12
RDFJoin                            13

We can now use SOID as an index into any bit vector of subjects or objects. We also use the SOID in each of our triples tables defined in Section 3.2. This is very similar to the conversion of URIs into bit indexes used by the BitMat project. There is one very important difference though. In that project, the subjects and objects were each assigned id numbers separately. So if a URI appears as a subject in one triple and an object in another triple, it is given two different identification numbers. In our solution, we combine the subjects and objects. In Table 1, it can be seen that UTD is only assigned one ID number even though this string appears as both a subject and an object in the dataset in Figure 1. In RDFJoin, such a URI appears in the SOIDTable only once, and the same index number is used to represent it both when it is used as a subject and when it is used as an object. This allows us to join subjects and objects directly with the and operation against the bit vectors. The PropertyIDTable has two columns, PropertyUri and PropertyID. PropertyUri is a string (VARCHAR) representing a property in an RDF triple. PropertyID is a sequence. Each PropertyUri is assigned a unique PropertyID. Both PropertyUri and PropertyID are unique in the table. Table 2 shows the PropertyIDTable for the example dataset.

Table 2: PropertyIDTable
PropertyUri        PropertyID
fullName           1
locationCity       2
locationState      3
subOrganizationOf  4
worksFor           5
position           6
advisedBy          7
takesCourse        8
teacherOf          9
authorOf           10
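The SOID assignment can be sketched as a dictionary that hands out the next sequence value on first sight of a URI; the class and method names below are illustrative, not taken from the RDFJoin code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SoidDictionary {
    private final Map<String, Integer> soids = new LinkedHashMap<>();

    // Subjects and objects share one id space, so a URI keeps the same
    // SOID whether it appears as a subject or as an object.
    public int getOrAssign(String uri) {
        // size() is read before the new entry is inserted, yielding a
        // complete sequence 1, 2, 3, ... with no gaps.
        return soids.computeIfAbsent(uri, k -> soids.size() + 1);
    }
}
```

Replaying Figure 1 in order reproduces Table 1: UTD receives SOID 1 on the first triple and keeps that id when it later appears as the object of subOrganizationOf.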

3.2 The Triples Tables

We store the RDF triples in three tables: the PSTable, the POTable and the SOTable. The PSTable has columns PropertyID, SubjectID and ObjectBitVector. PropertyID and SubjectID combine as the primary key. They map directly to the PropertyURI and SubjectURI for the triples represented by this tuple. ObjectBitVector is a bit vector in which the on bits correspond to the objectid of each object that appears in a triple with this property and subject. Table 3 shows the PSTable for the example dataset. For illustration purposes, consider the first triple in the example dataset, <UTD, fullName, The University of Texas at Dallas>. The property, fullName, has PropertyID=1 according to the PropertyIDTable. The subject, UTD, has SOID=1 according to the SOIDTable. The object, The University of Texas at Dallas, has SOID=2. Therefore, the PSTable includes a tuple with propertyID=1, subjectID=1, and the second bit in the object bit vector set on.

Table 3: PSTable
PropertyID  SubjectID  Objects (bit vector)
1           1          0100000000000
1           11         0000000000010
2           1          0010000000000
3           1          0001000000000
4           5          1000000000000
5           6          0000100000000
5           8          0000100000000
6           6          0000001001000
6           8          0000000010000
7           6          0000000100000
8           6          0000000000100
9           8          0000000000100
10          6          0000000000001
10          8          0000000000001

The PSTable is ordered by property then subject, secondary indexed by subject and hash indexed by primary key. By ordering it by property, we provide the equivalent of vertical partitioning by property. We can now query over property, get all the subjects in order, and perform merge joins with other properties. This is also the PSO index in the Hexastore project. By indexing it over subject, we provide the SPO index in the Hexastore project. By hash indexing it over subject and property, we provide direct access to the object bit vector. This is one of the advantages of our solution. In a normal triple store, all three columns combine as the key. To collect all the objects for a subject and property, one would have to collect the object from each tuple with that subject and property. Our solution combines these objects into one bit vector. Thus, subject and property now serve as the primary key and can be used for hash indexing.

The POTable has columns PropertyID, ObjectID and SubjectBitVector. PropertyID and ObjectID combine as the primary key. The table is ordered by property then object, secondary indexed by object and hash indexed by primary key. This provides the merge join support for subject-object and object-object joins. We can access subjects or objects in order based on property. Furthermore, the id numbers are standardized across subjects and objects, so even a subject-to-object join can still be implemented as an ordered merge join. Table 4 shows the POTable for the example dataset.

Table 4: POTable
PropertyID  ObjectID  Subjects (bit vector)
1           2         1000000000000
1           12        0000000000010
2           3         1000000000000
3           4         1000000000000
4           1         0000100000000
5           5         0000010100000
6           7         0000010000000
6           9         0000000100000
6           10        0000010000000
7           8         0000010000000
8           11        0000010000000
9           11        0000000100000
10          13        0000010100000

The SOTable has the columns SubjectID, ObjectID and PropertyBitVector. The table is ordered by subject then object, secondary indexed by object and hash indexed by primary key. One of the criticisms of the vertical partitioning approach has been that it performs poorly when the query does not define the property. This table allows us to quickly obtain, from a subject or object binding, the set of properties that correlate. Table 5 shows the SOTable for the example dataset.

Table 5: SOTable
SubjectID  ObjectID  Properties (bit vector)
1          2         1000000000
1          3         0100000000
1          4         0010000000
5          1         0001000000
6          5         0000100000
6          7         0000010000
6          8         0000001000
6          10        0000010000
6          11        0000000100
6          13        0000000001
8          5         0000100000
8          9         0000010000
8          11        0000000010
8          13        0000000001
11         12        1000000000

All of the RDF triples in the dataset can be rendered from any one of these tables. Note that in each table, the total number of on bits is equal to n, where n is the number of triples. In our example dataset, there are 15 triples. Therefore, even though the PSTable and POTable have fewer than 15 tuples, in every table the number of on bits across all of the bit vectors is 15. Also note that the length of the subject and object vectors is always equal to the max(SOID) in the SOIDTable, and the length of the property bit vectors is always equal to the max(PropertyID) in the PropertyIDTable. In Table 1, the max(SOID) is 13. Therefore the number of bits in the object bit vectors in Table 3 and the subject bit vectors in Table 4 is 13. In Table 2, the max(PropertyID) is 10. Therefore, the number of bits in the property bit vectors in Table 5 is 10.

3.3 The Join Tables

In most RDF queries, joins are across properties. These joins can be classified as subject-subject joins, object-object joins and subject-object joins. Also, it should be noted that in almost all RDF databases the number of properties is much smaller than the number of subjects, objects or triples. We have created tables that store the bit vectors that result from these joins. We call these tables the SSJoinTable, the SOJoinTable and the OOJoinTable. Each table includes three columns: Property1, Property2 and the JoinBitVector. Property1 and Property2 combine as the primary key. We create a hash index to map the two propertyids directly to the corresponding bit vector. The JoinBitVector has length equal to the max(SOID) in the SOIDTable. In our example, this is 13, so all of the bit vectors in all three join tables are 13 bits long.

In the SSJoinTable, the JoinBitVector represents ?s in the SPARQL query:
SELECT ?s from TRIPLES WHERE { ?s property1 ?o1 . ?s property2 ?o2 }

Table 6: SSJoinTable
Property1  Property2  Subjects (bit vector)
1          2          1000000000000
1          3          1000000000000
2          3          1000000000000
5          6          0000010100000
5          7          0000010000000
5          8          0000010000000
5          10         0000010100000
6          8          0000010000000
6          9          0000000100000
6          10         0000010100000
7          8          0000010000000
7          10         0000010000000
8          10         0000010000000
9          10         0000000100000

In the SOJoinTable, the JoinBitVector represents ?x in the SPARQL query:
SELECT ?x from TRIPLES WHERE { ?x property1 ?o . ?s property2 ?x }

Table 7: SOJoinTable
Property1  Property2  Subjects (bit vector)
1          4          1000000000000
1          9          0000000000100
2          4          1000000000000
3          4          1000000000000
4          5          0000100000000
5          7          0000000100000
6          7          0000000100000
9          7          0000000100000
10         7          0000000100000

In the OOJoinTable, the JoinBitVector represents ?o in the SPARQL query:
SELECT ?o from TRIPLES WHERE { ?s1 property1 ?o . ?s2 property2 ?o }

Table 8: OOJoinTable
Property1  Property2  Objects (bit vector)
8          9          0000000000100

At first observation, it may seem like needless overhead to perform all these joins in case the information ever becomes needed. However, this task is performed during the preprocessing stage described in Section 4.2. We need only perform this step one time for any RDF dataset, and then the results are stored in the relational database where they are quickly accessible when needed. Section 6 shows that the resulting database is not significantly larger, and that the performance is substantially improved. Note that each property1, property2 pair need only be included in the SSJoinTable and the OOJoinTable one time, regardless of order. However, in the SOJoinTable, order is relevant, since SO is different from OS. In Table 7, there are tuples with property1=9, property2=7 and property1=10, property2=7. In the SSJoinTable, such tuples would never exist because they would be the same as 7,9 and 7,10. All of the tuples in Table 6 have property1 < property2, and that rule is enforced for all SSJoinTables and OOJoinTables. But in the SOJoinTable, 9,7 is not the same as 7,9. Obviously, we also do not include tuples where property1 and property2 are equal. If p is the number of properties, p² is the number of combinations of property1 and property2. In this matrix, property1 and property2 are equal p times. Therefore, the number of non-equal combinations is p²-p. This is the maximum number of tuples in the SOJoinTable. Since order does not matter in the SSJoinTable and OOJoinTable, the maximum number of tuples is halved: (p²-p)/2. As already noted, the number of properties is generally much smaller than the number of triples. In the LUBM dataset, the number of unique properties is 18. So the SSJoinTable and OOJoinTable will only have 153 tuples, and the SOJoinTable will have 306 tuples. In many queries, these tables are sufficient to completely calculate the results of the joins. This solution extends to queries involving multiple joins; they can be solved by retrieving the bit vectors for each join and combining them with the and operation. Even in the cases where the join tables cannot completely calculate the query results, the join tables can greatly reduce the query cost. In an RDF query, joins are against the same collection of triples. Therefore, there is little selectivity information with which to develop a query plan. It is a well-known query optimization strategy to push conditional selects lower in the query tree and earlier in the query plan. These join bit vectors can be used as conditionals to reduce the size of the set being joined prior to the join, generally to exactly the set for which the join will succeed. Furthermore, we can even reduce the set prior to one join using the bit vectors for other joins later in the query plan. Additionally, we can choose which joins to perform first. An RDF query can easily involve many joins, all over the same dataset. Some of these joins will be much more selective than others, and the join tables provide the information needed to optimally order the joins in the query plan.
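The join table size bounds above can be checked directly; the method names below are ours, for illustration only.

```java
public class JoinTableSizes {
    // SOJoinTable: order matters, so every ordered pair with p1 != p2.
    public static int maxSoJoinTuples(int p) {
        return p * p - p;
    }

    // SSJoinTable and OOJoinTable: order does not matter, so half as many.
    public static int maxSsOrOoJoinTuples(int p) {
        return (p * p - p) / 2;
    }

    public static void main(String[] args) {
        // LUBM has 18 unique properties.
        System.out.println(maxSsOrOoJoinTuples(18)); // 153
        System.out.println(maxSoJoinTuples(18));     // 306
    }
}
```

The tables can be even smaller in practice, since pairs whose join bit vector is empty are simply not stored.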

4. IMPLEMENTATION DETAILS

In this section we provide detailed information concerning our implementation. Section 4.1 describes how we implemented the bit vectors used in our tables. Section 4.2 provides details on how we transform a dataset of RDF triples into the tables we defined in Section 3.

4.1 Bit Vector Implementation (RDFSet)

A set of subjects, objects or properties can be viewed as a large bit vector. Each entity is converted to an id number that serves as the index into this bit set. Step 1 in Section 4.2 explains this conversion. For the purpose of encapsulating implementation specifics, we have developed a class RDFSet. We use this class to encapsulate, and modify as needed, details involving the memory representation and the storage format of the bit vectors. This class internally stores its data as a bit vector, because we have tested and determined that this is significantly faster than alternatives like sorted sets. RDFSet supports conversion to multiple storage formats including strings, byte arrays, SQL types (BLOB and CLOB), and input/output streams. It supports compression either for all main memory functions or during the conversion for storage. RDFSet methods include bit mathematics operations (and, or, xor, negate), get and add functions, and a bit counter. An RDFSet can be constructed either as an empty set or from the results of a SQL query. Also included are methods to provide the conversions to compressed strings, byte arrays, BLOBs, CLOBs, and iostreams. An RDFSet represents the collection of subjects, objects or properties from a given query. For example, consider the simple SQL query:
SELECT object FROM table WHERE property='name';
The resulting collection of objects can be used to construct an RDFSet.

4.2 Building the Database

Creating the RDFJoin database from the RDF dataset is a simple two-step process. In step 1, we go through each RDF triple and add it to the URI conversion tables and the triples tables. In step 2, we query these tables to create the join tables.

Step 1: For each RDF triple in the dataset:
1. Add the subject and object to the SOIDTable (if not there).
2. Add the property to the PropertyIDTable (if not there).
3. Get the id numbers from these tables.
4. Add the triple to the POTable, PSTable and SOTable. (Adding the triple consists of retrieving the correct tuple and turning the appropriate bit in the bit vector on.)

Step 2: Construct the join tables.
To construct these tables, we perform a nested loop through all the values of property. For each combination of properties, we query the subject-subject, object-object and subject-object joins. We use the PSTable to access subjects because it is sorted first by property, then by subject. Similarly, we use the POTable for objects. For the SSJoinTable and OOJoinTable, we only need to add one tuple for each combination of property1 and property2. We choose to always put the lower propertyID first. For the SOJoinTable the order matters. Property1 represents the property for the triple with the joined subject.
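The paper does not list RDFSet's source; the following is a minimal sketch of the interface described in Section 4.1, backed by java.util.BitSet, with compression and the SQL-type and stream conversions omitted.

```java
import java.util.BitSet;

// Minimal sketch of the RDFSet interface described in Section 4.1.
public class RDFSet {
    private final BitSet bits = new BitSet();

    public void add(int id)        { bits.set(id); }
    public boolean get(int id)     { return bits.get(id); }
    public int count()             { return bits.cardinality(); }

    // Bit mathematics operations, applied in place.
    public void and(RDFSet other)  { bits.and(other.bits); }
    public void or(RDFSet other)   { bits.or(other.bits); }
    public void xor(RDFSet other)  { bits.xor(other.bits); }

    // Negate over a fixed vector length, e.g. max(SOID).
    public void negate(int length) { bits.flip(0, length); }

    // One of several storage conversions; suitable for a BLOB column.
    public byte[] toBytes()        { return bits.toByteArray(); }
}
```

In the real class, a constructor taking a JDBC ResultSet would populate the set directly from a query such as SELECT object FROM table WHERE property='name'.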

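Item 4 of step 1 in Section 4.2 can be sketched as follows. The table is modeled as an in-memory map purely for illustration (the real implementation writes tuples to the column store via JDBC); the class and method names are hypothetical.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class TripleLoader {
    // PSTable modeled as (propertyid, subjectid) -> object bit vector.
    private final Map<Long, BitSet> psTable = new HashMap<>();

    private static long key(int propertyId, int subjectId) {
        return ((long) propertyId << 32) | (subjectId & 0xffffffffL);
    }

    // Adding a triple retrieves the matching tuple (creating it if absent)
    // and turns on the bit for the object id.
    public void addTriple(int propertyId, int subjectId, int objectId) {
        psTable.computeIfAbsent(key(propertyId, subjectId), k -> new BitSet())
               .set(objectId);
    }

    public BitSet objects(int propertyId, int subjectId) {
        return psTable.getOrDefault(key(propertyId, subjectId), new BitSet());
    }
}
```

Loading <UTD, fullName, The University of Texas at Dallas> with ids (1, 1, 2) sets the second bit of the object vector for (1, 1), matching the first row of Table 3; the POTable and SOTable are populated analogously.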
5. 5.1

QUERY EVALUATION Representation

For the purpose of clarity we represent hash index lookups in the style of arrays, For example we use PSTable[property, subject] to represent accessing the PSTable, using propertyid and subjectid and retrieving the bit vector of objects. Each of these functions is a simple hash index access and is computationally inexpensive.

To simplify this documentation, we skip the step of converting URIs to and from ids. Converting a URI to an id, is a simple, inexpensive hash index lookup. For purpose of our documentation, we italicize the URI value to indicate that it will be first converted to an id number.

perform a bit wise and operation between these two subject bit vectors (x&y).

5.2.2

Query #2

List professors who advise a particular student. x=POTable[type, Professor]; y=PSTable[advisor, GraduateStudent0]; x&y;

5.2

LUBM Queries

We tested RDFJoin using the Lehigh University Benchmark(LUBM) [2] dataset, the results of which are documented in Section 6.1. LUBM documents 14 queries. Additionally, the Hexastore project[8] defines 5 LUBM queries of its own. Due to space limitations, we cannot specify our implementation and show our results for all 19 of these queries. Furthermore, they still do not provide a clear variety of query types. Therefore, we attempt to choose queries that involve different types of joins or lookups. We have chosen and implemented 7 queries against the LUBM dataset. Four of these come from the queries documented by the LUBM project, and three of these are queries that we created to test different types of joins. As Table 9 illustrates, these queries include all three join types, all selection criteria, and both high and low selectivity factors. Thus, these queries provide a balanced variety of test scenarios. In Section 6, we document the performance results for these queries on data sets of over 44 million tuples. In this section, we document the query plan in terms of the RDFJoin tables. Table 9: Query attributes for section 5.2 queries # 1 2 3 4 5 6 7 Bound variables Join type type, object property, subject-subject Selectivity level high very high medium medium low high high

5.2.3 Query #3

Select professors who wrote any publication.
x=POTable[type, Professor];
y=SOJoin[type, publicationAuthor];
x&y;

5.2.4 Query #4

Select graduate students who have no advisor.
x=POTable[type, GraduateStudent];
y=SSJoin[type, advisor];
x&y.negate();
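One subtlety in this plan is the negate() operation: the complement of a bit vector is only meaningful relative to the total number of ids in the database. The sketch below illustrates this with Python ints, masking the complement to a hypothetical universe size N; the two bit vectors are toy values, not LUBM data.

```python
# Illustrative sketch of Query #4 (graduate students with no advisor).
# Python ints have no fixed width, so negate() must be masked to the
# number of subject/object ids in the database. N below is hypothetical.

N = 8  # total number of ids in this toy database

def negate(v, n=N):
    """Flip all n bits of v: the complement within the id universe."""
    return ~v & ((1 << n) - 1)

grad_students = 0b00100110   # hypothetical POTable[type, GraduateStudent]
has_advisor   = 0b00000010   # hypothetical SSJoin[type, advisor]: subjects of
                             # both a type triple and an advisor triple
no_advisor = grad_students & negate(has_advisor)
print(bin(no_advisor))  # 0b100100: grad students 2 and 5 lack an advisor
```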

5.2.5 Query #5

Select all undergraduate students.
x=POTable[type, Undergraduate];

5.2.6 Query #6

Find all the properties that directly relate two specific people.
x=SOTable[Person1, Person2];


5.2.7 Query #7

Select students that take a course from a specific professor.
y=POTable[type, Course];
y&PSTable[teacherOf, Professor0];
y&OOJoin[teacherOf, takesCourse];
At this point y is completely reduced to the set of qualifying courses, but we still have to join it with x, the set of students. Set x as an empty bit vector. For each on bit in y, let objectid be the bit index of that bit, and compute:
x = x | POTable[takesCourse, objectid];
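The final join loop above can be sketched as follows: iterate the on bits of y and OR together the POTable lookup for each surviving course id. The toy table contents are hypothetical.

```python
# Illustrative sketch of the Query #7 join loop: once y holds the relevant
# course ids, OR together the students taking each such course.
# All table contents below are hypothetical.

def on_bits(v):
    """Decode a bit vector into a sorted list of ids."""
    return [i for i in range(v.bit_length()) if (v >> i) & 1]

# POTable[takesCourse, objectid] -> bit vector of students taking that course
takes_course = {
    3: 0b0110,   # course 3 is taken by students 1 and 2
    6: 0b1000,   # course 6 is taken by student 3
}

y = 0b1001000   # courses 3 and 6 survived the earlier intersections
x = 0           # start from an empty bit vector
for objectid in on_bits(y):
    # A missing entry yields the empty set: a course nobody takes.
    x |= takes_course.get(objectid, 0)
print(on_bits(x))  # [1, 2, 3]: the students in the answer set
```

Because this loop performs one hash lookup and one OR per on bit, its cost grows with the number of qualifying courses rather than with the dataset size, which matches the non-flat performance curve reported for Query 7 in Section 6.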

5.3 Longwell Queries

In this section, we present the query plans for several of the example Longwell queries as documented in Scalable Semantic Web Data Management Using Vertical Partitioning [5]. All of these queries are subject-subject joins.

5.3.1 Longwell Query #1

List all types of data and a count of the number of subjects of each type. To determine the types, we perform the following query:
select objectid from POTable where propertyid=type;
As the POTable is ordered by property and then object, this will be an efficient query. To determine the count for each type, we simply execute:
x=POTable[type, objectid];
x.bitCount();
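The per-type count reduces to a population count over each subject bit vector. A minimal sketch, using hypothetical POTable rows and a bin-based popcount standing in for RDFSet.bitCount():

```python
# Illustrative sketch of Longwell Query #1: for each objectid of the
# 'type' property, the answer count is the popcount of the subject
# bit vector. The POTable rows below are hypothetical.

POTable = {
    ("type", "Text"): 0b101101,
    ("type", "Image"): 0b010010,
}

def bit_count(v):
    # Stands in for RDFSet.bitCount(); int.bit_count() needs Python 3.10+.
    return bin(v).count("1")

counts = {obj: bit_count(v)
          for (prop, obj), v in POTable.items() if prop == "type"}
print(counts)  # {'Text': 4, 'Image': 2}
```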


5.3.2 Longwell Query #2

Display the count of items of type text that define the property language.
x=POTable[type, text];
x&SSJoin[language, type];
x.bitCount();

5.3.3 Longwell Query #4

Display the count of items of type text that define the property language as French.
x=POTable[language, French];
x&POTable[type, text];
x.bitCount();

5.3.4 Longwell Query #7

List subjects that have property Point equal to End, and define property Encoding and Type.
x=POTable[Point, End];
x&SSJoin[Point, Encoding];
x&SSJoin[Point, Type];

Many of the results documented in Section 6 display almost flat lines for the performance of RDFJoin as the dataset increases. This is because, in many instances, the increase in the cost of RDFJoin depends only on the time required to read in the bit vector and to perform bitwise operations. The maximum time to access a table and to read in and decompress a bit vector, for 44 million tuples, is approximately 0.08 seconds. The time to perform a bit masking operation against such a bit vector is approximately 0.006 seconds.

6. IMPLEMENTATION RESULTS

All of the experiments here were performed on a system with an Intel Core 2 Duo CPU @ 2.80 GHz with 8 GB RAM, running 64-bit Windows. Our code was developed in Java and SQL, and we tested with the MonetDB and LucidDB column-store databases.

6.1 Experimental Results

We created a database using the LUBM dataset with 400 universities and 44,172,502 tuples. We implemented the seven queries documented in Section 5.2. We compared RDFJoin with vertical partitioning by property. Our solution outperformed vertical partitioning (referred to as VP in our graphs) in every experiment. Table 10 shows the percentage by which RDFJoin reduces the performance cost of each query. Not only is RDFJoin faster for all seven queries, but the percentage improvement increases as the dataset increases for all queries. This clearly demonstrates the scalability of the RDFJoin solution.

Table 10: Performance improvements

Query #   % faster (5 million tuples)   % faster (12 million tuples)   % faster (44 million tuples)
1         54.4%                         83.3%                          85.4%
2         29.1%                         67.3%                          75.8%
3         99.1%                         99.4%                          99.5%
4         96.4%                         97.3%                          97.7%
5         97.9%                         98.1%                          99.1%
6         99.0%                         99.5%                          99.5%
7         32.6%                         44.8%                          54.6%

Figure 2: Performance results for Query 1 (time in milliseconds vs. triples in millions; series: VP, RDFJoin)

The performance results for Query 1 are shown in Figure 2. Query 1 is a subject-subject join with a high level of selectivity. As vertical partitioning supports subject-subject merge joins, it is highly efficient, querying 44 million tuples in 814 milliseconds. However, vertical partitioning is hampered because the selection criteria is based on object, and vertical partitioning is sorted by subject. For the dataset above, RDFJoin executes the query in 119 milliseconds, an improvement of 85.4%.

Figure 3: Performance results for Query 2 (time in milliseconds vs. triples in millions; series: VP, RDFJoin)

The performance results for Query 2 are shown in Figure 3. Vertical partitioning performs better in Query 2, because the selection criteria is based on property and subject. This is the best performance for vertical partitioning in any of the queries. Vertical partitioning is able to query 44 million tuples in 0.472 seconds. This result is somewhat surprising since this query involves a subject-object join, but the selectivity factor of the bound subject seems to be the cause of the improved performance. However, RDFJoin executes the same query in 0.114 seconds, an improvement of 75.8%.
Figure 4: Performance results for Query 3 (time in seconds vs. triples in millions; series: VP, RDFJoin)

The results for Query 3 are shown in Figure 4. Query 3 involves a subject-object join and selection based on object. Vertical partitioning can only perform this join with a loop. Thus the performance is significantly better with RDFJoin. RDFJoin improved performance in this experiment by 99.5%.

Figure 5: Performance results for Query 4 (time in seconds vs. triples in millions; series: VP, RDFJoin)

The results for Query 4 are shown in Figure 5. As this is a subject-subject join with the property being the primary bound variable, we would expect vertical partitioning to perform well. Vertical partitioning did perform much better than in Query 3, but it did not perform as well as initially expected, taking 6.1 seconds to execute on a dataset of 44 million tuples. RDFJoin was able to perform this query in 0.14 seconds, an improvement of 97.7%.

Figure 6: Performance results for Query 5 (time in seconds vs. triples in millions; series: VP, RDFJoin)

The results for Query 5 are shown in Figure 6. Query 5 is a simple selection, with property and object bound, and very low selectivity. It returns more than 3 million tuples. Even though the property is bound, because the selection is based on object while vertical partitioning is sorted by subject, the performance of vertical partitioning is not remarkable, taking 9.8 seconds for 44 million tuples. RDFJoin, on the other hand, needs only a single bit vector access to perform this query, so RDFJoin actually shows better performance than in the other queries documented here. RDFJoin performs this query for 44 million tuples in 0.09 seconds, an improvement of 99.1%.

Figure 7: Performance results for Query 6 (time in seconds vs. triples in millions; series: VP, RDFJoin)

The results for Query 6 are shown in Figure 7. One of the criticisms of vertical partitioning has been that it performs inadequately when the property is not bound [8] [9]. However, in this experiment vertical partitioning performed significantly better when the property was not bound than in Query 5 where the subject was not bound. This is attributable to the fact that Query 6 has much higher selectivity than Query 5, and to the low number of predicates (18) in the LUBM dataset. RDFJoin returns its greatest performance in Query 6. This is because the size of the bit vector returned is only 18 bits, one for each predicate. For a dataset of 44 million tuples, vertical partitioning required 4.3 seconds to execute the query. RDFJoin was able to complete the query in 0.02 seconds, an improvement of 99.5%.

Figure 8: Performance results for Query 7 (time in seconds vs. triples in millions; series: VP, RDFJoin)

The performance results for Query 7 are shown in Figure 8. In this experiment, RDFJoin displayed its worst performance results. RDFJoin must execute a join loop to complete this query, so the performance results do not produce a flat line as in the other queries. However, RDFJoin still outperforms vertical partitioning by 54.6%, showing that the RDFJoin approach is beneficial even when loops cannot be avoided.

6.2 Compression

We took a bit vector of 7.9 million bits with 1.2 million bits on. We were able to compress this bit vector into 64.7 kilobytes. Our experiments show that, with compression on, the size of the RDFJoin database is comparable to that of a single triples table (+/- 15%). This at first seems counter-intuitive: RDFJoin stores three separate tables that include data about every triple, stores entire bit vectors in each tuple, and additionally stores 3 join tables. However, without loss of generality, an analysis of the dataset in Figure 1 will demonstrate how RDFJoin does not increase storage costs. To store these 15 RDF triples in a triples table would involve storing 45 strings. The RDFJoin tables store 23 strings, 13 in the SOIDTable and 10 in the PropertyIDTable. The PSTable includes 14 tuples, the POTable includes 14 tuples, and the SOTable includes 15 tuples. However, in each of these tables only a total of 15 bits are on in all of the bit vectors. These bit vectors can easily be compressed as they are so sparse. The three join tables combined include only 24 tuples, and again the bit vectors are sparse and highly compressible. Altogether, a triples store would include 22 additional strings, whereas RDFJoin creates a total of 8 tables and includes 67 additional tuples. It is easy to see that 67 tuples, which are all numeric and highly compressible, are not significantly more expensive to store than 22 additional variable-length strings. Furthermore, our example dataset uses simple strings, but in reality the subject, property and object strings are generally complete URIs and thus likely to be much longer.

6.3 Bit Operation Analysis

We tested the performance of constructing two RDFSets from SQL query results and performing the and operation between the sets. Both sets were 7.1 million bits. One set had 270,000 bits on; the other had 1.2 million bits on. The total time to construct the RDFSets and perform the and operation was 0.006 seconds.

6.4 Memory Usage

To perform all of the queries and tests outlined here, including those in Section 6.1 involving 44 million triples, our maximum memory usage was only 3.574 GB. This usage amount is quite favorable compared to the Hexastore research, which reported memory consumption in excess of 4 GB for only 6 million tuples [8].

6.5 Join Costs

In creating the join tables, we perform each type of join. In this section, we compare the time needed to execute the query with the time needed to access the results from the join table. For each test, the size of the dataset is identical. We determined that the time to perform the query is not proportional to the number of tuples returned, but rather to the number of tuples selected from the left hand side of the join. For example, if we are joining propertyID=5 and propertyID=7, then the deciding factor in the performance is the number of triples with propertyID=5. This once again illustrates the usefulness of the bit count operation provided in RDFSet. Depending on the selectivity factor, it is possible in the above scenario that there are 100,000 triples with propertyID=5 but only 1,000 with propertyID=7 and none with matching objectIDs. This selectivity information can be vital for query optimization. RDFJoin offers such selectivity data, whereas in the past this information was not available. These join costs are calculated as the cost of the joins performed during the process, described in Section 4.2, of building the join tables. However, by using both PSTable and POTable to do our queries, we are performing what the Hexastore [8] project referred to as VP2, which is essentially vertical partitioning with indexes that sort objects and not just subjects. To emulate vertical partitioning, we perform the queries using PSTable alone. In the performance tests in this section, we display the experimental results for both vertical partitioning (VP) and VP2.

Since our data set is the same for each of these queries and the results do not correlate to the size of the set returned, we display the minimum, the maximum and the average query time across all the join queries that were performed in Section 4.

Figure 9: Subject-Subject Join Times (time in seconds for minimum, average and maximum; series: RDFJoin, VP)

The subject-subject join times are shown in Figure 9. Vertical partitioning is optimally sorted for merge joins in this scenario. Nonetheless, executing such a join took as much as 3.7 seconds. RDFJoin provided access to the preprocessed join information consistently in 0.2 seconds or less.

Figure 10: Object-Object Join Times (time in seconds for minimum, average and maximum; series: RDFJoin, VP, VP2)

Figure 10 shows the cost of executing the object-object joins. Even though the number of tuples returned was significantly lower (for the LUBM dataset) than in the subject-subject joins, object-object join queries using the vertical partitioning data structure (i.e., the PSTable) took as long as 18.1 seconds. VP2 in this figure represents the cost of the same queries against the POTable. This table is sorted by object, and thus this query involves sorted merge joins, whereas querying the PSTable requires more expensive join loops. VP2 performed significantly better than VP, which shows the advantage of storing multiple versions of the triples tables. However, the VP2 solution still takes as much as 20 times as long as directly accessing the RDFJoin table.

Figure 11: Subject-Object Join Times (time in seconds for minimum, average and maximum; series: RDFJoin, VP, VP2)

Figure 11 shows the results for the subject-object joins. These results compare closely to those for the object-object joins for the exact same reasons. These charts clearly demonstrate the advantages of utilizing the join tables provided by RDFJoin as opposed to executing nested loop or even merge joins.
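The selectivity information discussed above can be sketched concretely: a popcount over each side's bit vector is a cheap cardinality estimate, and a query planner can use it to drive the join from the smaller side. The bit vectors below are hypothetical toy values, not measured LUBM data.

```python
# Illustrative sketch: using popcounts as selectivity estimates to pick
# the cheaper (smaller) driving side of a join, as suggested by the
# join-cost discussion. The property bit vectors are hypothetical.

def bit_count(v):
    # Stands in for RDFSet.bitCount(); int.bit_count() needs Python 3.10+.
    return bin(v).count("1")

subjects_p5 = (1 << 100_000) - 1   # 100,000 triples with propertyID=5
subjects_p7 = (1 << 1_000) - 1     # 1,000 triples with propertyID=7

# Drive the join from the side with the lower count.
left, right = ((subjects_p7, subjects_p5)
               if bit_count(subjects_p7) < bit_count(subjects_p5)
               else (subjects_p5, subjects_p7))
print(bit_count(left))  # 1000: the cheaper side drives the join
```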

7. FUTURE WORK

We plan further experiments so we can demonstrate a wider range of performance results. We would like to investigate using a database that incorporates specialized support for bit indexes; two such systems are LucidDB and FastBit. We feel there is the opportunity to customize the database schema to take best advantage of these features. This is especially true because MonetDB takes index instructions as mere suggestions and feels free to ignore them.

Most of our database interaction is done via JDBC. We chose the query plan that appeared the most efficient and implemented it step by step. One opportunity for future research is to incorporate this logic within the database. It would be desirable to have a database that understood this structure and the join tables and used them in its query optimization algorithms. Perhaps it would even be possible to directly describe and support the bit vector type, which we call RDFSet, within the database. Most promising in this area is the opportunity to improve the design of query plans. Our join tables can allow a query plan that supports greater pipelining and thus provides the user with results more rapidly. It is possible to use the bit vectors and join tables to determine the selectivity of joins and conditionals. Selectivity ratios are vital data in optimizing a query plan. The bitCount() operation on the RDFSet supplies data that can be used in determining the selectivity of a given join or lookup. Using this information to further optimize query plans is left as an area for future study.


8. CONCLUSIONS

We have proposed a solution for querying RDF datasets that is efficient and scalable. It implements current state-of-the-art solutions from prior research, including vertical partitioning and sextuple indexing, and RDFJoin provides significant improvements over that prior research. We concentrate primarily on increasing the performance of join and union operations. Subject-object joins and object-object joins are supported as linear merge joins. Additionally, join tables are constructed that reduce the cost of all join operations. Finally, bit vectors are used to reduce the cost of calculating the join operations and to increase the storage efficiency of the data model.

Join and union operations are the bottleneck that hinders queries over large RDF datasets. We assert that we have provided a solution that addresses and significantly reduces this bottleneck. Our experimental results show that our solution consistently outperforms vertical partitioning, a highly acclaimed solution for RDF storage. The degree of performance improvement actually increases as the size of the dataset increases, showing that our solution is scalable. RDFJoin stores the data in relational databases using industry standard mechanisms including JDBC and SQL. Our solution provides this persistence and still achieves performance comparable to main memory solutions. For all these reasons, we conclude that RDFJoin does in fact provide a scalable, efficient and persistent solution.

9. REFERENCES

[1] World Wide Web Consortium (W3C). http://www.w3c.org.
[2] Lehigh University Benchmark (LUBM). http://swat.cse.lehigh.edu/projects/lubm.
[3] Abadi, D.J. Column Stores for Wide and Sparse Data. In Proceedings of CIDR. 2007, 292-297.
[4] Abadi, D.J., Madden, S., and Ferreira, M. Integrating compression and execution in column-oriented database systems. In Proceedings of SIGMOD Conference. 2006, 671-682.
[5] Abadi, D.J., Marcus, A., Madden, S., and Hollenbach, K.J. Scalable Semantic Web Data Management Using Vertical Partitioning. In Proceedings of VLDB. 2007, 411-422.
[6] Chong, E.I., Das, S., Eadon, G., and Srinivasan, J. An Efficient SQL-based RDF Querying Scheme. In Proceedings of VLDB. 2005, 1216-1227.
[7] Atre, M., Srinivasan, J., and Hendler, J.A. BitMat: A Main-memory Bit Matrix of RDF Triples for Conjunctive Triple Pattern Queries. In Proceedings of International Semantic Web Conference (Posters & Demos). 2008.
[8] Weiss, C., Karras, P., and Bernstein, A. Hexastore: Sextuple Indexing for Semantic Web Data Management. In Proceedings of VLDB. 2008.
[9] Sidirourgos, L., Goncalves, R., Kersten, M., Nes, N., and Manegold, S. Column-Store Support for RDF Data Management: not all swans are white. In Proceedings of VLDB. 2008.
[10] Stockinger, K., Cieslewicz, J., Wu, K., Rotem, D., and Shoshani, A. Using Bitmap Indexing Technology for Combined Numerical and Text Queries. In New Trends in Data Warehousing and Data Analysis, Annals of Information Systems, Vol. 3. 2008.
[11] Rotem, D., Stockinger, K., and Wu, K. Minimizing I/O Costs of Multi-Dimensional Queries with Bitmap Indices. In Proceedings of SSDBM. 2006, 33-44.
[12] Wu, K. FastBit: an efficient indexing technology for accelerating data-intensive science. J. Phys.: Conf. Ser. 16.
[13] Harth, A., Kruk, S.R., and Decker, S. Graphical representation of RDF queries. In Proceedings of WWW. 2006, 859-860.
[14] Udrea, O., Pugliese, A., and Subrahmanian, V.S. GRIN: A Graph Based RDF Index. In Proceedings of AAAI. 2007, 1465-1470.
[15] Frasincar, F., Houben, G., Vdovjak, R., and Barna, P. RAL: An Algebra for Querying RDF. In World Wide Web. 2004, 83-109.
[16] Chen, H., Wu, Z., Wang, H., and Mao, Y. RDF/RDFS-based Relational Database Integration. In Proceedings of ICDE. 2006, 94-94.
[17] Sintek, M. and Decker, S. TRIPLE - An RDF Query, Inference, and Transformation Language. In Proceedings of INAP. 2001, 47-56.
[18] Gutierrez, C. RDF as a Data Model. In Proceedings of IFIP TCS. 2006, 7-7.
[19] Hurtado, C.A., Poulovassilis, A., and Wood, P.T. A Relaxed Approach to RDF Querying. In Proceedings of International Semantic Web Conference. 2006, 314-328.
[20] Stonebraker, M., Abadi, D.J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O'Neil, E.J., O'Neil, P.E., Rasin, A., Tran, N., and Zdonik, S.B. C-Store: A Column-oriented DBMS. In Proceedings of VLDB. 2005, 553-564.
[21] Gutierrez, C., Hurtado, C.A., and Mendelzon, A.O. Formal aspects of querying RDF databases. In Proceedings of SWDB. 2003, 293-307.
[22] Haase, P., Broekstra, J., Eberhart, A., and Volz, R. A Comparison of RDF Query Languages. In Proceedings of International Semantic Web Conference. 2004, 502-517.
[23] Das, S., Chong, E.I., Wu, Z., Annamalai, M., and Srinivasan, J. A Scalable Scheme for Bulk Loading Large RDF Graphs into Oracle. In Proceedings of ICDE. 2008, 1297-1306.
[24] Chen, H. Rewriting Queries Using Views for RDF/RDFS-Based Relational Data Integration. In Proceedings of ICDCIT. 2005, 243-254.
[25] Yan, Y., Wang, C., Zhou, A., Qian, W., Ma, L., and Pan, Y. Efficiently querying RDF data in triple stores. In Proceedings of WWW. 2008, 1053-1054.
[26] Abadi, D.J., Madden, S., and Hachem, N. Column-stores vs. row-stores: how different are they really? In Proceedings of SIGMOD Conference. 2008, 967-980.
[27] Harizopoulos, S., Liang, V., Abadi, D.J., and Madden, S. Performance Tradeoffs in Read-Optimized Databases. In Proceedings of VLDB. 2006, 487-498.
[28] Java Database Connectivity (JDBC). http://java.sun.com/javase/technologies/database.
[29] Longwell browser. http://simile.mit.edu/longwell.
[30] Schmidt, M., Hornung, T., Lausen, G., and Pinkel, C. SP2Bench: A SPARQL Performance Benchmark. In CoRR. 2008.
[31] Stocker, M., Seaborne, A., Bernstein, A., Kiefer, C., and Reynolds, D. SPARQL basic graph pattern optimization using selectivity estimation. In Proceedings of WWW. 2008, 595-604.
[32] Eberhart, A. Automatic Generation of Java/SQL Based Inference Engines from RDF Schema and RuleML. In Proceedings of International Semantic Web Conference. 2002, 102-116.
[33] Baolin, L. and Bo, H. HPRD: A High Performance RDF Database. In Proceedings of NPC. 2007, 364-374.
[34] Hausenblas, M., Slany, W., and Ayers, D. A Performance and Scalability Metric for Virtual RDF Graphs. In Proceedings of SFSW. 2007.
