Maria Indrawan-Santiago
Faculty of Information Technology
Monash University
Melbourne, Australia
maria.indrawan@monash.edu
Abstract: The demand to process large sets of data has increased in the last few years in both the scientific and business communities. To serve this demand, a number of new databases have been introduced that are not based on the relational model. This group of databases is popularly known as NoSQL. The underlying data and transaction models in NoSQL databases differ from those of relational databases. Organizations have shown much interest in adopting this new technology, which has created a buzz in database research. The fact that the underlying principles differ from those of relational models poses a dilemma for the database research community. Would this new technology change the shape of database research and industry?

Keywords: database, NoSQL, key-value pair, business analytics

I. INTRODUCTION
The database industry in the last few years has seen a number of non-relational DBMS introduced, such as MongoDB [21], Riak [32] and Neo4J [22]. These families of databases are popularly known as NoSQL databases. There are many debates over the role of these databases in serving our information needs. The pro-NoSQL camp claims that this technology is the future of databases [20]. On the other hand, the pro-relational camp claims that NoSQL databases have a major drawback in not providing strict treatment of data integrity [7].
This paper attempts to navigate the roads of current database research by exploring the technology and impact of NoSQL databases as the new kid on the block of database research. This exploration starts by looking at the trends in database research in the 40 years since the introduction of the relational database. An analysis of the different groups of NoSQL databases is given in section 3. The groups are compared on their data model, database transaction model and data analytics.

II. POST RELATIONAL DATABASE ERA
The relational model was introduced in 1970 by E.F. Codd [8]. It was introduced to overcome some problems of database systems built on the network and hierarchical models. A major drawback of the hierarchical and network models perceived by practitioners was their minimal support for data independence [9]. Hence, complex programs were often written to answer a simple query. From an academic perspective, the network and hierarchical models were inadequate since they did not have an accepted theoretical foundation. When E.F. Codd introduced the relational model in 1970, it was accepted by the academic community; however, practitioners were skeptical of the performance of a relational database compared to the existing network and hierarchical database systems [28]. Adoption of this type of database was very low until the introduction of System R in the mid-1970s [4]. In the early 80s, relational DBMS grew very competitive, and they have remained the dominant player in the database market to date.
Since the introduction of the relational model, a few other database models have been introduced, such as the object-oriented database, the object-relational database and the XML database. The object-oriented and object-relational databases were introduced as the object-oriented approach to software engineering gained momentum in the 90s. However, these databases never really became competitive in the marketplace. The reasons can be summarized as their lack of a theoretical foundation and their limited performance gain over the relational database [6]. The XML database suffers from challenges similar to those of the OODB. It aims to support the proliferation of XML documents; however, adoption of native XML databases is very limited. Major relational database vendors such as Oracle, Microsoft and MySQL have included XML support in their products, but native XML databases such as Tamino [29] have not captured much share of the database market. Initial excitement about the XML database may be attributed to the ever-increasing number of web applications and service-oriented architectures that use XML to standardize their data exchange format. On the surface, a great deal of data appears to be in XML format; however, when it comes to storing the data, the format reverts to a relational model. XML does not have a strong theoretical foundation that can guarantee data integrity to the same level that relational models offer.
In the last few years, a new group popularly known as NoSQL databases has emerged. Their origin can be traced back to the introduction of BigTable [26] and MapReduce [27] by Google in 2006 and 2004 respectively. The buzz around this new technology is still high. While a number of these database systems have been introduced to the market, there are still not enough reports on the adoption rate of this new technology. The technology could still be on the climbing slope of the first phase of the hype cycle [13]. It is interesting to note that the hype around this technology comes mainly from industry, since so much of the information on the technology is available in blogs and opinion pieces. There are limited academia based papers on
[Figure 1: diagram labels Availability, Partition and Consistency, with the pairings PA, CA and CP]
Figure 1. Brewer’s theorem

The PACELC model suggests that the tradeoff between consistency and availability is not solely based on partition-tolerance, but also on the existence of a network partition itself.
“If there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency (L) and consistency (C)?”
In addition to the tradeoff among the three CAP properties, the PACELC model, as depicted in Fig. 2, suggests that latency is an important factor to consider, as most distributed database systems use replication to ensure availability.

[Figure 2: diagram labels Partition, Availability, Consistency, Latency and Consistency]
Figure 2. PACELC model

An alternative protocol based on the principles of CAP, called BASE (Basically Available, Soft-state services with Eventual consistency), was introduced by eBay to replace the two-phase commit protocol [24]. It aims to support partial failure so that total system failure can be avoided. In other words, BASE focuses on the availability rather than the consistency of the database. BASE can be considered to take an optimistic approach to consistency, compared to the pessimistic view of ACID-based protocols. By taking an optimistic approach, BASE allows best-effort and approximate answers to exist in a database transaction state. This creates a soft state that needs to be managed carefully by applications before a state of full consistency is eventually reached. If the soft states in the database transactions are recognized and the patterns of these transactions are detected, applications can be designed to be aware of these patterns and manage the soft states accordingly. In eBay’s implementation of BASE, a message queue is used to eventually resolve conflicts that may arise during the soft state of a transaction.

C. Data Models
1) Key Value Pair
The idea of the key-value pair has existed in computing for many decades; it is a common data structure used in the development of file systems. In this type of database, data is stored as a pair of key and value. Each key is unique within a collection. Access to the values is achieved by means of the key-value association. The keys need to be kept in a data store that can be accessed quickly, for example a hash table. The binding from key to value varies depending on the programming language used. The values do not necessarily contain raw data; a value can contain another set of keys, which makes the cardinality of the value one of the major decisions in designing a key-value database. What would the values represent in the database? Is it going to be an attribute, an entity or another key?
One interesting concept that may be too radical for someone who follows relational databases religiously is the key-value treatment of consistency. In a key-value database, it is acceptable to have two different values of the same data at read time, which implies inconsistency in relational database theory. The inconsistency in the data is left to the application program or client to resolve. Eventually the inconsistency will be removed from the database following a procedure adopted by the DBMS. The inconsistency is the cost that the application needs to pay as a tradeoff for availability and/or latency, as described in the transaction models. In general, key-value pair databases are suitable for applications that process single-key transactions and perform a lot of reads. An example of such an application would be generating product catalogues on-the-fly. For this type of application the key-value pair database can deliver high throughput and low latency.
2) Column-Family
A column-family database could be considered a specific type of key-value pair model. A column-family database defines the structure of the values as a predefined set of columns, hence the name column-family. The definition of the column family could be considered the schema of the database. The main drivers of this approach are Google’s BigTable [26] and the open-source Apache HBase [15] modeled on it. This data model can confuse those who are familiar with the relational model, because its components share names with relational concepts. Column-family databases are made up of columns, column families and super columns.
• Column
A column is an atomic unit of information supported by the database. It is expressed as a pair of name and value.
• Super-column
Super columns group together associated columns that would be retrieved together from disk or that have a semantic association. They are useful for modeling complex data types such as an address.
• Column Family
A column family groups columns and super columns together into highly structured data. It is the closest counterpart of a table in the relational model.

Consider sample data containing personal details as depicted in Table I. The data can be represented in a column-family database to have:
• Super-columns of personal data and demographic.
• Columns of name, address, education, gender.
• A column family of person, identified by the key PersonID.

TABLE I. COLUMN-FAMILY EXAMPLE

Row key    Personal Data            Demographic            …
PersonID   Name         Address     Education    Gender
1          Smith, H     Rome        Master       F
2          Jones, S     NY          PhD          M
3          Chin, P      Sydney      Bachelor     F
4          Santos, J    Lima        PhD          F

The structure of the super-columns and the column family determines the schema of the database. However, it is not strictly fixed. A new column or super column can be added to the design with ease when the database is already in production. Another important difference between column-family and relational databases is that the rows in a column-family database do not need to be of the same degree, i.e., each row can have a variable number of columns/super-columns. Hence, a column family is very effective in supporting highly sparse data collections.

3) Document
A document database uses the concept of the key-value pair to store data. However, it imposes some structure on how the value is stored. Unlike the column-family model, which stores the values in a family of columns, the document database stores the values in a document-like structure such as XML or JSON. Hence, it provides more information about the structure of the data than key-value databases do, and this structure can be exploited to serve more query types. In key-value databases, only queries by key or key range are possible.

4) Graph Database
A graph database is a database that uses graphs as the means to represent its schema.
The graph database differs from the relational database in its treatment of relationships. In a relational database, what matters are tuples and their collection, called a relation. The relationship between individual tuples is implicitly defined by means of foreign keys and primary keys. In a graph database, both relationships and relations matter. For some applications, such as social networks, depicting the relationship between entities is important. The popularity of social networks has contributed to the resurrection of graph database research, which was active in the 80s and early 90s [3].

D. Business Analytics
NoSQL databases were designed to support the availability of data to end-users rather than to help gather business data for decision making. Hence, at this stage of NoSQL development, there is very limited querying support for business analytics applications. Hive [16] and Pig [23] are examples of query tools that run on top of the MapReduce framework. Unlike business analytics on the relational model, which allows non-technical personnel to interrogate the database for intelligence, an ad-hoc query in a NoSQL database demands skilled programmers to code the query, as many of the available databases provide only an API as their interface and do not have a high-level query language.

IV. COMPARISON OF NOSQL DATABASES
In this section, a sample of these DBMS is presented. The list is not meant to be exhaustive; its purpose is to highlight the different characteristics described in the previous section. The DBMS were chosen because each is either one of the first in its category or the leader in its category. These DBMS are compared based on data model, transaction model, license, ad-hoc query, indexes, and sharding.
• Data Model
The model supported by the database, which could be one of key-value pair, column-family, document or graph.
• Transaction Model
The database's priority in selecting the trade-off based on the CAP and PACELC models.
• License
The type of license for the software.
• Ad-hoc Query
Indication of whether the DBMS supports ad-hoc queries and, if so, what technique or programming language is used to write them.
• Indexes
Indication of whether the DBMS supports automatic maintenance of secondary indexes, in contrast to application-managed secondary indexes.
• Sharding
Indication of whether the DBMS supports automatic sharding, that is, the ability of the DBMS to automatically re-distribute data and replicas across servers when there is a change in resources such as the addition or removal of servers.
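
To make the Data Model criterion concrete, the first row of Table I can be written in three of the four models described in Section III.C. This is a minimal sketch, assuming plain Python dictionaries stand in for the stores; the key format `person:1`, the pipe-separated value encoding, the helper `find_documents` and the sparse row with key 5 are hypothetical and are not taken from any particular DBMS or from Table I.

```python
import json

# 1) Key-value: the store sees an opaque value, so only lookup by key is possible.
#    The pipe-separated encoding is an assumption made for this example.
key_value_store = {
    "person:1": b"Smith, H|Rome|Master|F",
}

# 2) Column-family: row key -> super-column -> column name -> value.
person_family = {
    1: {
        "Personal Data": {"Name": "Smith, H", "Address": "Rome"},
        "Demographic": {"Education": "Master", "Gender": "F"},
    },
    # Hypothetical sparse row (not in Table I): rows need not have the same
    # degree, so a whole super-column can simply be absent.
    5: {
        "Personal Data": {"Name": "(hypothetical)"},
    },
}

# 3) Document: the value is a self-describing structure such as JSON.
document_store = {
    "person:1": json.dumps(
        {"name": "Smith, H", "address": "Rome", "education": "Master", "gender": "F"}
    ),
}


def find_documents(store, field, wanted):
    """Query by a field inside the value, which the document structure allows."""
    return [
        key
        for key, raw in store.items()
        if json.loads(raw).get(field) == wanted
    ]


if __name__ == "__main__":
    # Key-value: the application must know how to decode the value itself.
    print(key_value_store["person:1"].split(b"|"))
    # Column-family: columns are addressed through their super-column.
    print(person_family[1]["Demographic"]["Education"])
    # Document: the store can answer a query on a field other than the key.
    print(find_documents(document_store, "address", "Rome"))
```

The contrast is in who understands the value: in the key-value form the application alone can interpret it, while the column-family and document forms expose enough structure for the database to address or query individual fields.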
TABLE II. NOSQL DATABASES COMPARISON
several DBMS listed in Table II that support automatic sharding such as Cassandra, MongoDB, RavenDB and Riak. Of the four data models, the graph data model is the one with the most challenges in supporting sharding. There is no support for sharding in graph databases at this stage, although some initial development has taken place [18].

V. CONCLUSION
We have explored and compared different types of NoSQL databases. The two main drivers for these databases are the need of many organizations to process large amounts of data and the fact that this data in some cases has no obvious tabular structure. The solutions proposed for these drivers are new, non-relational data models for distributed data processing. Although they are considered new in the database domain, these models are being developed based on existing and known theory. For example, the key-value store is a known data structure and has been used in many file systems. What makes the development of these databases novel is the design of the DBMS to support horizontal scaling. This is done by relaxing the ACID protocol and building a protocol that allows an eventually consistent state in the database. Availability is the main concern, not consistency. This departs far from the relational database, which puts consistency as its main focus. So, are we at a crossroad? Not exactly.
The two camps of relational and non-relational (NoSQL) are like two different roads to the same destination. It is like having Route 66 and the interstate highways. Route 66 is a well-known route, may be more scenic, offers multiple stops and may be a bit slower. On the other hand, interstate highways can make the trip faster, but only when all interstate highways are developed and coordinated and the roads are designed to handle the volume and rate of traffic.
There are still many open challenges in making NoSQL databases a mainstream solution. The challenges come from three different domains: research, adoption/development and end-user. From the research perspective there are still problems to solve, such as:
• Understanding latency and its influence on the overall design and performance of a database. PACELC is a step in the right direction, but a model with a strong theoretical foundation for understanding latency is still needed so that a new DBMS architecture can be designed accordingly.
• Sharding for graph databases. Unlike most of the other models, a graph database has a mutable structure at run time, hence it is difficult to design a sharding algorithm that provides high elasticity.
• Support for ad-hoc queries is still limited in many systems, hence support for OLAP or data warehousing queries is still very limited. There is a need to find a new model of business analytics or a way to provide a simple interface for decision makers to perform ad-hoc queries.
• Unlike relational models, NoSQL does not have a strong theoretical background to the model, hence developing a methodology for database design is challenging. Currently, to the author’s knowledge, no data modeling technique has been prescribed as a method to perform database design for the different data models.
• A new model with the support of a strong theoretical foundation that works well on a large distributed data set.
NoSQL databases were designed to overcome the limitations of relational databases in supporting distributed processing of data. Hence, some aspects that are important in a relational database may not be relevant in NoSQL, for example, query optimization. A query optimization engine is included in an RDBMS because relational models impose data independence and provide high-level support for ad-hoc queries. In NoSQL databases, queries are implicitly optimized during the design of the database by considering the type of distributed architecture available and the pattern of queries to be supported. NoSQL does not aim to be as flexible in serving ad-hoc queries as a relational database. Nevertheless, there is a need to serve ad-hoc queries more efficiently, but the approach should be different from query optimization in relational databases.
From the point of view of adoption and development, it is important to educate CTOs on the strengths and weaknesses of NoSQL databases. These databases should be seen as a complementary solution to relational databases for the data management problems in an organization, not as a replacement. It is important to understand the operational patterns in the organization to allow the development of best practices and methodologies that are appropriate for NoSQL database design and implementation. Without design tools, it will be difficult for this new technology to achieve mass adoption in the marketplace.
Many relational database end-users will find NoSQL difficult to use, as building queries in NoSQL requires more sophisticated programming skills. Providing a high-level query language will be important for the acceptance of this technology by end-users. It is also important to educate them that ad-hoc queries may take longer in NoSQL than in a relational database, hence their way of defining their information needs may need to change. These needs should be defined earlier, during the development of the database, rather than later during its operational stage.
NoSQL will not replace the relational database completely. Instead, it is complementary to relational databases in providing enhanced data management capability within an organization.

REFERENCES
[1] D.J. Abadi, "Consistency Tradeoffs in Modern Distributed Database System Design: CAP is Only Part of the Story," Computer, vol. 45, no. 2, pp. 37-42, Jan. 2012
[2] R. Agrawal, A. Ailamaki, P. A. Bernstein, E. A. Brewer, M. J. Carey, S. Chaudhuri, et al. The Claremont report on database research. SIGMOD Rec. 37, 3 (September 2008), 9-19
[3] R. Angles and C. Gutierrez. 2008. Survey of graph database models. ACM Comput. Surv. 40, 1, Article 1 (February 2008), 39 pages.
[4] M. M. Astrahan, M.W. Blasgen, D. D. Chamberlin, K. P. Eswaran, J. N. Gray, P.P. Griffiths, W.F. King, R.A. Lorie, P.R. McJones, J. W. Mehl, G.R. Putzolu, I.L. Traiger, B.W. Wade, and V. Watson. 1976.
System R: relational approach to database management. ACM Trans. Database Syst. 1, 2 (June 1976), 97-137.
[5] http://ayende.com/blog/tags/nosql
[6] S. Bagui: “Achievements and Weaknesses of Object-Oriented Databases”, in Journal of Object Technology, vol. 2, no. 4, July-August 2003, pp. 29-41
[7] http://cacm.acm.org/blogs/blog-cacm/99512-why-enterprises-are-uninterested-in-nosql/fulltext
[8] E.F. Codd. 1970. A relational model of data for large shared data banks. Commun. ACM 13, 6 (June 1970), 377-387.
[9] T.M. Connolly and C.E. Begg, Database Systems: A Practical Approach to Design, Implementation and Management, 4th ed., Addison Wesley, 2005.
[10] http://couchdb.apache.org/
[11] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. 2007. Dynamo: amazon's highly available key-value store. In Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles (SOSP '07). ACM, New York, NY, USA, 205-220.
[12] http://fallabs.com/tokyocabinet/
[13] J. Fenn and M. Raskino, Mastering the Hype Cycle: How to Choose the Right Innovation at the Right Time, Harvard Business Press, 2008.
[14] S. Gilbert and N. Lynch. 2002. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 33, 2 (June 2002), 51-59.
[15] http://hbase.apache.org/
[16] http://hive.apache.org/
[17] http://www.igvita.com/2009/02/13/tokyo-cabinet-beyond-key-value-store/
[18] http://jim.webber.name/2011/02/16/3b8f4b3d-c884-4fba-ae6b-7b75a191fa22.aspx
[19] A. Lakshman and P. Malik. 2010. Cassandra: a decentralized structured storage system. SIGOPS Oper. Syst. Rev. 44, 2 (April 2010), 35-40
[20] N. Leavitt, "Will NoSQL Databases Live Up to Their Promise?," Computer, vol. 43, no. 2, pp. 12-14, Feb. 2010.
[21] http://www.mongodb.org/
[22] http://neo4j.org/
[23] http://pig.apache.org/
[24] D. Pritchett. 2008. BASE: An Acid Alternative. Queue 6, 3 (May 2008), 48-55.
[25] http://ravendb.net/
[26] http://research.google.com/archive/bigtable.html
[27] http://research.google.com/archive/mapreduce.html
[28] A. Silberschatz, H.F. Korth, and S. Sudarshan, Database System Concepts, 5th ed., McGraw Hill, 2006.
[29] http://www.softwareag.com/Corporate/products/wm/tamino/default.asp
[30] M. Stonebraker and R. Cattell. 2011. 10 rules for scalable performance in 'simple operation' datastores. Commun. ACM 54, 6 (June 2011), 72-80.
[31] R. Sumbaly, J. Kreps, L. Gao, A. Feinberg, C. Soman, and S. Shah. 2012. Serving large-scale batch computed data with project Voldemort. In Proceedings of the 10th USENIX conference on File and Storage Technologies (FAST'12). USENIX Association, Berkeley, CA, USA, 18-18.
[32] http://wiki.basho.com/