
Cassandra as used by Facebook

Bingwei Wang (bw0338), Si Peng (sp0890), Xiaomeng Zhang (xz0398), Mark Bownes (mb7801), Rob Paton (rp7374) and Farshid Golkarihagh (fg7281)

December 15, 2010

1 Introduction

Cassandra is a distributed NoSQL database which was developed at Facebook to power inbox search. It was written in Java in 2007 by Avinash Lakshman, who had previously worked on Amazon's Dynamo database, and Prashant Malik. In 2008 it became an open source project, and in 2009 it was picked up by Apache as an incubator project; the Apache Incubator is a gateway for open source projects to become Apache software, and in 2010 Cassandra graduated to a top-level Apache project. Cassandra was created to be a fast, scalable and fault-tolerant system, and as such boasts a number of important features which distinguish it from its competitors. This report outlines these key features, particularly in the areas of architecture, scalability and fault tolerance, and goes beyond this to look at the future of the project.

2 How Facebook Uses Cassandra

Facebook created Cassandra to power its inbox search, and this is still where it is used today. Facebook allows for two kinds of search: item search and interactions. Item search allows a simple search of keywords; the key for this search is the user's ID. The super columns are the words that make up messages, and the columns are the individual message identifiers of messages containing the word being searched for. Interaction searches are used to search for a name and find all messages between the user and the searched-for person. As with keyword searching, the key used for interaction search is the user's ID; in this case, however, the super columns are the searched-for person's ID and the columns are individual message identifiers. To speed up searching, Facebook has its own special hooks built into its version of Cassandra to do intelligent hashing. Notably, as soon as a user clicks into the search bar, a message is sent to the Cassandra cluster priming it with the user's ID. This means that once a search is executed, the results are likely to already be in memory, so searching becomes a very quick process.
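The two search layouts described above can be sketched as nested dictionaries. All names, IDs and values here are invented for illustration; this is not Facebook's actual schema, only the shape of it.

```python
# Item search: row key = searching user's ID, super column = word,
# columns = identifiers of messages containing that word.
term_search = {
    "user:42": {
        "meeting": {"msg:1001": "", "msg:1007": ""},
        "lunch":   {"msg:1002": ""},
    }
}

# Interaction search: row key = user's ID, super column = other user's ID,
# columns = identifiers of messages exchanged with that user.
interaction_search = {
    "user:42": {
        "user:77": {"msg:1001": "", "msg:1002": ""},
    }
}

def messages_containing(user_id, word):
    """All message IDs for `user_id` whose messages contain `word`."""
    return sorted(term_search.get(user_id, {}).get(word, {}))

def messages_between(user_id, other_id):
    """All message IDs exchanged between `user_id` and `other_id`."""
    return sorted(interaction_search.get(user_id, {}).get(other_id, {}))
```

A keyword lookup such as `messages_containing("user:42", "meeting")` then reduces to two dictionary hops, which is why keying everything by the user's ID makes priming the cache so effective.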

3 Architecture

3.1 Physical

Cassandra is a distributed database, which means its data is spread over a number of computers (or nodes) which don't need to be in the same geographical area. A group of nodes is called a cluster.

The workings of the cluster are abstracted away when it comes to using it; the Facebook site doesn't need to know about nodes, or which node to access to get the required data. As of 2010, Facebook has a cluster of 150 nodes, spread over the east and west coasts of the USA. Collectively, the nodes store 50 TB of data. Facebook uses a system called Ganglia to monitor the nodes for faults, the most common of which are hard-drive failures. Sometimes the nodes need to be heavily synchronised, for example during complicated transactions, to avoid losing updates; for this, Facebook uses a program called Zookeeper. [AL09]

3.2 Logical

The Cassandra system can be broken down into three layers: core, middle and top [Ell].

- Core: messaging service, failure detection, cluster state, partitioner, replication
- Middle: indexes, compaction, commit log, Memtable, SSTable
- Top: hinted hand-off, read repair, monitoring, admin tools

The top layer is designed to allow efficient, consistent reads and writes using a simple API. The Cassandra API is made up of simple getter and setter methods and makes no reference to the database's distributed nature. Another element in the top layer is hinted hand-off. This occurs when a node goes down: the successor node temporarily becomes a coordinator, holding some information (a hint) about the failed node. The middle layer contains functions for handling the data being written into the database. Compaction tries to combine keys and columns to increase the performance of the system. The different ways of storing data, such as the Memtable and SSTable, are also handled here, and will be explained in the NoSQL section. The core layer deals with the distributed nature of the database, and contains functions for communication between nodes, the state of the cluster as a whole (including failure detection) and replication between nodes. These elements are explained in more detail in the Fault Tolerance and Scalability sections. [AL09]
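The hinted hand-off behaviour just described can be sketched in miniature. This is a toy model under simplifying assumptions (one hint holder, a flat node list), not Cassandra's actual implementation.

```python
class Node:
    """A toy cluster node with its own data and a list of held hints."""
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}    # key -> value stored on this node
        self.hints = []   # (intended_node_name, key, value) held for others

def write(nodes, target, key, value):
    """Write to `target`; if it is down, leave a hint on its successor."""
    if target.up:
        target.data[key] = value
    else:
        successor = nodes[(nodes.index(target) + 1) % len(nodes)]
        successor.hints.append((target.name, key, value))

def recover(nodes, target):
    """Bring `target` back online and replay any hints held for it."""
    target.up = True
    for node in nodes:
        remaining = []
        for owner, key, value in node.hints:
            if owner == target.name:
                target.data[key] = value   # deliver the missed write
            else:
                remaining.append((owner, key, value))
        node.hints = remaining

# Demo: a write aimed at a downed node is parked as a hint, then replayed.
nodes = [Node("a"), Node("b")]
nodes[0].up = False
write(nodes, nodes[0], "k", "v")
recover(nodes, nodes[0])
```

The point of the hint is that the write is not lost while the target is down; it is merely parked on the coordinator until the failed node returns.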

4 NoSQL

The second important feature of Cassandra is NoSQL. In computing, NoSQL is a term used to designate database management systems that differ from classic relational database systems in some way. Cassandra is deliberately not a traditional relational DBMS: it has entirely different strategies for handling read and write operations and for storing incoming data. A typical node's storage structure contains the following parts:

- Memtable: located in memory; one data structure maps to one Memtable object. This is where data is first written.
- SSTable: permanent data storage. Data is flushed from the Memtable to an SSTable when a specific threshold is reached.
- CommitLog: used for recovery purposes; it records changes so they can be replayed in the case of a crash or inconsistency.

This structure improves the efficiency of read and write operations. The performance of these operations in a classic relational DBMS and in Cassandra can be compared in the table below:

            Reading   Writing
MySQL       350ms     300ms
Cassandra   15ms      0.12ms

[Per10] Writing is very fast here, because the system is designed to facilitate writes as much as possible. When the user performs a write, the system first records the change in the CommitLog. After that, the data is written into the Memtable. The main reason the write operation is extremely fast is that all data is written into memory first, rather than to the hard disk. Once the size of the Memtable exceeds its threshold, the data is moved to an SSTable, which resides on the hard disk. Notably, data on disk is immutable: it cannot be modified, only deleted or combined. Because of this property, the whole system can tolerate concurrent write operations without blocking on disk resources, so massive write workloads are handled quickly. [Pop10b]
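The write path above (CommitLog, then Memtable, then flush to an immutable SSTable) can be sketched as a tiny in-process store. The class name and threshold are invented for illustration.

```python
class TinyStore:
    """Toy sketch of Cassandra's write path: log, memtable, flush."""
    def __init__(self, memtable_limit=3):
        self.commit_log = []        # durable, append-only record of writes
        self.memtable = {}          # fast in-memory key -> value map
        self.sstables = []          # immutable, flushed snapshots
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. log for crash recovery
        self.memtable[key] = value             # 2. write into memory only
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # 3. threshold crossed

    def flush(self):
        # The memtable becomes an immutable "SSTable"; it is never edited
        # in place afterwards, only merged or deleted during compaction.
        self.sstables.append(dict(self.memtable))
        self.memtable = {}

# Demo: with a limit of 2, the second write triggers a flush.
store = TinyStore(memtable_limit=2)
store.write("a", 1)
store.write("b", 2)
```

Because step 2 touches only memory and step 3 only ever appends whole immutable tables, concurrent writers never contend for in-place disk updates, which is the property the text credits for Cassandra's write speed.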

[Per10] Reading is a little slower, because the system needs to search not only the Memtable but also the SSTables. Every piece of written data uses a key to identify itself in the database, and the system uses these keys when searching for specific elements. To improve search efficiency, the SSTable format is specialised to support a range of lookup techniques. There are three integral parts in an SSTable: the data field, the index field (Idx) and the filter field (Bf). The data field holds the real content of the stored data. The index records each key and its corresponding data address. The filter field, also known as a Bloom filter, can quickly determine whether a given key is in the SSTable or not. [Pop10a] With the assistance of these structures, Cassandra can perform much faster reads and writes without sacrificing too much space.
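The read path can be sketched as: check the Memtable, then each SSTable from newest to oldest, consulting a per-table filter before touching the table's data. For simplicity the "Bloom filter" here is an exact set of keys, which behaves like an ideal filter with no false positives; a real Bloom filter is probabilistic and smaller.

```python
def read(key, memtable, sstables):
    """Toy read path: memtable first, then SSTables newest-first."""
    if key in memtable:
        return memtable[key]
    for table in reversed(sstables):        # newest flush shadows older ones
        if key not in table["filter"]:      # filter says definitely absent:
            continue                        # skip the (disk) lookup entirely
        if key in table["data"]:
            return table["data"][key]
    return None

# Demo data: an older and a newer SSTable plus a live memtable.
memtable = {"c": 3}
sstables = [
    {"data": {"a": 1}, "filter": {"a"}},                  # older
    {"data": {"a": 10, "b": 2}, "filter": {"a", "b"}},    # newer
]
```

The filter check is what keeps reads from degrading linearly with the number of SSTables: most tables are ruled out without touching their data at all.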

5 Scalability

One of the important factors when talking about scalability is the method used for dealing with new nodes, whether due to the expansion of data processing and storage or due to node outages (failures or maintenance tasks). Scaling for expanded data processing can take the following forms:

- Vertical scalability (scale up): resources are added to a single node in the system to increase throughput. This usually means adding CPU or memory. [Ter07]
- Horizontal scalability (scale out): the system is organised as a cluster, and throughput can be improved simply by adding more nodes and allowing the system to perform load balancing, distributing the load evenly between nodes. [Con09]

Both methods of scaling have advantages and disadvantages. The most important advantage of vertical scalability is minimal administration, as the computational power is concentrated in a single node. In contrast, with horizontal scalability the computational power is split between the nodes in the cluster, so a node outage will not have a major impact on the resources that remain available. [Hor07]

Facebook uses Cassandra with horizontal scalability [Pfe10]. As more users join the system, nodes are added to the cluster to absorb the extra load on the servers. The Facebook cluster can be represented as a ring network in which each node is placed at a position on the ring. In this architecture, when a node starts for the first time, a token is randomly picked which identifies the position of the node in the ring. Using the gossip algorithm, the token information (the position of the node) is spread between the different nodes in the cluster, which enables every node to know the positions of all other nodes in the ring [Pro10a]. Knowing the positions of all other nodes allows each node to route a request to the correct node. When a node joins the cluster, it will try to take over some of the load from nodes that are heavily loaded, so the cluster utilises the new resources automatically [AL09]. Cassandra's data model is another reason for its success in scalability. Unlike relational databases, Cassandra has no limit on the number of rows or columns. Its data model can be described as a very large table with many rows. Each row is identified by a unique key, which is an arbitrary string with no limit on its size. Each row has column families, and column families can have many columns (each with a name, value and timestamp).

(Figure: structure of a column family [Pro08].)

Super columns are sorted sets of named columns. Super columns are referred to as locality groups in Google's Bigtable. Column families are declared by the administrator prior to the start-up of the database; columns, however, can be added and deleted dynamically at run time.

(Figure: structure of a super column [Ham07].)
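As a concrete sketch, the nested data model above (row key, then column family, then optional super column, then column) can be written as nested dicts. All names and values here are invented for illustration.

```python
# key -> column family -> (super column ->) column -> (value, timestamp)
data = {
    "row-key-1": {                   # arbitrary-string row key
        "profile": {                 # a regular column family
            "name": ("Alice", 1),    # column: value plus timestamp
        },
        "friends": {                 # a super column family
            "close": {               # a super column: a sorted set of columns
                "u7": ("", 1),
                "u9": ("", 2),
            },
        },
    }
}

def get(row, family, column, super_column=None):
    """Fetch a column's value, descending one extra level for super columns."""
    cf = data[row][family]
    if super_column is not None:
        cf = cf[super_column]
    return cf[column][0]    # return the value, dropping the timestamp
```

The extra nesting level for a super column family is exactly the "column family within a column family" idea returned to in the Bigtable comparison later in this report.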

6 Fault tolerance

For Cassandra, fault tolerance is a very important concern, and it starts as soon as a piece of data enters the system. The data undergoes replication and is copied to multiple nodes by an automated process. For each replica of the data a timestamp is created, to help keep track of newer versions; this is used by the read repair system. Similarly, checksums are created for each replica, giving a method by which to ensure the authenticity and accuracy of the data after replication. This also means the system is eventually consistent: it takes some time for the data in each node to be brought up to date, but once it is, the system is consistent. This contrasts with traditional databases, which have strong consistency, meaning that after an update all nodes are immediately up to date rather than after the delay that eventual consistency implies [Lak08]. This can be stated precisely. Consider the number of nodes storing replicas of the data, N; the number of those replicas that must acknowledge a write before it succeeds, W; and the number of replicas contacted in a read operation, R. Then the following holds:

    W + R > N   gives strong consistency
    W + R <= N  gives eventual consistency

For replication the strategy used is important, and Cassandra provides three strategies for replicating data across nodes: rack aware, rack unaware and data shard. In all of these strategies, the first replica is always placed on the node within whose token's key range the key falls. A token exists for each node and specifies which section of the ring the node occupies; the token is thus a way of showing which part of the keyspace the node controls. Tokens are assigned by a partitioner, which uses one of two strategies: random or order preserving. The random approach gives the token as an integer in the range 0 to 2^127, giving an even distribution. The order-preserving approach gives the token as a string, but does not guarantee an even distribution. The range of a token is the distance between a node's token and the token of the next node; the keys a node is responsible for lie within this range.
Thus for all the replication strategies the first replica is placed at the node whose token claims the key; it is after this initial replica is placed that the three replication strategies differ. The rack aware method is most useful when multiple data centres are in use: the first replica is placed on the node owning the token, the second is placed in a different data centre, and from then on replicas are placed on different racks within the same data centre. Rack unaware is the opposite: replicas are simply placed on the closest nodes on the ring, regardless of which data centre they are in. Finally, the data shard approach allows more control than rack aware by letting the user specify the replication strategy for each data centre. [Bla10] [Vog07b] If a node ceases to work correctly, Cassandra deals with it by taking it offline, repairing it and bringing it back online. This means that fixing a node requires no system downtime, so Cassandra has no single point of failure: if a single node fails the system loses some capacity, but it still functions and provides access to all stored data. This also links in to the gossip algorithm used by the system, which keeps all nodes updated with important information about every other node. It also keeps offline and failed nodes updated with the same information, meaning that as soon as they come back online they have the correct information. To check whether a node has failed, Cassandra does not use a simple binary working/not-working flag as its failure detector. Rather, it uses an accrual-style failure detector, which maintains a level of suspicion for each node indicating how likely it is that the node has failed. This allows it to take fluctuations in the network into account.
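A toy sketch of the rack-unaware versus rack-aware placement described above. Real placement is driven by tokens and per-strategy configuration [Bla10]; here each node is simply a (name, data centre) pair and we walk the ring clockwise from the key's owner.

```python
def rack_unaware(ring, owner_idx, n):
    """Place n replicas on consecutive ring nodes, ignoring data centres."""
    return [ring[(owner_idx + i) % len(ring)][0] for i in range(n)]

def rack_aware(ring, owner_idx, n):
    """First replica on the owner, second in a different data centre,
    the rest on other nodes in the owner's own data centre."""
    owner_name, owner_dc = ring[owner_idx]
    replicas = [owner_name]
    # Second replica: first node clockwise in a *different* data centre.
    for i in range(1, len(ring)):
        name, dc = ring[(owner_idx + i) % len(ring)]
        if dc != owner_dc:
            replicas.append(name)
            break
    # Remaining replicas: other nodes in the owner's data centre.
    for i in range(1, len(ring)):
        if len(replicas) == n:
            break
        name, dc = ring[(owner_idx + i) % len(ring)]
        if dc == owner_dc and name not in replicas:
            replicas.append(name)
    return replicas[:n]

# Demo ring: two nodes per data centre.
ring = [("a", "east"), ("b", "east"), ("c", "west"), ("d", "west")]
unaware = rack_unaware(ring, 0, 3)
aware = rack_aware(ring, 0, 3)
```

Starting from node "a", the unaware strategy takes the next nodes regardless of location, while the aware strategy jumps to the "west" data centre for its second copy before filling in from "east".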
[Sas10] Other than node failure, there is one other important factor in making a system fault tolerant: keeping the data up to date and correct. Errors in the data must be avoided, or fixed as quickly and efficiently as possible. To this end, Cassandra uses read repair whenever a piece of data is requested. Once a read request is issued to a node, all nodes containing replicas of the data being read have their timestamps and checksums pulled up by the system. The checksums are compared, and if there is an inconsistency, the timestamps are checked to find the latest version of the data. This latest version then replaces the out-of-date or incorrect version that caused the checksums to differ: a write request is automatically sent to the stale node, allowing it to update the data it holds, once again ensuring the consistency of the database. [Tar10]
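The read-repair step can be sketched as follows. Replicas here are plain dicts carrying a value and a timestamp; differing values stand in for a checksum mismatch, and the "repair write" is an in-place overwrite. This is a simplification of the mechanism, not Cassandra's code.

```python
def read_with_repair(replicas):
    """Return the newest value and repair any stale replicas in place.

    replicas: list of {'value': ..., 'ts': int} dicts, one per node.
    """
    newest = max(replicas, key=lambda r: r["ts"])
    for r in replicas:
        if r["ts"] < newest["ts"]:   # stale copy detected: issue repair write
            r["value"] = newest["value"]
            r["ts"] = newest["ts"]
    return newest["value"]

# Demo: one replica missed the latest update.
replicas = [
    {"value": "old", "ts": 1},
    {"value": "new", "ts": 2},
]
result = read_with_repair(replicas)
```

After a single read, every replica holds the newest version, which is how ordinary read traffic gradually drives the system back to consistency.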

(Figure: keyspace and nodes on the ring [Ho10].)

The diagram illustrates the fault-tolerant nature of Cassandra, explaining how the keys between two nodes are passed around the ring if a node fails. This also demonstrates that the system has no single point of failure: because the keys are passed around the ring when a node fails, the data those keys point to can still be accessed easily by way of another node while the broken node is repaired.

7 Cassandra vs. Dynamo & Bigtable

Cassandra was described by Jeff Hammerbacher, who led the Facebook Data team at the time, as a Bigtable data model running on a Dynamo-like infrastructure. Bigtable is a fast and extremely large-scale database management system used by a number of Google applications. Dynamo is a highly available, scalable key-value storage system which supports part of Amazon Web Services. Cassandra inherits its cluster technology from Dynamo and borrows its data model from Bigtable. Like Dynamo, Cassandra is a distributed service provided by a set of connected nodes. At the same time, it provides the concept of the column family, which is similar to Bigtable; this differentiates Cassandra from the simple key/value data structure of Dynamo. [GD07] [FC06] [AL09]

7.1 Dynamo

There are many similarities between Cassandra and Dynamo. High scalability is one of them: Cassandra uses a ring infrastructure and consistent hashing, like Dynamo. In basic consistent hashing, each node is responsible for the region of the ring between itself and its predecessor, so that the departure or arrival of a node affects only its immediate neighbours while other nodes remain unaffected. It is therefore easy to insert or delete nodes without large-scale data transfer. However, both Dynamo and Cassandra improve on basic consistent hashing. To address the fact that the basic algorithm does not consider the differing load capacities of nodes, Dynamo assigns a high-capacity node to multiple positions on the circle, while Cassandra analyses load information on the ring and moves lightly loaded nodes along the ring to relieve heavily loaded ones. [GD07] [AL09] A balance between availability and fault tolerance is another similarity. A piece of data has many replicas on different nodes. Clearly, the more nodes required for a successful read or write, the higher the fault tolerance, but the less efficient the system. Cassandra solves this problem in the same way as Dynamo. If R and W represent the least number of nodes required for successfully reading and writing a piece of data, and N represents the number of nodes storing replicas, then the condition

    R > N - W

means that a read will always see at least one copy of the newest version rather than only old versions. This is called the quorum (consistency) protocol. [AL09] [Vog07a] One difference between Cassandra and Dynamo is that Cassandra is not a pure key/value store like Dynamo: it borrows its data structure (the column family) from Bigtable, which makes it easier to compress data and save storage space. [FC06] Additionally, they maintain data consistency in different ways. Cassandra omits the vector clocks that Dynamo uses to resolve version conflicts, because they take a long time; instead it gives each cell a timestamp, which decides which data is newer and should be kept. [AL09]
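The basic consistent-hashing scheme described above can be sketched as follows. The ring size and hash are simplified stand-ins (Cassandra's random partitioner uses a space of size 2^127), and the token values are invented for the demo.

```python
import hashlib

RING_SIZE = 2**16   # toy stand-in for Cassandra's 2**127 token space

def position(key):
    """Hash a key onto the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

def owner(tokens, key):
    """Find the owning node: the first node whose token is >= the key's
    position, wrapping around the ring.

    tokens: sorted list of (token, node_name) pairs.
    """
    pos = position(key)
    for token, name in tokens:
        if pos <= token:
            return name
    return tokens[0][1]   # wrapped past the largest token: back to the start

# Demo ring of three nodes at invented token positions.
tokens = sorted([(10_000, "n1"), (30_000, "n2"), (55_000, "n3")])
```

Adding a fourth node splits exactly one existing region in two, which is the locality property the text credits: only the new node's immediate neighbour is affected.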

7.2 Bigtable

The data model of Cassandra is similar to Bigtable's; it borrows these features from Bigtable: the column/column family, sequential writes (CommitLog -> Memtable -> SSTable), merged reads, and periodic data compaction. The first two have been explained previously. A merged read means that when a piece of data is read, its different versions are merged together to avoid conflicts. Periodic data compaction refers to the mechanism of merging, at regular intervals, the SSTables that are scattered around, to save storage space. [FC06] [Ho10] Compared to Bigtable, the super column is a distinctive concept of Cassandra: a super column family can be viewed as a column family within a column family. This means that you can access a column family within a super column family, and so on, as n-dimensional column families. With super columns, Cassandra can represent data in a richer way. [AL09]
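The merged-read and compaction ideas share one core operation: merge several SSTables, keeping only the newest (value, timestamp) pair per key. A minimal sketch, with invented demo data:

```python
def compact(sstables):
    """Merge SSTables into one, keeping the newest version of each key.

    sstables: list of {key: (value, timestamp)} dicts, in any order.
    """
    merged = {}
    for table in sstables:
        for key, (value, ts) in table.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    return merged

# Demo: a newer table shadows an older value for key "a".
old = {"a": ("1", 1), "b": ("2", 1)}
new = {"a": ("1b", 2)}
```

Because timestamps, not input order, decide which version wins, the same merge logic serves both a read (merging versions on the fly) and a compaction pass (rewriting many small tables as one).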

8 Others using Cassandra

8.1 Twitter

In March 2010, Ryan King revealed in an interview that Twitter would move from MySQL to Cassandra, first for storing the statuses table, which contains all tweets and retweets, and that over time Cassandra would completely replace the existing MySQL solution [Pop10c]. According to King, they considered several issues, examined a set of technologies against them, and finally chose Cassandra because it offered no single point of failure and scalable writes, and because a healthy and productive open source community supported it. In June, Twitter experienced poor performance resulting from over-capacity in internal sub-networks [Twi10a]. In July, they announced that they would switch back to MySQL-based storage for tweets as a change in strategy [Twi10b]. However, Twitter would still use Cassandra wherever it requires a large-scale data store, and its usage of Cassandra would only grow.

8.2 Digg

John Quinn of Digg announced in March 2010 that they were making large-scale changes to their system, abandoning MySQL in favour of a NoSQL alternative (namely Cassandra) [Dig10]. The resulting Digg (version 4) later proved unsuccessful in terms of reliability and acceptance, and Quinn himself was no longer employed [Kal10]. A recent post (17 October) criticised their rewriting everything from scratch as the real problem, leading to bad architecture, and noted that, in contrast, Facebook does not make gigantic changes all at once [Pro10b].

9 Future and Conclusion

These problems were unfortunate, but Cassandra is not to blame for everything. As Cassandra does have the desirable features explained above, none of these sites is likely to discontinue its use entirely. Riptano, a company established in April 2010, has been backing Cassandra since then. It worked with Digg to study the problems, and founder Matt Pfeil was confident in Cassandra itself, but recognised that there was a lot to be done before it comes close to comparing, in production environments, to something like MySQL [Hig10]. As work on Cassandra will not cease at the sites mentioned, there is little doubt that Cassandra will retain its popularity and accordingly grow and improve in implementation and performance. In another interview, Pfeil talked about the relationship between NoSQL and traditional relational databases, pointing out that there is definitely room for both in the world, and sometimes even in the same application [Ros10].

Given Cassandra's features and immaturity, it seems that it should currently be used as a complement to relational databases, as most of the sites mentioned are now doing.


References

[AL09] A. Lakshman, P. Malik. Cassandra - a decentralized structured storage system. The 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS 09), October 2009.

[Bla10] B. Black. Cassandra Replication & Consistency. http://www.slideshare.net/benjaminblack/introduction-to-cassandrareplication-and-consistency, April 2010.

[Con09] Burleson Consulting. Vertical vs. Horizontal scalability. http://www.dba-oracle.com/, 2009.

[Dig10] Digg. Saying Yes to NoSQL, Going Steady with Cassandra. http://about.digg.com/node/564, March 2010.

[Ell] J. Ellis. Open Source Bigtable + Dynamo. Open Source Convention 2009 (OSCON 09).

[FC06] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh. Bigtable: A Distributed Storage System for Structured Data. OSDI, 2006.

[GD07] G. DeCandia, D. Hastorun, M. Jampani. Dynamo: Amazon's Highly Available Key-value Store. ACM, 2007.

[Ham07] J. Hamilton. Facebook Cassandra Architecture and Design. http://perspectives.mvdirona.com/2009/02/07/FacebookCassandraArchitectureAndDesign.aspx, February 2007.

[Hig10] S. Higginbotham. Digg Not Likely to Give Up on Cassandra. http://gigaom.com/2010/09/08/digg-not-likely-to-give-up-on-cassandra/, September 2010.

[Ho10] R. Ho. BigTable Model with Cassandra and HBase. http://horicky.blogspot.com/2010/10/bigtable-model-with-cassandra-andhbase.html, October 2010.

[Hor07] Vertical vs. Horizontal scalability. http://www.scalingout.com/2007/10/verticalscaling-vs-horizontal-scaling.html, October 2007.

[Kal10] R. Kalla. Digg v4 Troubles are Symptom of a Bigger Problem. http://www.thebuzzmedia.com/digg-v4-troubles-are-symptom-of-abigger-problem/, September 2010.

[Lak08] A. Lakshman. Cassandra - A structured storage system on a P2P Network. http://www.facebook.com/note.php?note id=24413138919, August 2008.

[Per10] M. Perham. Cassandra Internals - Writing. http://www.mikeperham.com/2010/03/13/cassandra-internals-writing/, March 2010.

[Pfe10] M. Pfeil. Why does Scalability matter, and how does Cassandra scale? http://www.riptano.com/blog/why-does-scalability-matter-and-how-doescassandra-scale, October 2010.

[Pop10a] A. Popescu. Cassandra Read Operation Performance Explained. http://nosql.mypopescu.com/post/474623402/cassandra-reads-performanceexplained, March 2010.

[Pop10b] A. Popescu. Cassandra Write Operation Performance Explained. http://nosql.mypopescu.com/post/454521259/cassandra-write-operationperformance-explained, March 2010.

[Pop10c] A. Popescu. Cassandra @ Twitter: An Interview with Ryan King. http://nosql.mypopescu.com/post/407159447/cassandra-twitter-aninterview-with-ryan-king, February 2010.

[Pro08] Project Cassandra: Facebook's Open Source Alternative to Google BigTable. http://www.25hoursaday.com/weblog/CommentView.aspx?guid=c573171e-8e62-45b4-b85c-7b411b528e51, July 2008.

[Pro10a] M. Pronschinske. Cassandra NoSQL Database an Apache Top Level Project. http://css.dzone.com/articles/cassandra-nosql-database, February 2010.

[Pro10b] Proximity. DIGGing a Hole with Cassandra. http://blog.proximitychicago.com/post/2010/10/17/DIGGing-a-Hole-withCassandra.aspx, October 2010.

[Ros10] D. Rosenberg. Apache Cassandra gets boost from Riptano. http://news.cnet.com/8301-13846_3-20003945-62.html, May 2010.

[Sas10] R. Sasirekha. Apache Cassandra - Distributed Database. http://itknowledgeexchange.techtarget.com/enterprise-IT-tech-trends/apache-cassandra-distributed-database-part-ii/, December 2010.

[Tar10] T. Tarrant. Eventually Consistent. http://wiki.apache.org/cassandra/Operations, November 2010.

[Ter07] G. Terrill. Think you know what scalability is? http://www.infoq.com/news/2007/10/whatisscalability, October 2007.

[Twi10a] Twitter. A Perfect Storm...of Whales. http://engineering.twitter.com/2010/06/perfect-stormof-whales.html, June 2010.

[Twi10b] Twitter. Cassandra at Twitter Today. http://engineering.twitter.com/2010/07/cassandra-at-twitter-today.html, July 2010.

[Vog07a] W. Vogels. Amazon's Dynamo. http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html, October 2007.

[Vog07b] W. Vogels. Eventually Consistent. http://www.allthingsdistributed.com/2007/12/eventually_consistent.html, December 2007.
