
Cassandra: 0-60

Jonathan Ellis / @spyced


Keyspaces & ColumnFamilies
● Conceptually, like “schemas” and “tables”
● Inside CFs, columns are dynamic
● Twitter: “Fifteen months ago, it took two
weeks to perform ALTER TABLE on the
statuses [tweets] table.”
ColumnFamilies

Columns
“static” CFs vs “dynamic”
Inserting
● Really “insert or update”
● As much of the row as you want
(remember sstable merge-on-read)
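A toy model of "insert or update" with merge-on-read, in plain Python. This is a sketch, not Cassandra's implementation: `write`, `read_row`, and the list-of-dicts `sstables` are hypothetical names used only for illustration.

```python
# Toy model: every write lands in its own "sstable" fragment;
# a read merges the fragments and keeps the newest value per column.

def write(sstables, key, columns, timestamp):
    """Record a partial row. No read of existing data is needed."""
    fragment = {name: (value, timestamp) for name, value in columns.items()}
    sstables.append({key: fragment})

def read_row(sstables, key):
    """Merge all fragments for `key`; the newest timestamp wins per column."""
    merged = {}
    for table in sstables:
        for name, (value, ts) in table.get(key, {}).items():
            if name not in merged or ts > merged[name][1]:
                merged[name] = (value, ts)
    return {name: value for name, (value, _) in merged.items()}

sstables = []
write(sstables, 'u1', {'username': 'ericflo', 'password': 'old'}, timestamp=1)
write(sstables, 'u1', {'password': 'new'}, timestamp=2)  # touches one column only
row = read_row(sstables, 'u1')
```

The second write never looks at the first; the two are reconciled only when the row is read.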
Column indexes
● Name vs range filters
● “reversed=true”
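The two filter styles can be sketched over an in-memory row. This assumes a row is just a dict whose column names sort naturally; `get_by_names` and `get_slice` are hypothetical helpers, not a client API.

```python
from bisect import bisect_left, bisect_right

def get_by_names(row, names):
    """Name filter: fetch exactly the listed columns, if present."""
    return [(n, row[n]) for n in names if n in row]

def get_slice(row, start=None, finish=None, count=100, reverse=False):
    """Range filter: fetch a contiguous run of columns in sorted order."""
    names = sorted(row)  # columns are kept sorted by name
    if reverse:
        # When reversed, `start` is the upper bound and `finish` the lower.
        start, finish = finish, start
    lo = 0 if start is None else bisect_left(names, start)
    hi = len(names) if finish is None else bisect_right(names, finish)
    selected = names[lo:hi]
    if reverse:
        selected = selected[::-1]
    return [(n, row[n]) for n in selected[:count]]

row = {10: 'a', 20: 'b', 30: 'c', 40: 'd'}
newest_two = get_slice(row, count=2, reverse=True)
```

A reversed slice with a small count is the usual way to ask for "the latest N" when column names are timestamps.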
Denormalization
● Whiteboard: Turn long, skinny tables into long rows
● Reduces I/O and CPU needed to perform a read
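A small sketch of the transformation, assuming a relational "following" table held as a list of (user, followed) pairs. The names `following_table` and `to_wide_rows` are made up for the example.

```python
from collections import defaultdict

# Long, skinny table: one (user, followed) pair per record.
following_table = [
    ('alice', 'bob'),
    ('alice', 'carol'),
    ('bob', 'carol'),
]

def to_wide_rows(pairs):
    """Build one wide row per user, in both directions."""
    friends = defaultdict(dict)    # user -> {followed: ''}
    followers = defaultdict(dict)  # followed -> {user: ''}
    for user, followed in pairs:
        friends[user][followed] = ''
        followers[followed][user] = ''
    return dict(friends), dict(followers)

friends, followers = to_wide_rows(following_table)
```

Each question ("whom does alice follow?", "who follows carol?") becomes a single row lookup instead of a scan or join.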
Example: twissandra
● http://twissandra.com
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    username VARCHAR(64),
    password VARCHAR(64)
);

CREATE TABLE following (
    user INTEGER REFERENCES users(id),
    followed INTEGER REFERENCES users(id)
);

CREATE TABLE tweets (
    id INTEGER,
    user INTEGER REFERENCES users(id),
    body VARCHAR(140),
    timestamp TIMESTAMP
);
Cassandrified
<Keyspaces>
<Keyspace Name="Twissandra">
<ColumnFamily CompareWith="UTF8Type" Name="User"/>
<ColumnFamily CompareWith="BytesType" Name="Username"/>
<ColumnFamily CompareWith="BytesType" Name="Friends"/>
<ColumnFamily CompareWith="BytesType" Name="Followers"/>
<ColumnFamily CompareWith="UTF8Type" Name="Tweet"/>
<ColumnFamily CompareWith="LongType" Name="Userline"/>
<ColumnFamily CompareWith="LongType" Name="Timeline"/>
</Keyspace>
</Keyspaces>
Connecting
CLIENT = pycassa.connect_thread_local()

USER = pycassa.ColumnFamily(CLIENT, 'Twissandra', 'User',
                            dict_class=OrderedDict)
Users
'a4a70900-24e1-11df-8924-001ff3591711': {
'id': 'a4a70900-24e1-11df-8924-001ff3591711',
'username': 'ericflo',
'password': '****',
}

import uuid

username = 'jericevans'
password = '**********'
useruuid = str(uuid.uuid1())  # version 1 UUID, like the sample key above
columns = {'id': useruuid, 'username': username,
           'password': password}
USER.insert(useruuid, columns)
Natural keys vs surrogate
Friends and Followers
'a4a70900-24e1-11df-8924-001ff3591711': {
# friend id: timestamp when the friendship was added
'10cf667c-24e2-11df-8924-...': '1267413962580791',
'343d5db2-24e2-11df-8924-...': '1267413990076949',
'3f22b5f6-24e2-11df-8924-...': '1267414008133277',
}

frienduuid = 'a4a70900-24e1-11df-8924-001ff3591711'
now = str(long(time.time() * 1e6))  # microseconds as a string, as in the sample row
FRIENDS.insert(useruuid, {frienduuid: now})
FOLLOWERS.insert(frienduuid, {useruuid: now})
Your row is your index
● Long, skinny table vs short, fat ColumnFamily
Tweets
'7561a442-24e2-11df-8924-001ff3591711': {
'id': '89da3178-24e2-11df-8924-001ff3591711',
'user_id': 'a4a70900-24e1-11df-8924-001ff3591711',
'body': 'Trying out Twissandra. This is awesome!',
'_ts': '1267414173047880',
}
Userline
'a4a70900-24e1-11df-8924-001ff3591711': {
# timestamp of tweet: tweet id
1267414247561777: '7561a442-24e2-11df-8924-...',
1267414277402340: 'f0c8d718-24e2-11df-8924-...',
1267414305866969: 'f9e6d804-24e2-11df-8924-...',
1267414319522925: '02ccb5ec-24e3-11df-8924-...',
}
Timeline
'a4a70900-24e1-11df-8924-001ff3591711': {
# timestamp of tweet: tweet id
1267414247561777: '7561a442-24e2-11df-8924-...',
1267414277402340: 'f0c8d718-24e2-11df-8924-...',
1267414305866969: 'f9e6d804-24e2-11df-8924-...',
1267414319522925: '02ccb5ec-24e3-11df-8924-...',
}
Adding a tweet
tweetuuid = str(uuid.uuid1())
body = '@ericflo thanks for Twissandra, it helps!'
timestamp = long(time.time() * 1e6)
columns = {'id': tweetuuid, 'user_id': useruuid,
           'body': body, '_ts': str(timestamp)}
TWEET.insert(tweetuuid, columns)

columns = {struct.pack('>d', timestamp): tweetuuid}
USERLINE.insert(useruuid, columns)
TIMELINE.insert(useruuid, columns)
# Fan out to the timelines of (up to 5000) followers
for otheruuid in FOLLOWERS.get(useruuid, column_count=5000):
    TIMELINE.insert(otheruuid, columns)
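The same fan-out-on-write pattern, simulated with plain dicts so it runs without a cluster. This is a sketch under assumptions: the module-level dicts stand in for the ColumnFamilies, and `save_tweet` is a hypothetical helper, not Twissandra's own function.

```python
import time
import uuid
from collections import defaultdict

# In-memory stand-ins for the ColumnFamilies (assumption, for illustration).
TWEET = {}
USERLINE = defaultdict(dict)
TIMELINE = defaultdict(dict)
FOLLOWERS = defaultdict(dict)

def save_tweet(useruuid, body, timestamp=None):
    """Write the tweet once, then one pointer per timeline that should show it."""
    tweetuuid = str(uuid.uuid1())
    if timestamp is None:
        timestamp = int(time.time() * 1e6)  # microseconds
    TWEET[tweetuuid] = {'id': tweetuuid, 'user_id': useruuid,
                        'body': body, '_ts': timestamp}
    USERLINE[useruuid][timestamp] = tweetuuid
    TIMELINE[useruuid][timestamp] = tweetuuid
    for otheruuid in FOLLOWERS[useruuid]:
        TIMELINE[otheruuid][timestamp] = tweetuuid
    return tweetuuid

FOLLOWERS['ericflo'] = {'alice': '', 'bob': ''}
tweet_id = save_tweet('ericflo', 'hello', timestamp=1)
```

The write cost grows with the number of followers; in exchange, reading any timeline is a single row slice.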
Reads
timeline = USERLINE.get(useruuid, column_reversed=True)
tweets = TWEET.multiget(timeline.values())

start = request.GET.get('start')
limit = NUM_PER_PAGE
timeline = TIMELINE.get(useruuid, column_start=start,
column_count=limit, column_reversed=True)
tweets = TWEET.multiget(timeline.values())
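One way to page through a reversed timeline, sketched over a dict. It assumes the start column is inclusive and fetches one extra column to learn where the next page begins; `get_page` is a hypothetical helper.

```python
def get_page(row, start=None, limit=3):
    """Return (page, next_start): newest-first columns plus the next page's start."""
    names = sorted(row, reverse=True)
    if start is not None:
        names = [n for n in names if n <= start]  # start is inclusive
    # Fetch one extra column: it becomes the start of the next page.
    window = names[:limit + 1]
    next_start = window[limit] if len(window) > limit else None
    return [(n, row[n]) for n in window[:limit]], next_start

timeline = {ts: 'tweet-%d' % ts for ts in range(1, 8)}
page1, next1 = get_page(timeline)
page2, next2 = get_page(timeline, start=next1)
page3, next3 = get_page(timeline, start=next2)
```

A `next_start` of `None` signals that the last page has been reached.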
I can has smarter clients?
● Shouldn't need to pack('>d', int): Cassandra provides describe_keyspace, so this can be introspected
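Why packing matters at all: column names are compared as bytes, so the encoding has to sort the way the numbers do. The sketch below uses `'>q'` (big-endian signed 64-bit integer) as an assumption about what an 8-byte long comparator expects; the slides' own code uses `'>d'`. `pack_long` and `unpack_long` are hypothetical helpers.

```python
import struct

def pack_long(n):
    """Encode an integer as 8 big-endian bytes."""
    return struct.pack('>q', n)

def unpack_long(b):
    """Decode 8 big-endian bytes back to an integer."""
    return struct.unpack('>q', b)[0]

timestamps = [1267414305866969, 1267414247561777, 1267414319522925]
packed = [pack_long(t) for t in timestamps]

# For non-negative values, byte order of the packed form matches numeric order.
packed_sorted = sorted(packed)

# Decimal strings do not have that property: '10' sorts before '9'.
as_text = sorted(str(n) for n in [9, 10])
```

A client that introspects the comparator could apply this encoding itself, so application code would pass plain integers.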
Raw thrift API: Connecting
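from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra  # Thrift-generated bindings

def get_client(host='127.0.0.1', port=9160):  # 9160 is the default Thrift port
    socket = TSocket.TSocket(host, port)
    transport = TTransport.TBufferedTransport(socket)
    transport.open()
    protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport)
    client = Cassandra.Client(protocol)
    return client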
Raw thrift API: Inserting
data = {'id': useruuid, ...}
timestamp = long(time.time() * 1e6)  # 64-bit integer; microseconds by convention
columns = [Column(k, v, timestamp)
           for (k, v) in data.items()]
mutations = [Mutation(ColumnOrSuperColumn(column=c))
             for c in columns]
rows = {useruuid: {'User': mutations}}
client.batch_mutate('Twissandra', rows,
                    ConsistencyLevel.ONE)
Raw thrift API: Fetching
● get, get_slice, get_count, multiget_slice,
get_range_slices
● ColumnOrSuperColumn
● http://wiki.apache.org/cassandra/API
Running twissandra
● cd twissandra
● python manage.py runserver
● Navigate to http://127.0.0.1:8000
Pycassa cheat sheet
● get(key, …)
● multiget(key_list)
● get_range(...)
● insert(key, columns_dict)
● remove(key, ...)
Exercise
● python manage.py shell
● import cass
● help(cass.TWEET.remove)
● Delete the most recent tweet by user foo
Exercise
● Open cass.py
● Finish save_retweet
Language support
● Python
● Scala
● Ruby
● Speed is a negative
● Java
PHP [thrift] tickets
● https://issues.apache.org/jira/browse/THRIFT-347
● https://issues.apache.org/jira/browse/THRIFT-638
● https://issues.apache.org/jira/browse/THRIFT-780
● https://issues.apache.org/jira/browse/THRIFT-788
Done yet?
● Still doing 1+N queries per page
SuperColumns

SuperColumns
Applying SuperColumns to
Twissandra
ColumnParent
Supercolumns: limitations
UUIDs
● Column names should be uuids, not
longs, to avoid collisions
● Version 1 UUIDs can be sorted by time
(“TimeUUID”)
● Any UUID can be sorted by its raw bytes
(“LexicalUUID”)
● Usually Version 4
● Slightly less overhead
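A quick look at both orderings using Python's standard `uuid` module. The helper `uuid1_to_unix_seconds` is a hypothetical name; the epoch offset is the number of 100 ns intervals between 1582-10-15 (the UUID epoch) and 1970-01-01.

```python
import uuid

# 100 ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01b21dd213814000

def uuid1_to_unix_seconds(u):
    """Recover the wall-clock time embedded in a version 1 UUID."""
    return (u.time - UUID_EPOCH_OFFSET) / 1e7

ids = [uuid.uuid1() for _ in range(5)]

# "TimeUUID"-style ordering: compare the embedded 60-bit timestamp.
by_time = sorted(ids, key=lambda u: u.time)

# "LexicalUUID"-style ordering: compare the raw 16 bytes.
by_bytes = sorted(ids, key=lambda u: u.bytes)

# Version 4 UUIDs are random and carry no timestamp.
random_id = uuid.uuid4()
```

The two orderings generally differ for version 1 UUIDs because the low bits of the timestamp come first in the byte layout.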
0.7: secondary indexes

● Obviate need for Userline (but not Timeline)
Lucandra
● What documents contain term X?
● … and term Y?
● … or start with Z?
Lucandra ColumnFamilies
<ColumnFamily Name="TermInfo"
CompareWith="BytesType"
ColumnType="Super"
CompareSubcolumnsWith="BytesType"
KeysCached="10%" />
<ColumnFamily Name="Documents"
CompareWith="BytesType"
KeysCached="10%" />
Lucandra data
Term Key col name value
"field/term" => { documentId , position vector }

Document Key
"documentId" => { fieldName , value }
Lucandra queries
● get_slice
● get_range_slices
● No silver bullet
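The Lucandra layout and its three kinds of query can be imitated with dicts. This is a sketch of the idea only: `term_info`, `documents`, and the helper functions are hypothetical, whitespace tokenizing is a simplification, and a sorted scan stands in for a key range query.

```python
from collections import defaultdict

term_info = defaultdict(dict)  # "field/term" -> {documentId: [positions]}
documents = {}                 # documentId -> {fieldName: value}

def index_document(doc_id, field, text):
    """Store the document and add each term's positions to the inverted index."""
    documents.setdefault(doc_id, {})[field] = text
    for position, term in enumerate(text.lower().split()):
        key = '%s/%s' % (field, term)
        term_info[key].setdefault(doc_id, []).append(position)

def docs_with(field, term):
    """What documents contain the term? One row lookup."""
    return set(term_info.get('%s/%s' % (field, term), {}))

def docs_with_prefix(field, prefix):
    """What documents have a term starting with the prefix? A range of rows."""
    start = '%s/%s' % (field, prefix)
    hits = set()
    for key in sorted(term_info):
        if key.startswith(start):
            hits.update(term_info[key])
    return hits

index_document('d1', 'body', 'cassandra is fast')
index_document('d2', 'body', 'cassandra scales')
index_document('d3', 'body', 'fast cars')
both = docs_with('body', 'cassandra') & docs_with('body', 'fast')
```

"X and Y" is a set intersection done by the client, which is part of why there is no silver bullet.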
FAQ: counting
● UUIDs + batch process
● Mutex (contrib/mutex or “cages”)
● Use redis or mysql or memcached
● 0.7: vector clocks
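One reading of "UUIDs + batch process", sketched with dicts: record each event as its own uniquely named column so concurrent writers never collide, then count columns in a periodic job. All names here (`events`, `totals`, `record`, `batch_count`) are hypothetical.

```python
import uuid
from collections import defaultdict

events = defaultdict(dict)  # counter name -> {event uuid: ''}
totals = {}                 # counter name -> count from the last batch run

def record(counter):
    """Log one event under a unique column name. No read-modify-write."""
    events[counter][str(uuid.uuid4())] = ''

def batch_count():
    """Periodic job: derive totals by counting the columns in each row."""
    for counter, columns in events.items():
        totals[counter] = len(columns)

for _ in range(3):
    record('page_views')
record('signups')
batch_count()
```

Totals are only as fresh as the last batch run, which is the trade-off against a mutex or an external counter store.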
Tips
● Insert instead of check-then-insert
● Bulk delete with 'forged' timestamps
● In 0.7: use ttl instead
as notroot/notroot:
git clone http://github.com/ericflo/twissandra.git

as root/riptano:
apt-get update
apt-get install python-setuptools
apt-get install python-django
easy_install -U thrift
rm -r /var/lib/cassandra/*
cp twissandra/storage-conf.xml /etc/cassandra
edit /etc/cassandra/log4j.properties to DEBUG
/etc/init.d/cassandra start
tail -f /var/log/cassandra/system.log
as notroot:
find templates |xargs grep empty
# r/m the {empty} blocks
python manage.py runserver
