
Social platform in Erlang
Lessons Learned

Alexey Kachayev, 2013


About me
• CTO at Attendify.com
• Erlang, Go, Clojure, Scala
• CPython & Twitter Storm contributor
• Author of fn.py library
• Hobbies: Haskell, Scheme, Racket,
CRDT, type systems, compilers
Contacts
•@kachayev
• github: kachayev
• kachayev@gmail.com
Will tell you
• project goals and challenges
• tech. stack that we use
• development
• testing, deployment, debugging
• problems & solutions
unstructured
content
too much information
at least 4 different talks
Will not tell
• why Erlang is cool
• how Erlang is cool
• why you should use Erlang
• 1 million concurrent users
• 1000+ nodes cluster
attendify.com
social platform
hundreds of mobile apps
“little facebook” inside each one
social platform
hundreds thousands
of mobile apps
social platform
“Yammer” for events
(at least technically)
tons of features
no, really!
tons of features
• profiles
• social network accounts
• chat
• posts
• photos
• photo albums
• timeline
• notifications
• likes
• permissions
• replies
• following
• tweets tracking
• activity streams
• checkins
• notes
• sharing
• search
• RSS news tracking
• blocking & ban
Special thanks
•push notifications
•multi-device synchronization
•offline support for a few features
Requirements
•high availability (HA)
•plug-in infrastructure
•RPC for thin clients
•stability & guarantees
•quick development
Technically
•10k++ mobile applications
•~ 2k profiles for each
•activity spikes (obviously)
•apps should work independently (*)
Prototype
“Delaware”
•prototype, not so many features
•~3 weeks of active development
•wrapper for CouchDB (in Erlang)
•biggest problem: push notifications
•serves ~75 mobile apps and is still running
“Delaware”
implementation
check SNS
integration
admin panel switch
Current
“Gomer”
•4 months of active development
•2 engineers
•10 repos
•1,178 commits
“Gomer”
•~46k LOC
• 53 “xxx” comments (incl. 3 “xxx!!!”)

•47 libraries (incl. 3 forks)


•117 public RPC methods
“Gomer”
•1,148 test cases
• make testall

•390 apps
•1,195 devices
•19,047 log messages
so?
•pretty big project
•very dynamic
•quickly growing in size & features
System design
Data
•graph-oriented (like Facebook)
•Riak for most data: nodes, links, streams
•etcd where strong consistency is required (Raft
consensus): settings, cluster structure
•in-memory ETS: cache, sync ordering
•pre-built data for reading
Graph
•nodes: id, rev, attrs, system flags
•links: from-id, to-id, type
•holds an essential part of the logic, e.g. a session
is a link from profile to device
•Facebook TAO model: fetching nodes and
simplest links-walking
•implemented as independent library
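To make the node/link model concrete, here is a minimal in-memory sketch in Python. It is not the project's Erlang library: the `Graph` class, its method names, and the `"profile:1"` / `"device:9"` ids are hypothetical, and a dict stands in for Riak. It only shows the two TAO-style operations the slide names: fetching a node and walking one link type.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    rev: int
    attrs: dict = field(default_factory=dict)
    flags: set = field(default_factory=set)  # system flags


class Graph:
    """In-memory stand-in for the node/link store (hypothetical API)."""

    def __init__(self):
        self.nodes = {}
        self.links = defaultdict(list)  # (from_id, link_type) -> [to_id, ...]

    def put_node(self, node):
        self.nodes[node.id] = node

    def add_link(self, from_id, to_id, link_type):
        self.links[(from_id, link_type)].append(to_id)

    def get_node(self, node_id):
        return self.nodes.get(node_id)

    def walk(self, from_id, link_type):
        """Simplest link-walking: follow one link type from one node."""
        targets = self.links[(from_id, link_type)]
        return [self.nodes[t] for t in targets if t in self.nodes]


g = Graph()
g.put_node(Node("profile:1", rev=1, attrs={"name": "Ann"}))
g.put_node(Node("device:9", rev=1, attrs={"os": "ios"}))
# A session is just a link from a profile to a device.
g.add_link("profile:1", "device:9", "session")
devices = g.walk("profile:1", "session")
```

Modeling a session as a link means "which devices is this profile logged in on" is an ordinary link walk rather than a separate lookup table.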
K-ordering
•revision control for each entity
•to ensure all client calls are idempotent
•k-ordering for cursor-based sync (**)
•flake library (snowflake-like)
•one more, riak_id
K-ordering
** the client tells the server the max revision
it has ever seen (a.k.a. cursor);
the server sends only changed data
(current rev > client max rev)
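The cursor rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the real protocol: `changes_since` and the dict-shaped entities are hypothetical, and revisions are plain integers that only need to be roughly time-ordered.

```python
def changes_since(entities, cursor):
    """Return entities with rev > cursor (oldest first) and the new cursor."""
    changed = sorted(
        (e for e in entities if e["rev"] > cursor),
        key=lambda e: e["rev"],
    )
    new_cursor = changed[-1]["rev"] if changed else cursor
    return changed, new_cursor


store = [
    {"id": "post:1", "rev": 101},
    {"id": "post:2", "rev": 105},
    {"id": "like:7", "rev": 109},
]

# The client has seen everything up to rev 101.
changed, cursor = changes_since(store, 101)
# Repeating the call with the returned cursor yields nothing new.
again, same_cursor = changes_since(store, cursor)
```

Because the client only advances its cursor after a successful response, a retried request returns the same data, which is what keeps the sync call idempotent.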
K-ordering
•github.com/twitter/snowflake (Scala)
•github.com/boundary/flake (Erlang) *
•github.com/seancribbs/riak_id (Erlang)
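A snowflake-like id packs a timestamp, a worker id, and a per-millisecond sequence into one integer, so ids sort roughly by creation time. The Python sketch below is only an illustration of that idea; the class name, the injected `clock`, and the 64/48/16-bit layout are assumptions for the example, not the API of any of the libraries listed above.

```python
import time


class FlakeLikeGenerator:
    """Roughly time-ordered ids: timestamp (ms) | worker id | sequence."""

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        self.worker_id = worker_id & 0xFFFFFFFFFFFF  # keep 48 bits (assumed width)
        self.clock = clock
        self.last_ms = -1
        self.seq = 0

    def next_id(self):
        now = self.clock()
        if now < self.last_ms:
            # Refuse to hand out ids that would sort before earlier ones.
            raise RuntimeError("clock moved backwards")
        if now == self.last_ms:
            self.seq = (self.seq + 1) & 0xFFFF  # 16-bit sequence (assumed width)
            if self.seq == 0:
                raise RuntimeError("sequence exhausted for this millisecond")
        else:
            self.last_ms, self.seq = now, 0
        return (now << 64) | (self.worker_id << 16) | self.seq


# Deterministic clock for the example: two ids in one millisecond, one in the next.
ticks = iter([1000, 1000, 1001])
gen = FlakeLikeGenerator(worker_id=7, clock=lambda: next(ticks))
ids = [gen.next_id() for _ in range(3)]
```

Ids from different workers are only k-ordered (ordered up to clock skew), which is enough for a "max revision seen" cursor but not for strict global ordering.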
Streams
•activitystrea.ms
•Actor, Action, Object, Target
•cases: timelines, activity streams,
chats, notification center
•linked lists
•cursor-based fetch
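A stream stored as a linked list with cursor-based fetch might look like the following Python sketch. The `Entry` and `Stream` names, the integer ids, and the in-memory links are assumptions for illustration; the talk does not show how the entries are actually laid out in Riak.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    id: int                           # k-ordered id, newer entries have larger ids
    actor: str
    action: str
    obj: str
    target: Optional[str] = None
    prev: Optional["Entry"] = None    # link to the next-older entry


class Stream:
    def __init__(self):
        self.head = None              # newest entry

    def push(self, entry):
        entry.prev = self.head
        self.head = entry

    def fetch(self, before=None, limit=2):
        """Walk newest to oldest, returning entries with id < before."""
        out, node = [], self.head
        while node is not None and len(out) < limit:
            if before is None or node.id < before:
                out.append(node)
            node = node.prev
        next_cursor = out[-1].id if out else before
        return out, next_cursor


timeline = Stream()
timeline.push(Entry(1, "ann", "post", "photo:5"))
timeline.push(Entry(2, "bob", "like", "photo:5", target="ann"))
timeline.push(Entry(3, "ann", "checkin", "venue:2"))

page1, cursor = timeline.fetch(limit=2)
page2, _ = timeline.fetch(before=cursor, limit=2)
```

The client keeps only the cursor between requests, so new entries pushed at the head never shift the pages it has already read.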
Streams
Offline support
•cases: follow-ups, notes, cleared
messages etc
•event-sourcing (both server & clients)
•LWW for conflicted rewrites
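Event sourcing with last-write-wins (LWW) conflict resolution can be sketched as a replay function. This Python example is a simplification under assumptions: the event shape (`ts`, `device`, `key`, `value`) and the use of the device id as a tie-breaker are hypothetical, not taken from the project.

```python
def apply_events(events):
    """Rebuild state by replaying events; the last write wins per key."""
    state = {}  # key -> (stamp, value)
    for ev in events:
        stamp = (ev["ts"], ev["device"])  # device id breaks timestamp ties
        current = state.get(ev["key"])
        if current is None or stamp > current[0]:
            state[ev["key"]] = (stamp, ev["value"])
    return {key: value for key, (_, value) in state.items()}


phone = [
    {"ts": 10, "device": "phone", "key": "note:1", "value": "draft"},
    {"ts": 30, "device": "phone", "key": "note:1", "value": "final"},
]
tablet = [
    {"ts": 20, "device": "tablet", "key": "note:1", "value": "edited offline"},
]

# The merge result does not depend on the order in which logs arrive.
merged = apply_events(phone + tablet)
merged_other_order = apply_events(tablet + phone)
```

LWW silently drops the losing write, which is acceptable for notes and follow-ups but would be a poor fit for data where both edits must survive.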
ETS
• use to avoid state copy in gen_server
• 2 approaches (use both):
• supervisor creates ETS and gives it to
child at start
• server creates ETS and fills it with data
on each gen_server:init
Lesson #1
graph oriented data is a good fit
(most) graph databases are
strange
Lesson #2
data modeling is hard
any kind of consistency is hard
Lesson #3
Erlang is good for async data
processing
Lesson #4
each mobile client is a part of
single distributed system
Processes v.1
started from “process per device”
•easy to start, client is an Actor
•not really HA
•bad fit for a cluster of a few nodes
•many problems with event routing
•reimplemented
Processes v.2
•riak_core vnodes ring
•riak_pipe vnodes ring
•supervisor for each app
•auth
•profiles ordering
•twitter reader
•rss reader
Lesson #5
obvious solution can be a bad fit
“fail fast” in your decisions
riak_core
•vnodes ring
•“service” and compatibility tracking
•consistent hashing for tasks routing
•handoffs
•join, leave, status, membership
•CLI admin interface
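Consistent hashing for task routing can be illustrated with a classic hash ring in Python. Note that this is the textbook variant with virtual nodes placed by hashing; riak_core itself uses a ring of fixed, equally sized partitions with its own claim logic, so treat `HashRing` and the node names as hypothetical.

```python
import bisect
import hashlib


def _hash(key):
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)


class HashRing:
    """Classic consistent hashing with virtual nodes (not riak_core's ring)."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node owns several points on the ring.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def node_for(self, key):
        # The first point clockwise from the key's hash owns the key.
        idx = bisect.bisect_right(self._keys, _hash(key)) % len(self._points)
        return self._points[idx][1]


ring3 = HashRing(["a", "b", "c"])
ring4 = HashRing(["a", "b", "c", "d"])

keys = [f"app:{i}" for i in range(1000)]
moved = [k for k in keys if ring3.node_for(k) != ring4.node_for(k)]
```

When node `d` joins, the only keys that change owner are the ones `d` takes over; everything else stays put, which is what makes handoff on join and leave affordable.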
riak_core
• few problems
• great facilities with no docs
• ... but easy to read whole source code
• thanks to the guys from Basho for their
advice
• waiting for 2.0 version
Lesson #6
riak_core is a good enough
reason for using Erlang
Testing
•2-phase: “unit” and “functional”
•eunit (built-in testing framework)
•etest library for functional tests
•functional tests in separate
modules
•don’t track coverage
Testing
•a lot of high-level helpers
•assert functions over JSON structure
• ?wait_for macro to test async
operations
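The idea behind a wait-for helper is to poll a condition until it holds or a timeout expires, instead of sleeping for a fixed time. The project's version is an Erlang macro; the Python function below is only an analogous sketch, with `wait_for`, its default timings, and the `inbox` example all being assumptions.

```python
import threading
import time


def wait_for(predicate, timeout=1.0, interval=0.01):
    """Poll predicate until it is truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result or time.monotonic() >= deadline:
            return result
        time.sleep(interval)


# Simulate an async push notification arriving a little later.
inbox = []
threading.Timer(0.05, lambda: inbox.append({"type": "push", "badge": 1})).start()

delivered = wait_for(lambda: len(inbox) == 1)
# A condition that never becomes true returns a falsy value after the timeout.
timed_out = wait_for(lambda: len(inbox) == 2, timeout=0.05)
```

Polling keeps the happy path fast (the test continues as soon as the condition holds) while the timeout bounds how long a failing test can hang.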
Mocks
•mocking: external HTTP endpoints, IP
detectors
•meck library: creating modules, history API
•good enough
•strange “random” problems after
recompilation
Lesson #7
•test coverage is a key factor for really
quick development
•concentrate on “negative” cases
•it’s easy to turn this process into fun
Lesson #8
•a good type system matters
•too many tests to check input values
•too many tests to check formatting
•too many tests to check protocols
Cluster
•you need to prepare tests for a multi-node
system
•(only) then start working on
distribution
•riak_test
•property testing: PropEr
•... both are great, but hard to adopt
Cluster
• make devrel to run 3+ nodes

•it’s fun too


Cluster testing
Lesson #9
•it’s hard to do everything right on the
first try
•it’s impossible to do it on the first try?
•it’s impossible to do it at all?
•more experiments!
Events
•a lot of async operations
•e.g. like → save in DB → update timeline
entry → publish activity stream entry →
add notification → send to device
•started with RabbitMQ and exchanges
for each event type (easy to start)
•reimplemented
Events 2
•2 types: bound & unbound
•bound: known number of subscribers
•e.g. “like”
•converting to “active coordinator”: FSM
under appropriate supervisor
•sourcing for fault-tolerance
Events 3
•unbound cases:
•ban profile → remove all content
•update timeline → send push to all
subscribed devices
•use riak_pipe
riak_pipe
•part of Riak internals
• map/reduce flavored with unix pipes
•declarative fittings
•custom routing
•back-pressure control
•logging and tracing
•handoffs
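Back-pressure means a fast producer is slowed down to the pace of the consumer instead of filling memory with queued work. The Python sketch below shows the general idea with a bounded queue; it is not how riak_pipe implements it, and `run_pipeline`, the `capacity` value, and the sentinel are assumptions for the example.

```python
import queue
import threading

_DONE = object()  # sentinel telling the consumer to stop


def run_pipeline(items, worker, capacity=2):
    """Feed items to a single consumer through a bounded queue."""
    q = queue.Queue(maxsize=capacity)
    results = []

    def consume():
        while True:
            item = q.get()
            if item is _DONE:
                break
            results.append(worker(item))

    consumer = threading.Thread(target=consume)
    consumer.start()
    for item in items:
        # put() blocks while the queue is full: this is the back-pressure.
        q.put(item)
    q.put(_DONE)
    consumer.join()
    return results


doubled = run_pipeline(range(10), lambda x: x * 2)
```

With an unbounded queue the producer would never block, and an activity spike would turn into unbounded memory growth instead of a slower producer.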
riak_pipe
working on workshop
github.com/kachayev/riak-pipe-workshop
Lesson #10
there is no such thing as “exactly
once delivery”
back-pressure control is essential
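Since delivery is at best at-least-once, handlers have to tolerate seeing the same event twice. One common way is to deduplicate by event id, sketched below in Python. The `IdempotentHandler` class and the event shape are hypothetical, and a real system would need to persist and expire the set of seen ids rather than keep it in memory.

```python
class IdempotentHandler:
    """Wrap a handler so that redelivered events are applied only once."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = set()  # in-memory only for this sketch

    def handle(self, event):
        if event["id"] in self.seen:
            return False  # duplicate delivery, ignored
        self.handler(event)
        # Mark as seen only after the handler succeeds, so a crash leads to a retry.
        self.seen.add(event["id"])
        return True


notifications = []
handler = IdempotentHandler(lambda e: notifications.append(e["text"]))

deliveries = [
    {"id": 1, "text": "Ann liked your photo"},
    {"id": 1, "text": "Ann liked your photo"},  # redelivered after a timeout
    {"id": 2, "text": "Bob replied"},
]
applied = [handler.handle(e) for e in deliveries]
```

The same reasoning is behind making client calls idempotent through revisions: a retry must be safe because the sender can never be sure the first attempt was lost.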
Meta programming
• it matters!
• cases: RPC definitions, permissions etc
•-define(MACRO, ...)
•... great, but sometimes inconvenient
•parse_transform
•... great, but hard to develop & support
•Elixir? no, thanks
Documentation
•it matters!
•public API description,
at least
•our solution: parser for
test logs (in python)
Documentation
•... external tool not so easy to
support
•edoc ?
• parse_transform ? i.e. -doc()
Lesson #11
meta programming matters
documentation matters
Deployment
• don’t use hot swapping for releases
• reltool to prepare package(s)
• run_erl to run VM as a daemon
• shell script for common operations:
start, stop, restart, attach
• shell script for cluster operations
(wrapper for node calls): join, leave,
status (ring & members)
Deployment
• rebar generate to /opt/gomer/<version>/*
• shared directory for compiled deps:
much faster get-deps & compile
• zip and store on S3
• download from S3, unzip, relink
• fabric (Python) for automation
Lesson #12
we still don’t know the best
way to deploy an application across
the cluster
Lesson #13
Another to_erl process
already attached to
pipe
Lesson #14
there is a big difference between
^C (stop VM) and ^D (quit)
Debugging
•a lot of log messages
•papertrailapp.com for all concerned
•dbg on live server
•a few helpers of our own for the most common cases
•“trace_off” on timeout
Debugging
•erlang.org/doc/man/dbg.html
•github.com/ferd/recon
•erlang.org/doc/man/os_mon_app.html
Lesson #15
there are a few features in Erlang
that you really-really miss when
using other technologies
The team
•2 engineers
•“2 weeks” to start writing production
code
•ha: first feature on the second day *
•* first day: tripped up by Mac OS
Lesson #16
Erlang is a good technology for
hiring good engineers
Thanks to

•guys from Wooga


•guys from Yammer
•guys from Basho
Libraries
• github.com/eproxus/meck

• github.com/uwiger/gproc

• github.com/wooga/etest

• github.com/wooga/etest_http

• github.com/basho/riak_kv

• github.com/basho/riak_core

• github.com/basho/riak_pipe

• github.com/basho/lager

• github.com/marccampbell/flake

• github.com/gleber/erlcloud
Questions #1
Performance
• ~20-25ms for most responses
• 100+ connections without any impact
• faster than Python & Ruby
• not as fast as Scala, Clojure and Go
• ... but do you really care?
Questions #2
Candidates
• Erlang (our choice)
• Scala (jvm)
• Clojure (jvm)
• Python (bad fit)
• Go (project too large)
• Haskell (bad fit)
• Java (oh, come on...)
Questions #3
IDE?

•Emacs
•VIM
Notes #1
•we use Go and Clojure for other systems
•do you want to ask “Why”?
•we are still at an early production stage
•wait for new lessons coming soon
Ideas?
Questions?

Alexey Kachayev, 2013