
12-Step Program for Scaling Web Applications on PostgreSQL
Konstantin Gredeskoul
CTO, Wanelo.com




@kig
@kigster

What does it mean to scale on top of PostgreSQL?

And why should you care?


Scaling means supporting more workload concurrently, where "work" is often interchangeable with "users"

But why on PostgreSQL?
Because NoNoSQL is hawt! (again)

Relational databases are great at supporting constant change in software

They are not as great at auto-scaling, like Riak or Cassandra

So the choice critically depends on what you are trying to build

The vast majority of applications are represented well by the relational model

So if I need to build a new product or service, my default choice would be PostgreSQL for critical data, plus whatever else as needed

This presentation is a walk-through filled with practical solutions

It's based on the story of scaling wanelo.com to sustain tens of thousands of concurrent users and 3K req/sec

But first, let's explore the application to learn a bit about Wanelo before starting our scalability journey

Founded in 2010, Wanelo (wah-nee-loh, from Want, Need, Love) is a community and a social network for all of the world's shopping.

Wanelo is home to 12M products, millions of users, 200K+ stores, and products on Wanelo have been saved into collections over 2B times

Early on we wanted to:

move fast with product development


scale as needed, stay ahead of the curve
keep overall costs low
but spend where it matters
automate everything
avoid reinventing the wheel
learn as we go
remain in control of our infrastructure

Heroku or Not?

Assuming we want full control of our application layer, places like Heroku aren't a great fit

But Heroku can be a great place to start. It all depends on the size and complexity of the app we are building.

Ours would have been cost-prohibitive.

Foundations of web apps

programming language + framework (RoR)


app server (we use unicorn)
scalable web server in front (we use nginx)
database (we use postgresql)
hosting environment (we use Joyent Cloud)
deployment tools (capistrano)
server configuration tools (we use chef)
many others, such as monitoring, alerting

Let's review a basic web app


incoming http → nginx → N x Unicorn / Passenger (Ruby VM), serving /home/user/app/current/public → PostgreSQL server (/var/pgsql/data)

no redundancy, no caching (yet)

can only process N concurrent requests

nginx will serve static assets, deal with slow clients

web sessions probably in the DB or cookie



First optimizations:
cheap early on, well worth it

Personalization via AJAX, so controller actions can be cached entirely using caches_action

Page returned unpersonalized; an additional AJAX request loads personalization
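A minimal sketch of this pattern, assuming a hypothetical ProductsController and saved_product_ids helper (neither is from the deck); the cached action serves the shared page, and a tiny uncached endpoint returns the per-user bits via AJAX:

class ProductsController < ApplicationController
  # Cache the fully rendered, unpersonalized page
  # (caches_action is built into Rails 3, or via the actionpack-action_caching gem in Rails 4+)
  caches_action :show, expires_in: 10.minutes

  def show
    @product = Product.find(params[:id])
  end

  # Small uncached endpoint the page calls via AJAX to personalize itself
  def personalization
    render json: { saved: current_user.saved_product_ids.include?(params[:id].to_i) }
  end
end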


A few more basic performance tweaks that go a long way

Install 2+ memcached servers for caching and use the Dalli gem to connect to them for redundancy

Switch to memcached-based web sessions. Use sessions sparingly, assume they are transient

Redis is also an option, but I prefer memcached for redundancy

Set up a CDN for asset_host and any user-generated content. We use fastly.com
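A hedged sketch of what those tweaks look like in a Rails app; the hostnames and CDN URL are placeholders:

# config/environments/production.rb
# Two memcached servers via Dalli; keys are spread across both,
# and with failover enabled one of them can go away.
config.cache_store = :mem_cache_store, 'memcached-1:11211', 'memcached-2:11211',
                     { failover: true, expires_in: 1.hour }

# Sessions ride on the same memcached-backed Rails cache (transient by design)
config.session_store :cache_store, key: '_app_session', expire_after: 2.weeks

# Serve assets (and user-generated content) from a CDN
config.action_controller.asset_host = 'https://cdn.example.com'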


Caching goes a long way


browser → CDN (cache images, JS) / nginx → N x Unicorn / Passenger (Ruby VM) → memcached + PostgreSQL server (/home/user/app/current/public)

geo-distribute and cache your UGC and CSS/JS assets

cache HTML and serialized objects in memcached

can increase TTL to alleviate load if traffic spikes

Adding basic redundancy

Multiple load balancers require DNS round-robin and a short TTL (dyn.com)

Multiple app servers require haproxy between nginx and unicorn

Multiple long-running tasks (such as posting to Facebook or Twitter) require a background job processing framework

incoming http → DNS round-robin or a failover / HA solution → Load balancers (nginx + haproxy) → App servers (Unicorn / Passenger, Ruby VM x N) → Data stores, transient to permanent: memcached, redis, a single PostgreSQL DB; plus background workers (Sidekiq / Resque), a CDN (cache images, JS) and an object store for user-generated content

this architecture can horizontally scale as far as the database at its center allows

every other component can be scaled by adding more of it, to handle more traffic

As long as we can scale the data store on the backend, we can scale the app!

Mostly :)

At some point we may hit a limit on TCP/IP network throughput or the number of connections, but that is at a whole other level of scale

The traffic keeps climbing

Performance limits are near

First signs of performance problems start creeping up

Symptoms of read scalability problems
Pages load slowly or time out
Some pages load (cached?), some don't
Users are getting 503 Service Unavailable
Database is slammed (very high CPU or read IO)

Symptoms of write scalability problems
Database write IO is maxed out, CPU is not
Update operations are waiting on each other, piling up
Application locks up, timeouts
Replicas are not catching up

Both situations may easily result in downtime

Even though we achieved 99.99% uptime in 2013, in 2014 we had a couple of short downtimes, caused by an overloaded replica, that lasted around 5 minutes.

But users quickly notice

Perhaps not :)


12-Step Program
for curing your dependency on slow application latency

Common patterns for scaling high-traffic web applications, based on wanelo.com

What's a good latency?

If your app is high-traffic (100K+ RPM), I recommend 80ms or lower

For small / fast HTTP services, 10-12ms or lower

CPU burn vs Waiting on IO?

Web services + Solr (25ms), memcached (15ms), database (6ms) are all waiting on IO

Ruby VM (30ms) + garbage collection (6ms) is CPU burn, easy to scale by adding more app servers

Step 1:
Add More Cache!


Moar Cache!!!

Cache is cheap and fast (memcached)


Anything that can be cached, should be
Cache hit = many database hits avoided
Hit rate of 17% still saves DB hits
We can cache many types of things

Cache many types of things

caches_action in controllers is very effective

fragment caches of reusable widgets

we use the Compositor gem for our JSON API. We cache serialized object fragments, grab them from memcached using multi_get and merge them

Shopify open-sourced IdentityCache, which caches AR models, so you can Product.fetch(id)

https://github.com/wanelo/compositor
https://github.com/Shopify/identity_cache
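A hedged sketch of the multi-get-and-merge idea, not Compositor's actual API; the cache keys and ProductSerializer are illustrative:

# Fetch pre-serialized JSON fragments for a set of products in one memcached
# round trip, fill in any misses, and return a merged id => fragment hash.
def product_fragments(product_ids)
  keys  = product_ids.index_by { |id| "product/#{id}/v1" }
  found = Rails.cache.read_multi(*keys.keys)

  keys.each_with_object({}) do |(key, id), merged|
    merged[id] = found[key] ||
                 Rails.cache.fetch(key) { ProductSerializer.new(Product.find(id)).as_json }
  end
end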

But Caching has its issues

Expiring cache is not easy


CacheSweepers in Rails help

We found ourselves doing 4,000 memcached deletes in a single request!

Could defer expiring caches to background jobs (see the sketch below), or use TTL where possible

But we can cache even outside of our app:
we cache JSON API responses using a CDN (fastly.com)
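A minimal sketch of pushing cache expiry into a background job, assuming Sidekiq; the key list is whatever the sweeper would have deleted inline:

class CacheExpiryJob
  include Sidekiq::Worker
  sidekiq_options queue: :low

  # Delete a batch of cache keys outside of the request cycle, so a single
  # user action never waits on thousands of memcached deletes.
  def perform(keys)
    keys.each { |key| Rails.cache.delete(key) }
  end
end

# In the model callback / sweeper:
# CacheExpiryJob.perform_async(keys_to_expire)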


Step 2:
Optimize SQL


SQL Optimization

Find slow SQL (>100ms) and either remove it, cache the hell out of it, or fix/rewrite the query

Enable the slow query log in postgresql.conf:

log_min_duration_statement = 100   # log statements slower than 100ms

pg_stat_statements is an invaluable contrib module:

Fixing Slow Query

Run an explain plan to understand how the DB executes the query

Are there adequate indexes for the query? Is the database using the appropriate index? Has the table been recently analyzed?

Can a complex join be simplified into a subselect?

Can this query use an index-only scan?

Can the ORDER BY column be added to the index?

pg_stat_user_indexes and pg_stat_user_tables for seq scans, unused indexes, cache info

SQL Optimization, ctd.

Instrumentation software such as NewRelic shows slow queries, with explain plans, and time-consuming transactions

SQL Optimization: Example


One day, I noticed lots of temp files created in the postgres.log

Let's run this query

This join takes a whole second to return :(

Follows table

Stories table

So our index is partial, only on state = 'active'

But state isn't used in the query. A bug?
So this query is a full table scan

Let's add state = 'active' to the query
It was meant to be there anyway
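In Rails terms the fix looks roughly like this (model and column names follow the slides; the scope itself is illustrative). With the state predicate back in the query, the partial index can be used instead of a full table scan:

class Follow < ActiveRecord::Base
  # Matches the partial index, which only covers rows WHERE state = 'active'
  scope :active, -> { where(state: 'active') }
end

# Before: no state predicate, partial index unusable => sequential scan
#   Follow.where(user_id: user.id)
# After: predicate matches the index condition => index scan
#   Follow.active.where(user_id: user.id)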


Step 3:
Upgrade Hardware and RAM


Hardware + RAM

Sounds obvious, but better or faster hardware is an easy win when scaling out

Large RAM will be used as file system cache

On Joyent's SmartOS, the ARC file system cache is very effective

shared_buffers should be set to 25% of RAM or 12GB, whichever is smaller

Using a fast SSD disk array can make a huge difference

Joyent's native 16-disk RAID managed by ZFS, instead of a hardware controller, provides excellent performance


Hardware in the cloud

SSD offerings from Joyent and AWS

Joyent's max SSD node: $12.90/hr
AWS max SSD node: $6.80/hr

So who's better?

AWS: 8 SSD drives, SSD make: ?, CPU: E5-2670 2.6GHz
Joyent: 16 SSD drives (RAID10 + 2), SSD make: DCS3700, CPU: E5-2690 2.9GHz

Perhaps you get what you pay for after all.

Step 4:
Scale Reads by Replication


Scale Reads by Replication

postgresql.conf (both master & replica)

These settings have been tuned for SmartOS and our application requirements (thanks PGExperts!)

How to distribute reads?

Some people have success using this setup for reads:
app → haproxy → pgBouncer (x2) → replica (x2)

I'd like to try this method eventually, but we chose to deal with distributing read traffic at the application level

We tried many Ruby-based solutions that claimed to do this well, but many weren't production-ready

Makara is a Ruby gem from TaskRabbit that we ported from MySQL to PostgreSQL for sending reads to replicas

automatically retries if a replica goes down
load-balances with weights

Was the simplest library to understand, and to port to PG
Was running in production
Worked in the multi-threaded environment of our Sidekiq background workers

Special considerations

Application must be tuned to support eventual consistency. Data may not yet be on the replica!

Must explicitly force a fetch from the master DB when it's critical (e.g. right after a user account's creation)

We often use the pattern below: first try the fetch and, if nothing is found, retry on the master DB
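A hedged sketch of that pattern; on_master is a hypothetical helper standing in for whatever your read-splitting library provides to pin a query to the primary:

# Try the (possibly lagging) replica first; if the record isn't there yet,
# retry the same lookup against the master before treating it as missing.
def find_user!(id)
  User.find_by(id: id) ||
    User.on_master { User.find_by(id: id) } ||
    raise(ActiveRecord::RecordNotFound)
end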


Replicas can specialize

Background workers can use a dedicated replica, not shared with the app servers, to optimize the hit rate of the file system cache (ARC) on both replicas

Background workers (Sidekiq / Resque) → PostgreSQL Replica 1, ARC cache warm with background job queries
App servers (Unicorn / Passenger, Ruby VM x N) → PostgreSQL Replicas 2 and 3, ARC cache warm with queries from web traffic
Writes go to the PostgreSQL master

Big heavy reads go there

Long, heavy queries should be run by background jobs against a dedicated replica, to isolate their effect on web traffic

Each type of load will produce a unique set of data cached by the file system

Background workers (Sidekiq / Resque) → dedicated PostgreSQL Replica 1; the PostgreSQL master takes the writes

Step 5:
Use more appropriate tools


Leveraging other tools

Not every type of data is well suited to storing in a relational DB, even though initially it may be convenient

Solr is great for full-text search and deep, paginated, sorted lists, such as trending or related products

Redis is a great data store for transient or semi-persistent data with list, hash or set semantics

We use it for the ActivityFeed by precomputing each feed at write time. But we can regenerate it if the data is lost from Redis (see the sketch below)

We use twemproxy in front of Redis, which provides automatic horizontal sharding and connection pooling

We run clusters of 256 redis shards across many virtual zones; sharded redis instances use many cores, instead of one
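A hedged sketch of write-time feed precomputation; the key names and FEED_LIMIT are illustrative, and twemproxy makes the sharded cluster look like a single Redis endpoint to the client:

require 'redis'

FEED_REDIS = Redis.new(host: 'twemproxy.internal', port: 6379) # placeholder host
FEED_LIMIT = 1_000  # keep feeds bounded; they can be regenerated from the DB if lost

# Fan a new story out to each follower's precomputed feed at write time.
def push_to_feeds(story_id, follower_ids)
  follower_ids.each do |follower_id|
    key = "feed:user:#{follower_id}"
    FEED_REDIS.lpush(key, story_id)
    FEED_REDIS.ltrim(key, 0, FEED_LIMIT - 1)
  end
end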

Back to PostgreSQL

But we still have a single master DB taking all the writes

True story: applying WAL logs on replicas creates significant disk write load

Replicas are unable to both serve live traffic and catch up on replication. They fall behind.

When replicas fall behind, the application generates errors, unable to find data it expects

Step 6:
Move write-heavy tables out:
Replace with non-DB solutions


Move event log out

We discovered from pg_stat_user_tables that the top table by write volume was user_events

We were appending all user events into this table
We were generating millions of rows per day!

We solved it by replacing the user event recording system with rsyslog, appending to ASCII files (see the sketch below)

It's cheap, reliable and scalable

We now use Joyent's Manta to analyze this data in parallel. Manta is an object store with native compute on top
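A hedged sketch of appending events to syslog instead of inserting rows; the program name and event shape are illustrative, and rsyslog is configured separately to write these lines to rotated ASCII files:

require 'syslog/logger'
require 'json'
require 'time'

# One JSON line per user event, shipped through the local syslog socket.
EVENT_LOG = Syslog::Logger.new('user_events')

def record_event(user_id, event_name, properties = {})
  EVENT_LOG.info(
    { user_id: user_id, event: event_name, at: Time.now.utc.iso8601 }
      .merge(properties).to_json
  )
end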

For more information about how we migrated user events to a file-based append-only log, and how we analyze it with Manta, please read:

http://wanelo.ly/event-collection

Step 7:
Tune PostgreSQL and your
Filesystem


Tuning ZFS

Problem: zones (virtual hosts) with write problems appeared to be writing 16 times more data to disk than what the virtual file system reports

vfsstat says 8MB/sec write volume
iostat says 128MB/sec is actually written to disk

So what's going on?

Tuning Filesystem

Turns out the default ZFS block size is 128KB, and the PostgreSQL page size is 8KB.

Every small write that touched a page had to write a 128KB ZFS block to disk

This may be good for huge sequential writes, but not for random access and lots of tiny writes

Tuning ZFS & PgSQL

Solution: Joyent changed the ZFS block size for our zone, and iostat write volume dropped to 8MB/sec

We also added commit_delay

Installing and Configuring PG

Many such settings are pre-defined in our open-source Chef cookbook for installing PostgreSQL from source

https://github.com/wanelo-chef/postgres

It installs PG in e.g. /opt/local/postgresql-9.3.2
It configures its data in /var/pgsql/data93
It allows seamless and safe upgrades of minor or major versions of PostgreSQL, never overwriting binaries

Additional resources online

Josh Berkus's "Five Steps to PostgreSQL Performance" on SlideShare is fantastic
http://www.slideshare.net/PGExperts/five-steps-perform2013

The PostgreSQL wiki pages on performance tuning are excellent
http://wiki.postgresql.org/wiki/Performance_Optimization
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

Run pgbench to determine and compare the performance of systems

Step 8:
Buffer and serialize frequent updates


Counters, counters

Problem: products.saves_count is incremented (by 1) every time someone saves a product

At 200 inserts/sec, that's a lot of updates

Worse: 100s of concurrent requests trying to obtain a row-level lock on the same popular product

How can we reduce the number of writes and the lock contention?

Buffering and serializing

The Sidekiq background job framework has two inter-related features:

scheduling in the future (say, 10 minutes ahead)
the UniqueJob extension

We increment a counter in Redis, and enqueue a job that says "update this product in 10 minutes"

Once every 10 minutes, popular products are updated by adding the value stored in Redis to the database value, and resetting the Redis value to 0 (see the sketch below)
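A hedged sketch of the buffering pattern (class and key names are illustrative, not Wanelo's code); the unique-job behavior is assumed to come from an extension such as sidekiq-unique-jobs:

# On every save: bump a Redis delta and enqueue a delayed, de-duplicated flush.
class ProductSaveCounter
  def self.increment(product_id)
    $redis.incr("product:#{product_id}:saves_delta")
    FlushSavesCountJob.perform_in(10.minutes, product_id)
  end
end

class FlushSavesCountJob
  include Sidekiq::Worker
  sidekiq_options unique: :until_executed  # only one pending flush per product (extension assumed)

  def perform(product_id)
    delta = $redis.getset("product:#{product_id}:saves_delta", 0).to_i
    return if delta.zero?
    # A single UPDATE applies the accumulated delta instead of N tiny increments
    Product.where(id: product_id).update_all(["saves_count = saves_count + ?", delta])
  end
end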

Buffering explained

1. Save Product: enqueue an update request for the product, with a delay (subsequent saves find the update request already on the queue)
2. Increment a counter in the Redis cache
3. Process the job
4. Read the Redis counter & reset it to 0
5. Update the product in PostgreSQL

Buffering conclusions

If we show objects from the database, they might sometimes be behind on the counter. That might be OK

If not, to achieve read consistency, we can display the count as the database value + the Redis value at read time
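And the read-consistency variant, reusing the same illustrative Redis key from the sketch above:

# Displayed count = persisted value + any delta still buffered in Redis
def displayed_saves_count(product)
  product.saves_count + $redis.get("product:#{product.id}:saves_delta").to_i
end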


Step 9:
Optimize DB schema


MVCC does copy-on-write

Problem: PostgreSQL rewrites the row for most updates (some exceptions exist, e.g. updates to a non-indexed column, a counter, a timestamp)

But we often index these columns so we can sort by them
So updates can become expensive on wide tables

Rails' and Hibernate's partial updates do not help here

Are we updating User on each request?

Schema tricks

Solution: split wide tables into several 1:1 tables to reduce the update impact

Much less vacuuming is required when smaller tables are frequently updated

Users
id
email
encrypted_password
reset_password_token
reset_password_sent_at
remember_created_at
sign_in_count
current_sign_in_at
last_sign_in_at
current_sign_in_ip
last_sign_in_ip
confirmation_token
confirmed_at
confirmation_sent_at
unconfirmed_email
failed_attempts
unlock_token
locked_at
authentication_token
created_at
updated_at
username
avatar
state
followers_count
saves_count
collections_count
stores_count
following_count
stories_count

refactor

UserCounts
user_id
followers_count
saves_count
collections_count
stores_count
following_count
stories_count

Users
id
email
created_at
username
avatar
state

UserLogins
user_id
encrypted_password
reset_password_token
reset_password_sent_at
remember_created_at
sign_in_count
current_sign_in_at
last_sign_in_at
current_sign_in_ip
last_sign_in_ip
confirmation_token
confirmed_at
confirmation_sent_at
unconfirmed_email
failed_attempts
unlock_token
locked_at
authentication_token
updated_at

Don't update anything on each request :)
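A hedged Rails sketch of the split; the model names mirror the tables above, while the associations are illustrative:

class User < ActiveRecord::Base
  # Narrow, rarely rewritten core row
  has_one :user_count
  has_one :user_login
end

class UserCount < ActiveRecord::Base
  # Hot counters live in their own small row, so frequent updates stay cheap
  belongs_to :user
end

class UserLogin < ActiveRecord::Base
  # Sign-in bookkeeping, updated on login rather than on every request
  belongs_to :user
end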



Step 10:
Shard Busy Tables Vertically


Vertical sharding

Heavy tables with too many writes can be moved into their own separate database

For us it was saves: now at 2B+ rows

At hundreds of inserts per second, and 4 indexes, we were feeling the pain

It turns out moving a single table out (in Rails) is not a huge effort: it took our team 3 days

Vertical sharding - how to

Update code to point to the new database

Implement any dynamic Rails association methods as real methods with 2 fetches

i.e. save.products becomes a method on the Save model, looking up Products by IDs (see the sketch below)

Update the development and test setup with two primary databases and fix all the tests
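A hedged sketch of both pieces; the saves_database connection name and the helper are illustrative:

class Save < ActiveRecord::Base
  # Point this model at the dedicated saves database (entry in config/database.yml)
  establish_connection :saves_database
end

# The old cross-database association becomes an explicit two-step fetch:
# read saves from the saves DB, then look up products by ID in the main DB.
def products_for(saves)
  Product.where(id: saves.map(&:product_id))
end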


Here the application connects to the main master DB + replicas, and a single dedicated DB for the busy table we moved

Web app → PostgreSQL master (main schema) + PostgreSQL replica (main schema), and a vertically sharded database: PostgreSQL master (split table)

Vertical sharding, deploying

Drop in write IO on the main DB after splitting off the high-IO table onto a dedicated compute node

For a complete and more detailed account of our vertical sharding effort, please read our blog post:

http://wanelo.ly/vertical-sharding

Step 11:
Wrap busy tables with services


Splitting off services

Vertical sharding is a great precursor to a micro-services architecture

We already have Saves in another database, let's migrate it to a lightweight HTTP service

New service: Sinatra, client and server libs, updated tests & development, CI, deployment, all without changing the DB schema

2-3 weeks of effort for a pair of engineers

Adapter pattern to the rescue

Main app (Unicorn w/ Rails) → native client adapter → PostgreSQL, or
Main app (Unicorn w/ Rails) → HTTP client adapter → Service app (Unicorn w/ Sinatra) → PostgreSQL

We used the Adapter pattern to write two client adapters, native and HTTP, so we could use the lib but not yet switch to HTTP
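A hedged sketch of the two adapters behind one interface; class names, paths and the Save model call are illustrative:

require 'net/http'
require 'uri'

module SavesClient
  # Talks to the saves database directly through the ActiveRecord model
  class Native
    def create(user_id:, product_id:, collection_id:)
      Save.create!(user_id: user_id, product_id: product_id, collection_id: collection_id)
    end
  end

  # Same interface, but calls the Sinatra service over HTTP
  class HTTP
    def initialize(base_url)
      @base_url = base_url
    end

    def create(user_id:, product_id:, collection_id:)
      Net::HTTP.post_form(
        URI("#{@base_url}/saves"),
        'user_id' => user_id, 'product_id' => product_id, 'collection_id' => collection_id
      )
    end
  end
end

# SAVES_CLIENT = SavesClient::Native.new                        # before the cutover
# SAVES_CLIENT = SavesClient::HTTP.new('http://saves.internal') # after the cutover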

Services conclusions

Now we can independently scale the service backend, in particular reads, by using replicas

This prepares us for the next inevitable step: horizontal sharding

At the cost of added request latency, lots of extra code, extra runtime infrastructure, and 2 weeks of work

Do this only if you absolutely have to

Step 12:
Shard Services Backend
Horizontally


Horizontal sharding in Ruby

We wanted to stick with PostgreSQL for critical data such as saves

Really liked Instagram's approach with schemas

Built our own schema-based sharding in Ruby, on top of the Sequel gem, and open-sourced it

It supports mapping of physical to logical shards, and connection pooling

https://github.com/wanelo/sequel-schema-sharding

Schema design for sharding

We needed two lookups, by user_id and by product_id, hence we needed two tables, independently sharded

ProductSaves, sharded by product_id:
product_id, user_id, updated_at
index__on_product_id_and_user_id
index__on_product_id_and_updated_at

UserSaves, sharded by user_id:
user_id, product_id, collection_id, created_at
index__on_user_id_and_collection_id

Since saves is a join table between user, product and collection, we did not need a generated unique ID

Composite base62-encoded ID: fpua-1BrV-1kKEt

https://github.com/wanelo/sequel-schema-sharding

Spreading your shards

We split saves into 8192 logical shards, distributed across 8 PostgreSQL databases

Running on 8 virtual zones spanning 2 physical SSD servers, 4 per compute node

Each database has 1024 schemas (times two, because we sharded saves into two tables)

2 x 32-core, 256GB RAM, 16-drive SSD RAID10+2, PostgreSQL 9.3

https://github.com/wanelo/sequel-schema-sharding
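A hedged sketch of the logical-to-physical mapping described above (the gem's real configuration differs; the shard counts match the slides, the names are illustrative):

LOGICAL_SHARDS     = 8192  # fixed forever, so rows never move between logical shards
PHYSICAL_DATABASES = 8     # can grow later by reassigning logical shards

def shard_for(user_id)
  logical  = user_id % LOGICAL_SHARDS
  database = logical % PHYSICAL_DATABASES          # which PostgreSQL database holds it
  schema   = format('user_saves_%04d', logical)    # one schema per logical shard
  { database: "shard_db_#{database}", schema: schema }
end

# shard_for(123456789) => { database: "shard_db_5", schema: "user_saves_3349" }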

Sample configuration of shard mapping to physical nodes with read replicas, supported by the library

How can we migrate the data from the old non-sharded backend to the new sharded backend without long downtime?

Create Save → HTTP Service → read/write on the old non-sharded backend, and also enqueue to the Sidekiq queue → background worker → new sharded backend

New records go to both
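A hedged sketch of the dual-write phase (names are illustrative): the synchronous write still goes to the old backend, and the same attributes are enqueued for a worker to replay into the sharded backend:

class SavesService
  def create(user_id, product_id, collection_id)
    attrs = { user_id: user_id, product_id: product_id, collection_id: collection_id }
    OldSavesStore.create!(attrs)               # source of truth during the migration
    ShardedSaveWriterJob.perform_async(attrs)  # replayed into the new sharded backend
  end
end

class ShardedSaveWriterJob
  include Sidekiq::Worker

  def perform(attrs)
    ShardedSavesStore.upsert(attrs)  # idempotent write, safe to retry
  end
end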



We migrated several times before we got this right

Same dual-write setup, plus a migration script: Create Save → HTTP Service → read/write on the old non-sharded backend; the migration script enqueues old rows to the Sidekiq queue → background worker → migrate old rows into the new sharded backend

Swap old and new backends

Create Save → HTTP Service → read/write on the new sharded backend; writes are also enqueued to the Sidekiq queue → background worker → old non-sharded backend

Horizontal sharding conclusions

This is the final destination of any scalable architecture: just add more boxes

Pretty sure we can now scale to 1,000 or 10,000 inserts/second by scaling out

Took 2 months of 2 engineers' time, including the migration, but zero downtime. It's an advanced-level effort and our engineers really nailed it.

https://github.com/wanelo/sequel-schema-sharding

Putting it all together

This infrastructure complexity is not free

It requires new automation, monitoring, graphing, maintenance and upgrades, and brings with it a new source of bugs

But the advantages are clear when scaling is one of the requirements

In addition, micro-services can be owned by small teams in the future, achieving organizational autonomy

Systems Diagram

iPhone, Android and desktop clients → incoming http requests → Load balancers (nginx + haproxy, 8-core 8GB zones) → App servers + admin servers (Unicorn, main Web/API app, Ruby 2.0; 32-core 32GB high-CPU instances)

MemCached cluster (4-core 16GB zones), accessed via the fault-tolerant Dalli library; one or more nodes can go down

Makara distributes DB load across 3 replicas and 1 master, through pgbouncer

Primary database schema: PostgreSQL 9.2 master with async read replicas (SSD and non-SSD). SSD nodes: Supermicro "Richmond", 32-core 256GB, 16-drive SSD RAID10+2, SSD make: Intel DCS3700, CPU: Intel E5-2690, 2.9GHz

Saves service (Unicorn + Sinatra, behind haproxy): user and product saves, horizontally sharded and replicated; 32-core 256GB RAM, 16-drive SSD RAID10+2, PostgreSQL 9.3

Redis: Sidekiq jobs queue / bus, plus Redis clusters for various custom user feeds, such as the product feed; a Redis proxy cluster (twemproxy, 1-core 1GB zones) fronts redis-001 … redis-256, with 32 redis instances per server (16GB high-mem 4-core zones)

Background worker nodes (Sidekiq, 32-core 32GB high-CPU instances), connecting to the DBs via pgbouncer

Apache Solr clusters: Solr master (updates) + Solr replica (reads), 8GB high-CPU zones

Fastly CDN caches images and JS; Amazon S3 stores product images and user profile pictures

Systems Status: Dashboard

Monitoring & Graphing with Circonus, NewRelic, statsd, nagios

Backend Stack & Key Vendors

MRI Ruby, JRuby, Sinatra, Ruby on Rails

Joyent Cloud, SmartOS, Manta Object Store

PostgreSQL, Solr, redis, twemproxy


memcached, nginx, haproxy, pgbouncer
ZFS, ARC Cache, superb IO, SMF, Zones, DTrace, humans

DynDNS, SendGrid, Chef, SiftScience


LeanPlum, MixPanel, Graphite analytics, A/B Testing
AWS S3 + Fastly CDN for user / product images
Circonus, NewRelic, statsd, Boundary,
PagerDuty, nagios: trending / monitoring / alerting

We are hiring!
DevOps, FullStack, Scaling Experts, iOS & Android


Talk to me after the presentation if you are interested in working on real scalability problems, and on a product used and loved by millions :)


http://wanelo.com/about/play


Or email play@wanelo.com


Thanks!
github.com/wanelo
github.com/wanelo-chef

@kig
@kigster

wanelo technical blog (srsly awsm)
building.wanelo.com
