
12-Step Program for Scaling Web Applications on PostgreSQL
Konstantin Gredeskoul
CTO, Wanelo.com




@kig
@kigster

What does it mean to scale on top of PostgreSQL?

And why should you care?


Scaling means supporting more workload concurrently, where "work" is often interchangeable with "users"

But why on PostgreSQL?
Because NoNoSQL is hawt! (again)

Relational databases are great at supporting constant change in software

They are not as great at auto-scaling, like Riak or Cassandra

So the choice critically depends on what you are trying to build

The vast majority of applications are represented well by the relational model

So if I need to build a new product or service, my default choice would be PostgreSQL for critical data, plus whatever else as needed

This presentation is a walk-through filled with practical solutions

It's based on the story of scaling wanelo.com to sustain tens of thousands of concurrent users and 3K req/sec

But first, let's explore the application to learn a bit about Wanelo before starting our scalability journey

Founded in 2010, Wanelo (wah-nee-loh, from Want, Need, Love) is a community and a social network for all of the world's shopping.

Wanelo is home to 12M products, millions of users, 200K+ stores, and products on Wanelo have been saved into collections over 2B times

Early on we wanted to:

move fast with product development


scale as needed, stay ahead of the curve
keep overall costs low
but spend where it matters
automate everything
avoid reinventing the wheel
learn as we go
remain in control of our infrastructure

Heroku or Not?

Assuming we want full control of our application layer, places like Heroku aren't a great fit

But Heroku can be a great place to start. It all depends on the size and complexity of the app we are building.

Ours would have been cost-prohibitive.

Foundations of web apps

programming language + framework (RoR)


app server (we use unicorn)
scalable web server in front (we use nginx)
database (we use postgresql)
hosting environment (we use Joyent Cloud)
deployment tools (capistrano)
server configuration tools (we use chef)
many others, such as monitoring, alerting

Let's review a basic web app


incoming http → nginx → N x Unicorn / Passenger (Ruby VM), serving /home/user/app/current/public → PostgreSQL server (/var/pgsql/data)

no redundancy, no caching (yet)

can only process N concurrent requests

nginx will serve static assets, deal with slow clients

web sessions probably in the DB or cookie



First optimizations:
cheap early on, well worth it

Personalization via AJAX, so controller actions can be cached entirely using caches_action

Page returned unpersonalized; an additional AJAX request loads personalization
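A minimal sketch of this pattern, assuming a hypothetical ProductsController and saved_product_ids helper (neither is from the deck); the cached action serves the shared page, and a tiny uncached endpoint returns the per-user bits via AJAX:

class ProductsController < ApplicationController
  # Cache the fully rendered, unpersonalized page
  # (caches_action is built into Rails 3, or via the actionpack-action_caching gem in Rails 4+)
  caches_action :show, expires_in: 10.minutes

  def show
    @product = Product.find(params[:id])
  end

  # Small uncached endpoint the page calls via AJAX to personalize itself
  def personalization
    render json: { saved: current_user.saved_product_ids.include?(params[:id].to_i) }
  end
end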


A few more basic performance tweaks that go a long way

Install 2+ memcached servers for caching and use the Dalli gem to connect to them for redundancy

Switch to memcached-based web sessions. Use sessions sparingly, assume they are transient

Redis is also an option, but I prefer memcached for redundancy

Set up a CDN for asset_host and any user-generated content. We use fastly.com
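A hedged sketch of what those tweaks look like in a Rails app; the hostnames and CDN URL are placeholders:

# config/environments/production.rb
# Two memcached servers via Dalli; keys are spread across both,
# and with failover enabled one of them can go away.
config.cache_store = :mem_cache_store, 'memcached-1:11211', 'memcached-2:11211',
                     { failover: true, expires_in: 1.hour }

# Sessions ride on the same memcached-backed Rails cache (transient by design)
config.session_store :cache_store, key: '_app_session', expire_after: 2.weeks

# Serve assets (and user-generated content) from a CDN
config.action_controller.asset_host = 'https://cdn.example.com'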


Caching goes a long way


browser → CDN (cache images, JS) / nginx → N x Unicorn / Passenger (Ruby VM) → memcached + PostgreSQL server (/home/user/app/current/public)

geo-distribute and cache your UGC and CSS/JS assets

cache HTML and serialized objects in memcached

can increase TTL to alleviate load if traffic spikes

Adding basic redundancy

Multiple load balancers require DNS round-robin and a short TTL (dyn.com)

Multiple app servers require haproxy between nginx and unicorn

Multiple long-running tasks (such as posting to Facebook or Twitter) require a background job processing framework

incoming http → DNS round-robin or a failover / HA solution → Load balancers (nginx + haproxy) → App servers (Unicorn / Passenger, Ruby VM x N) → Data stores, transient to permanent: memcached, redis, a single PostgreSQL DB; plus background workers (Sidekiq / Resque), a CDN (cache images, JS) and an object store for user-generated content

this architecture can horizontally scale as far as the database at its center allows

every other component can be scaled by adding more of it, to handle more traffic

As long as we can scale the data store on the backend, we can scale the app!

Mostly :)

At some point we may hit a limit on TCP/IP network throughput or the number of connections, but that is at a whole other level of scale

The traffic keeps climbing

Performance limits are near

First signs of performance problems start creeping up

Symptoms of read scalability problems
Pages load slowly or time out
Some pages load (cached?), some don't
Users are getting 503 Service Unavailable
Database is slammed (very high CPU or read IO)

Symptoms of write scalability problems
Database write IO is maxed out, CPU is not
Update operations are waiting on each other, piling up
Application locks up, timeouts
Replicas are not catching up

Both situations may easily result in downtime

Even though we achieved 99.99% uptime in 2013, in 2014 we had a couple of short downtimes, caused by an overloaded replica, that lasted around 5 minutes.

But users quickly notice

Perhaps not :)


12-Step Program
for curing your dependency on slow application latency

Common patterns for scaling high-traffic web applications, based on wanelo.com

What's a good latency?

If your app is high-traffic (100K+ RPM), I recommend 80ms or lower

For small / fast HTTP services, 10-12ms or lower

CPU burn vs Waiting on IO?

Web services + Solr (25ms), memcached (15ms), database (6ms) are all waiting on IO

Ruby VM (30ms) + garbage collection (6ms) is CPU burn, easy to scale by adding more app servers

Step 1:
Add More Cache!


Moar Cache!!!

Cache is cheap and fast (memcached)


Anything that can be cached, should be
Cache hit = many database hits avoided
Hit rate of 17% still saves DB hits
We can cache many types of things

Cache many types of things

caches_action in controllers is very effective

fragment caches of reusable widgets

we use the Compositor gem for our JSON API. We cache serialized object fragments, grab them from memcached using multi_get and merge them

Shopify open-sourced IdentityCache, which caches AR models, so you can Product.fetch(id)

https://github.com/wanelo/compositor
https://github.com/Shopify/identity_cache
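A hedged sketch of the multi-get-and-merge idea, not Compositor's actual API; the cache keys and ProductSerializer are illustrative:

# Fetch pre-serialized JSON fragments for a set of products in one memcached
# round trip, fill in any misses, and return a merged id => fragment hash.
def product_fragments(product_ids)
  keys  = product_ids.index_by { |id| "product/#{id}/v1" }
  found = Rails.cache.read_multi(*keys.keys)

  keys.each_with_object({}) do |(key, id), merged|
    merged[id] = found[key] ||
                 Rails.cache.fetch(key) { ProductSerializer.new(Product.find(id)).as_json }
  end
end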

But Caching has its issues

Expiring cache is not easy


CacheSweepers in Rails help

We found ourselves doing 4,000 memcached deletes in a single request!

Could defer expiring caches to background jobs (see the sketch below), or use TTL where possible

But we can cache even outside of our app:
we cache JSON API responses using a CDN (fastly.com)
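A minimal sketch of pushing cache expiry into a background job, assuming Sidekiq; the key list is whatever the sweeper would have deleted inline:

class CacheExpiryJob
  include Sidekiq::Worker
  sidekiq_options queue: :low

  # Delete a batch of cache keys outside of the request cycle, so a single
  # user action never waits on thousands of memcached deletes.
  def perform(keys)
    keys.each { |key| Rails.cache.delete(key) }
  end
end

# In the model callback / sweeper:
# CacheExpiryJob.perform_async(keys_to_expire)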


Step 2:
Optimize SQL


SQL Optimization

Find slow SQL (>100ms) and either remove it, cache the hell out of it, or fix/rewrite the query

Enable the slow query log in postgresql.conf:

log_min_duration_statement = 100   # log statements slower than 100ms

pg_stat_statements is an invaluable contrib module:

Fixing Slow Query

Run an explain plan to understand how the DB executes the query

Are there adequate indexes for the query? Is the database using the appropriate index? Has the table been recently analyzed?

Can a complex join be simplified into a subselect?

Can this query use an index-only scan?

Can the ORDER BY column be added to the index?

pg_stat_user_indexes and pg_stat_user_tables for seq scans, unused indexes, cache info

SQL Optimization, ctd.

Instrumentation software such as NewRelic shows slow queries, with explain plans, and time-consuming transactions

SQL Optimization: Example


One day, I noticed lots of temp files created in the postgres.log

Let's run this query

This join takes a whole second to return :(

Follows table

Stories table

So our index is partial, only on state = 'active'

But state isn't used in the query. A bug?
So this query is a full table scan

Let's add state = 'active' to the query
It was meant to be there anyway
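In Rails terms the fix looks roughly like this (model and column names follow the slides; the scope itself is illustrative). With the state predicate back in the query, the partial index can be used instead of a full table scan:

class Follow < ActiveRecord::Base
  # Matches the partial index, which only covers rows WHERE state = 'active'
  scope :active, -> { where(state: 'active') }
end

# Before: no state predicate, partial index unusable => sequential scan
#   Follow.where(user_id: user.id)
# After: predicate matches the index condition => index scan
#   Follow.active.where(user_id: user.id)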


Step 3:
Upgrade Hardware and RAM


Hardware + RAM

Sounds obvious, but better or faster hardware is an easy win when scaling out

Large RAM will be used as file system cache

On Joyent's SmartOS, the ARC file system cache is very effective

shared_buffers should be set to 25% of RAM or 12GB, whichever is smaller

Using a fast SSD disk array can make a huge difference

Joyent's native 16-disk RAID managed by ZFS, instead of a hardware controller, provides excellent performance


Hardware in the cloud

SSD offerings from Joyent and AWS

Joyent's max SSD node: $12.90/hr
AWS max SSD node: $6.80/hr

So who's better?

AWS: 8 SSD drives, SSD make: ?, CPU: E5-2670 2.6GHz
Joyent: 16 SSD drives (RAID10 + 2), SSD make: DCS3700, CPU: E5-2690 2.9GHz

Perhaps you get what you pay for after all.

Step 4:
Scale Reads by Replication


Scale Reads by Replication

postgresql.conf (both master & replica)

These settings have been tuned for SmartOS and our application requirements (thanks PGExperts!)

How to distribute reads?

Some people have success using this setup for reads:
app → haproxy → pgBouncer (x2) → replica (x2)

I'd like to try this method eventually, but we chose to deal with distributing read traffic at the application level

We tried many Ruby-based solutions that claimed to do this well, but many weren't production-ready

Makara is a Ruby gem from TaskRabbit that we ported from MySQL to PostgreSQL for sending reads to replicas

automatically retries if a replica goes down
load-balances with weights

Was the simplest library to understand, and to port to PG
Was running in production
Worked in the multi-threaded environment of our Sidekiq background workers

Special considerations

Application must be tuned to support eventual consistency. Data may not yet be on the replica!

Must explicitly force a fetch from the master DB when it's critical (e.g. right after a user account's creation)

We often use the pattern below: first try the fetch and, if nothing is found, retry on the master DB
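A hedged sketch of that pattern; on_master is a hypothetical helper standing in for whatever your read-splitting library provides to pin a query to the primary:

# Try the (possibly lagging) replica first; if the record isn't there yet,
# retry the same lookup against the master before treating it as missing.
def find_user!(id)
  User.find_by(id: id) ||
    User.on_master { User.find_by(id: id) } ||
    raise(ActiveRecord::RecordNotFound)
end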


Replicas can specialize

Background workers can use a dedicated replica, not shared with the app servers, to optimize the hit rate of the file system cache (ARC) on both replicas

Background workers (Sidekiq / Resque) → PostgreSQL Replica 1, ARC cache warm with background job queries
App servers (Unicorn / Passenger, Ruby VM x N) → PostgreSQL Replicas 2 and 3, ARC cache warm with queries from web traffic
Writes go to the PostgreSQL master

Big heavy reads go there

Long, heavy queries should be run by background jobs against a dedicated replica, to isolate their effect on web traffic

Each type of load will produce a unique set of data cached by the file system

Background workers (Sidekiq / Resque) → dedicated PostgreSQL Replica 1; the PostgreSQL master takes the writes

Step 5:
Use more appropriate tools


Leveraging other tools

Not every type of data is well suited to storing in a relational DB, even though initially it may be convenient

Solr is great for full-text search and deep, paginated, sorted lists, such as trending or related products

Redis is a great data store for transient or semi-persistent data with list, hash or set semantics

We use it for the ActivityFeed by precomputing each feed at write time. But we can regenerate it if the data is lost from Redis (see the sketch below)

We use twemproxy in front of Redis, which provides automatic horizontal sharding and connection pooling

We run clusters of 256 redis shards across many virtual zones; sharded redis instances use many cores, instead of one
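A hedged sketch of write-time feed precomputation; the key names and FEED_LIMIT are illustrative, and twemproxy makes the sharded cluster look like a single Redis endpoint to the client:

require 'redis'

FEED_REDIS = Redis.new(host: 'twemproxy.internal', port: 6379) # placeholder host
FEED_LIMIT = 1_000  # keep feeds bounded; they can be regenerated from the DB if lost

# Fan a new story out to each follower's precomputed feed at write time.
def push_to_feeds(story_id, follower_ids)
  follower_ids.each do |follower_id|
    key = "feed:user:#{follower_id}"
    FEED_REDIS.lpush(key, story_id)
    FEED_REDIS.ltrim(key, 0, FEED_LIMIT - 1)
  end
end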

Back to PostgreSQL

But we still have a single master DB taking all the writes

True story: applying WAL logs on replicas creates significant disk write load

Replicas are unable to both serve live traffic and catch up on replication. They fall behind.

When replicas fall behind, the application generates errors, unable to find data it expects

Step 6:
Move write-heavy tables out:
Replace with non-DB solutions


Move event log out

We discovered from pg_stat_user_tables that the top table by write volume was user_events

We were appending all user events into this table
We were generating millions of rows per day!

We solved it by replacing the user event recording system with rsyslog, appending to ASCII files (see the sketch below)

It's cheap, reliable and scalable

We now use Joyent's Manta to analyze this data in parallel. Manta is an object store with native compute on top
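A hedged sketch of appending events to syslog instead of inserting rows; the program name and event shape are illustrative, and rsyslog is configured separately to write these lines to rotated ASCII files:

require 'syslog/logger'
require 'json'
require 'time'

# One JSON line per user event, shipped through the local syslog socket.
EVENT_LOG = Syslog::Logger.new('user_events')

def record_event(user_id, event_name, properties = {})
  EVENT_LOG.info(
    { user_id: user_id, event: event_name, at: Time.now.utc.iso8601 }
      .merge(properties).to_json
  )
end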

For more information about how we migrated user events to a file-based append-only log, and how we analyze it with Manta, please read:

http://wanelo.ly/event-collection

Step 7:
Tune PostgreSQL and your
Filesystem


Tuning ZFS

Problem: zones (virtual hosts) with write problems appeared to be writing 16 times more data to disk than what the virtual file system reports

vfsstat says 8MB/sec write volume
iostat says 128MB/sec is actually written to disk

So what's going on?

Tuning Filesystem

Turns out the default ZFS block size is 128KB, and the PostgreSQL page size is 8KB.

Every small write that touched a page had to write a 128KB ZFS block to disk

This may be good for huge sequential writes, but not for random access and lots of tiny writes

Tuning ZFS & PgSQL

Solution: Joyent changed the ZFS block size for our zone, and iostat write volume dropped to 8MB/sec

We also added commit_delay

Installing and Configuring PG

Many such settings are pre-defined in our open-source Chef cookbook for installing PostgreSQL from source

https://github.com/wanelo-chef/postgres

It installs PG in e.g. /opt/local/postgresql-9.3.2
It configures its data in /var/pgsql/data93
It allows seamless and safe upgrades of minor or major versions of PostgreSQL, never overwriting binaries

Additional resources online

Josh Berkus's "Five Steps to PostgreSQL Performance" on SlideShare is fantastic
http://www.slideshare.net/PGExperts/five-steps-perform2013

The PostgreSQL wiki pages on performance tuning are excellent
http://wiki.postgresql.org/wiki/Performance_Optimization
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

Run pgbench to determine and compare the performance of systems

Step 8:
Buffer and serialize frequent updates


Counters, counters

Problem: products.saves_count is incremented (by 1) every time someone saves a product

At 200 inserts/sec, that's a lot of updates

Worse: 100s of concurrent requests trying to obtain a row-level lock on the same popular product

How can we reduce the number of writes and the lock contention?

Buffering and serializing

The Sidekiq background job framework has two inter-related features:

scheduling in the future (say, 10 minutes ahead)
the UniqueJob extension

We increment a counter in Redis, and enqueue a job that says "update this product in 10 minutes"

Once every 10 minutes, popular products are updated by adding the value stored in Redis to the database value, and resetting the Redis value to 0 (see the sketch below)
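A hedged sketch of the buffering pattern (class and key names are illustrative, not Wanelo's code); the unique-job behavior is assumed to come from an extension such as sidekiq-unique-jobs:

# On every save: bump a Redis delta and enqueue a delayed, de-duplicated flush.
class ProductSaveCounter
  def self.increment(product_id)
    $redis.incr("product:#{product_id}:saves_delta")
    FlushSavesCountJob.perform_in(10.minutes, product_id)
  end
end

class FlushSavesCountJob
  include Sidekiq::Worker
  sidekiq_options unique: :until_executed  # only one pending flush per product (extension assumed)

  def perform(product_id)
    delta = $redis.getset("product:#{product_id}:saves_delta", 0).to_i
    return if delta.zero?
    # A single UPDATE applies the accumulated delta instead of N tiny increments
    Product.where(id: product_id).update_all(["saves_count = saves_count + ?", delta])
  end
end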

Buffering explained

1. Save Product: enqueue an update request for the product, with a delay (subsequent saves find the update request already on the queue)
2. Increment a counter in the Redis cache
3. Process the job
4. Read the Redis counter & reset it to 0
5. Update the product in PostgreSQL

Buffering conclusions

If we show objects from the database, they might sometimes be behind on the counter. That might be OK

If not, to achieve read consistency, we can display the count as the database value + the Redis value at read time
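And the read-consistency variant, reusing the same illustrative Redis key from the sketch above:

# Displayed count = persisted value + any delta still buffered in Redis
def displayed_saves_count(product)
  product.saves_count + $redis.get("product:#{product.id}:saves_delta").to_i
end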


Step 9:
Optimize DB schema


MVCC does copy-on-write

Problem: PostgreSQL rewrites the row for most updates (some exceptions exist, e.g. updates to a non-indexed column, a counter, a timestamp)

But we often index these columns so we can sort by them
So updates can become expensive on wide tables

Rails' and Hibernate's partial updates do not help here

Are we updating User on each request?

Schema tricks

Solution: split wide tables into several 1:1 tables to reduce the update impact

Much less vacuuming is required when smaller tables are frequently updated

Users
id
email
encrypted_password
reset_password_token
reset_password_sent_at
remember_created_at
sign_in_count
current_sign_in_at
last_sign_in_at
current_sign_in_ip
last_sign_in_ip
confirmation_token
confirmed_at
confirmation_sent_at
unconfirmed_email
failed_attempts
unlock_token
locked_at
authentication_token
created_at
updated_at
username
avatar
state
followers_count
saves_count
collections_count
stores_count
following_count
stories_count

refactor

UserCounts
user_id
followers_count
saves_count
collections_count
stores_count
following_count
stories_count

Users
id
email
created_at
username
avatar
state

UserLogins
user_id
encrypted_password
reset_password_token
reset_password_sent_at
remember_created_at
sign_in_count
current_sign_in_at
last_sign_in_at
current_sign_in_ip
last_sign_in_ip
confirmation_token
confirmed_at
confirmation_sent_at
unconfirmed_email
failed_attempts
unlock_token
locked_at
authentication_token
updated_at

Don't update anything on each request :)
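A hedged Rails sketch of the split; the model names mirror the tables above, while the associations are illustrative:

class User < ActiveRecord::Base
  # Narrow, rarely rewritten core row
  has_one :user_count
  has_one :user_login
end

class UserCount < ActiveRecord::Base
  # Hot counters live in their own small row, so frequent updates stay cheap
  belongs_to :user
end

class UserLogin < ActiveRecord::Base
  # Sign-in bookkeeping, updated on login rather than on every request
  belongs_to :user
end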



Step 10:
Shard Busy Tables Vertically


Vertical sharding

Heavy tables with too many writes can be moved into their own separate database

For us it was saves: now at 2B+ rows

At hundreds of inserts per second, and 4 indexes, we were feeling the pain

It turns out moving a single table out (in Rails) is not a huge effort: it took our team 3 days

Vertical sharding - how to

Update code to point to the new database

Implement any dynamic Rails association methods as real methods with 2 fetches

i.e. save.products becomes a method on the Save model, looking up Products by IDs (see the sketch below)

Update the development and test setup with two primary databases and fix all the tests
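A hedged sketch of both pieces; the saves_database connection name and the helper are illustrative:

class Save < ActiveRecord::Base
  # Point this model at the dedicated saves database (entry in config/database.yml)
  establish_connection :saves_database
end

# The old cross-database association becomes an explicit two-step fetch:
# read saves from the saves DB, then look up products by ID in the main DB.
def products_for(saves)
  Product.where(id: saves.map(&:product_id))
end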


Here the application connects to the main master DB + replicas, and a single dedicated DB for the busy table we moved

Web app → PostgreSQL master (main schema) + PostgreSQL replica (main schema), and a vertically sharded database: PostgreSQL master (split table)

Vertical sharding, deploying

Drop in write IO on the main DB after splitting off the high-IO table onto a dedicated compute node

For a complete and more detailed account of our vertical sharding effort, please read our blog post:

http://wanelo.ly/vertical-sharding

Step 11:
Wrap busy tables with services


Splitting off services

Vertical sharding is a great precursor to a micro-services architecture

We already have Saves in another database, let's migrate it to a lightweight HTTP service

New service: Sinatra, client and server libs, updated tests & development, CI, deployment, all without changing the DB schema

2-3 weeks of effort for a pair of engineers

Adapter pattern to the rescue

Main app (Unicorn w/ Rails) → native client adapter → PostgreSQL, or
Main app (Unicorn w/ Rails) → HTTP client adapter → Service app (Unicorn w/ Sinatra) → PostgreSQL

We used the Adapter pattern to write two client adapters, native and HTTP, so we could use the lib but not yet switch to HTTP
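A hedged sketch of the two adapters behind one interface; class names, paths and the Save model call are illustrative:

require 'net/http'
require 'uri'

module SavesClient
  # Talks to the saves database directly through the ActiveRecord model
  class Native
    def create(user_id:, product_id:, collection_id:)
      Save.create!(user_id: user_id, product_id: product_id, collection_id: collection_id)
    end
  end

  # Same interface, but calls the Sinatra service over HTTP
  class HTTP
    def initialize(base_url)
      @base_url = base_url
    end

    def create(user_id:, product_id:, collection_id:)
      Net::HTTP.post_form(
        URI("#{@base_url}/saves"),
        'user_id' => user_id, 'product_id' => product_id, 'collection_id' => collection_id
      )
    end
  end
end

# SAVES_CLIENT = SavesClient::Native.new                        # before the cutover
# SAVES_CLIENT = SavesClient::HTTP.new('http://saves.internal') # after the cutover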

Services conclusions

Now we can independently scale the service backend, in particular reads, by using replicas

This prepares us for the next inevitable step: horizontal sharding

At the cost of added request latency, lots of extra code, extra runtime infrastructure, and 2 weeks of work

Do this only if you absolutely have to

Step 12:
Shard Services Backend
Horizontally


Horizontal sharding in Ruby

We wanted to stick with PostgreSQL for critical data such as saves

Really liked Instagram's approach with schemas

Built our own schema-based sharding in Ruby, on top of the Sequel gem, and open-sourced it

It supports mapping of physical to logical shards, and connection pooling

https://github.com/wanelo/sequel-schema-sharding

Schema design for sharding

We needed two lookups, by user_id and by product_id, hence we needed two tables, independently sharded

ProductSaves, sharded by product_id:
product_id, user_id, updated_at
index__on_product_id_and_user_id
index__on_product_id_and_updated_at

UserSaves, sharded by user_id:
user_id, product_id, collection_id, created_at
index__on_user_id_and_collection_id

Since saves is a join table between user, product and collection, we did not need a generated unique ID

Composite base62-encoded ID: fpua-1BrV-1kKEt

https://github.com/wanelo/sequel-schema-sharding

Spreading your shards

We split saves into 8192 logical shards, distributed across 8 PostgreSQL databases

Running on 8 virtual zones spanning 2 physical SSD servers, 4 per compute node

Each database has 1024 schemas (times two, because we sharded saves into two tables)

2 x 32-core, 256GB RAM, 16-drive SSD RAID10+2, PostgreSQL 9.3

https://github.com/wanelo/sequel-schema-sharding
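A hedged sketch of the logical-to-physical mapping described above (the gem's real configuration differs; the shard counts match the slides, the names are illustrative):

LOGICAL_SHARDS     = 8192  # fixed forever, so rows never move between logical shards
PHYSICAL_DATABASES = 8     # can grow later by reassigning logical shards

def shard_for(user_id)
  logical  = user_id % LOGICAL_SHARDS
  database = logical % PHYSICAL_DATABASES          # which PostgreSQL database holds it
  schema   = format('user_saves_%04d', logical)    # one schema per logical shard
  { database: "shard_db_#{database}", schema: schema }
end

# shard_for(123456789) => { database: "shard_db_5", schema: "user_saves_3349" }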

Sample configuration of shard mapping to physical nodes with read replicas, supported by the library

How can we migrate the data from the old non-sharded backend to the new sharded backend without long downtime?

Create Save → HTTP Service → read/write on the old non-sharded backend, and also enqueue to the Sidekiq queue → background worker → new sharded backend

New records go to both
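A hedged sketch of the dual-write phase (names are illustrative): the synchronous write still goes to the old backend, and the same attributes are enqueued for a worker to replay into the sharded backend:

class SavesService
  def create(user_id, product_id, collection_id)
    attrs = { user_id: user_id, product_id: product_id, collection_id: collection_id }
    OldSavesStore.create!(attrs)               # source of truth during the migration
    ShardedSaveWriterJob.perform_async(attrs)  # replayed into the new sharded backend
  end
end

class ShardedSaveWriterJob
  include Sidekiq::Worker

  def perform(attrs)
    ShardedSavesStore.upsert(attrs)  # idempotent write, safe to retry
  end
end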



We migrated several times before we got this right

Same dual-write setup, plus a migration script: Create Save → HTTP Service → read/write on the old non-sharded backend; the migration script enqueues old rows to the Sidekiq queue → background worker → migrate old rows into the new sharded backend

Swap old and new backends

Create Save → HTTP Service → read/write on the new sharded backend; writes are also enqueued to the Sidekiq queue → background worker → old non-sharded backend

Horizontal sharding conclusions

This is the final destination of any scalable architecture: just add more boxes

Pretty sure we can now scale to 1,000 or 10,000 inserts/second by scaling out

Took 2 months of 2 engineers' time, including the migration, but zero downtime. It's an advanced-level effort and our engineers really nailed it.

https://github.com/wanelo/sequel-schema-sharding

Putting it all together

This infrastructure complexity is not free

It requires new automation, monitoring, graphing, maintenance and upgrades, and brings with it a new source of bugs

But the advantages are clear when scaling is one of the requirements

In addition, micro-services can be owned by small teams in the future, achieving organizational autonomy

Systems Diagram

iPhone, Android and desktop clients → incoming http requests → Load balancers (nginx + haproxy, 8-core 8GB zones) → App servers + admin servers (Unicorn, main Web/API app, Ruby 2.0; 32-core 32GB high-CPU instances)

MemCached cluster (4-core 16GB zones), accessed via the fault-tolerant Dalli library; one or more nodes can go down

Makara distributes DB load across 3 replicas and 1 master, through pgbouncer

Primary database schema: PostgreSQL 9.2 master with async read replicas (SSD and non-SSD). SSD nodes: Supermicro "Richmond", 32-core 256GB, 16-drive SSD RAID10+2, SSD make: Intel DCS3700, CPU: Intel E5-2690, 2.9GHz

Saves service (Unicorn + Sinatra, behind haproxy): user and product saves, horizontally sharded and replicated; 32-core 256GB RAM, 16-drive SSD RAID10+2, PostgreSQL 9.3

Redis: Sidekiq jobs queue / bus, plus Redis clusters for various custom user feeds, such as the product feed; a Redis proxy cluster (twemproxy, 1-core 1GB zones) fronts redis-001 … redis-256, with 32 redis instances per server (16GB high-mem 4-core zones)

Background worker nodes (Sidekiq, 32-core 32GB high-CPU instances), connecting to the DBs via pgbouncer

Apache Solr clusters: Solr master (updates) + Solr replica (reads), 8GB high-CPU zones

Fastly CDN caches images and JS; Amazon S3 stores product images and user profile pictures

Systems Status: Dashboard

Monitoring & Graphing with Circonus, NewRelic, statsd, nagios

Backend Stack & Key Vendors

MRI Ruby, JRuby, Sinatra, Ruby on Rails

Joyent Cloud, SmartOS, Manta Object Store

PostgreSQL, Solr, redis, twemproxy


memcached, nginx, haproxy, pgbouncer
ZFS, ARC Cache, superb IO, SMF, Zones, DTrace, humans

DynDNS, SendGrid, Chef, SiftScience


LeanPlum, MixPanel, Graphite analytics, A/B Testing
AWS S3 + Fastly CDN for user / product images
Circonus, NewRelic, statsd, Boundary,
PagerDuty, nagios: trending / monitoring / alerting

We are hiring!
DevOps, FullStack, Scaling Experts, iOS & Android


Talk to me after the presentation if you are interested in working on real scalability problems, and on a product used and loved by millions :)


http://wanelo.com/about/play


Or email play@wanelo.com


Thanks!
github.com/wanelo
github.com/wanelo-chef

@kig
@kigster

wanelo technical blog (srsly awsm)
building.wanelo.com
