
SCALING LINKEDIN

A BRIEF HISTORY

Josh Clemm
www.linkedin.com/in/joshclemm

Scaling = replacing all the components of a car while driving it at 100mph

Via Mike Krieger, Scaling Instagram

LinkedIn started back in 2003 to connect you to your network for better job opportunities.
It had 2,700 members in its first week.

First week growth guesses from founding team

Fast forward to today...

[Chart: LinkedIn member growth, 2003-2015, climbing from 0 to over 400M members]

LINKEDIN SCALE TODAY

LinkedIn is a global site with over 400 million members
Web pages and mobile traffic are served at tens of thousands of queries per second
Backend systems serve millions of queries per second

How did we get there?

Let's start from the beginning

LINKEDIN'S ORIGINAL ARCHITECTURE

[Diagram: the LEO monolith in front of a single DB]

Huge monolithic app called Leo
Java, JSP, Servlets, JDBC
Served every page against the same SQL database
Circa 2003

So far so good, but two areas to improve:

1. The growing member-to-member connection graph
2. The ability to search those members

MEMBER CONNECTION GRAPH

Needed to live in-memory for top performance
Used graph traversal queries not suitable for the shared SQL database
Different usage profile than other parts of the site
So, a dedicated service was created. LinkedIn's first service.


MEMBER SEARCH

Social networks need powerful search
Lucene was used on top of our member graph

LinkedIn's second service.

LINKEDIN WITH CONNECTION GRAPH AND SEARCH

[Diagram: LEO calls the Member Graph (Lucene) over RPC and still reads the DB; connection / profile updates feed the graph and search]

Circa 2004

Getting better, but the single database was under heavy load.
Vertically scaling helped, but we needed to offload the read traffic...

REPLICA DBs

Master/slave concept
Read-only traffic served from replicas
Writes go to the main DB
An early version of Databus kept the DBs in sync

[Diagram: Main DB -> Databus relay -> replica DBs]
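The master/slave split above can be sketched as a tiny read/write router. This is a toy illustration, not LinkedIn's actual code; the class and method names are invented, and the synchronous loop stands in for asynchronous Databus replication:

```python
class ReplicatedStore:
    """Toy master/slave router: writes hit the main DB,
    reads round-robin across read-only replicas."""

    def __init__(self, main_db, replicas):
        self.main_db = main_db    # single source of truth for writes
        self.replicas = replicas  # read-only copies kept in sync
        self._next = 0

    def write(self, key, value):
        # All writes go to the main DB; replication follows.
        self.main_db[key] = value
        for replica in self.replicas:  # stand-in for a Databus-style relay
            replica[key] = value

    def read(self, key):
        # Round-robin reads across replicas to offload the main DB.
        replica = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return replica.get(key)


store = ReplicatedStore({}, [{}, {}, {}])
store.write("member:1", "Reid")
print(store.read("member:1"))  # -> Reid
```

The key property is that read traffic never touches the master, which is why the approach buys time before the master's own write limits are reached.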

REPLICA DBs TAKEAWAYS

Good medium-term solution
We could vertically scale servers for a while
Master DBs have finite scaling limits
These days, LinkedIn DBs use partitioning

LINKEDIN WITH REPLICA DBs

[Diagram: LEO performs R/W against the Main DB and R/O against replica DBs, which a Databus relay keeps in sync; connection updates flow to the Member Graph, profile updates to Search]

Circa 2006

As LinkedIn continued to grow, the monolithic application Leo was becoming problematic.
Leo was difficult to release, hard to debug, and the site kept going down...

IT WAS TIME TO...

Kill LEO

SERVICE ORIENTED ARCHITECTURE

Extracting services (Java Spring MVC) from the legacy Leo monolithic application

[Diagram: Recruiter Web App, Public Profile Web App, Profile Service, and yet another service split out of LEO]

Circa 2008 onward

SERVICE ORIENTED ARCHITECTURE

[Diagram: Profile Web App -> Profile Service -> Profile DB]

Goal - create vertical stacks of stateless services
Frontend servers fetch data from many domains and build the HTML or JSON response
Mid-tier services host APIs and business logic
Data-tier or back-tier services encapsulate data domains

EXAMPLE MULTI-TIER ARCHITECTURE AT LINKEDIN

[Diagram: Browser / App -> Frontend Web App -> mid-tier and content services for the Profile, Connections, and Groups domains -> data services (incl. Edu) backed by DB, Voldemort, and Hadoop; events flow to Kafka]
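The frontend tier's job in this picture can be sketched as a fan-out-and-assemble step. This is a hedged illustration: the domain names mirror the diagram, but the functions and their payloads are invented stand-ins for real mid-tier RPCs:

```python
import json

# Stand-ins for stateless mid-tier domain services.
def profile_service(member_id):
    return {"name": "Josh", "headline": "Engineer"}

def connections_service(member_id):
    return {"count": 500}

def groups_service(member_id):
    return {"groups": ["Java", "Scala"]}

def frontend_profile_page(member_id):
    """Frontend web app: fan out to several mid-tier domains,
    then build one JSON response for the browser / app."""
    page = {
        "profile": profile_service(member_id),
        "connections": connections_service(member_id),
        "groups": groups_service(member_id),
    }
    return json.dumps(page)

print(frontend_profile_page(42))
```

Because every tier is stateless, any of these services can be scaled horizontally on its own, which is exactly the "pro" the next slide lists; the fan-out itself is the corresponding "con."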

SERVICE ORIENTED ARCHITECTURE COMPARISON

PROS
Stateless services easily scale
Decoupled domains
Build and deploy independently

CONS
Ops overhead
Introduces backwards compatibility issues
Leads to complex call graphs and fanout

SERVICES AT LINKEDIN

In 2003, LinkedIn had one service (Leo)
By 2010, LinkedIn had over 150 services
Today in 2015, LinkedIn has over 750 services

bash$ eh -e %%prod | awk -F. '{ print $2 }' | sort | uniq | wc -l
756

Getting better, but LinkedIn was experiencing hypergrowth...

CACHING

[Diagram: Frontend Web App and Mid-tier Service each fronted by a cache, in front of the DB]

Simple way to reduce load on servers and speed up responses
Mid-tier caches store derived objects from different domains and reduce fanout
Caches also live in the data layer
We use memcache, couchbase, even Voldemort
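The usual shape of these caches is the cache-aside pattern. A minimal sketch, with in-memory dicts standing in for a real cache cluster and database; the function names are invented:

```python
cache = {}                                  # stand-in for memcache/couchbase
database = {"member:1": {"name": "Josh"}}   # stand-in for the backing DB

def get_member(key):
    """Cache-aside read: check the cache first, fall back to the DB,
    then populate the cache for the next request."""
    if key in cache:
        return cache[key]
    value = database[key]   # services must still handle this full-load path
    cache[key] = value
    return value

def update_member(key, value):
    """Update the DB and drop the stale cache entry --
    the invalidation step is where the complexity lives."""
    database[key] = value
    cache.pop(key, None)
```

The `update_member` path hints at why the next slides matter: once several mid-tier caches hold derived objects from multiple domains, knowing which entries to drop on a write becomes genuinely hard.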


There are only two hard problems in Computer Science: cache invalidation, naming things, and off-by-one errors.

Via Twitter by Kellan Elliott-McCrea and later Jonathan Feinberg

CACHING TAKEAWAYS

Caches are easy to add in the beginning, but the complexity adds up over time
Over time LinkedIn removed many mid-tier caches because of the complexity around invalidation
We kept caches closer to the data layer

CACHING TAKEAWAYS (cont.)

Services must handle the full load - caches improve speed; they are not permanent load-bearing solutions
We'll use a low-latency solution like Voldemort when appropriate and precompute results

LinkedIn's hypergrowth was extending to the vast amounts of data it collected.
Individual pipelines to route that data weren't scaling. A better solution was needed...

KAFKA MOTIVATIONS

LinkedIn generates a ton of data
Pageviews
Edits on profiles, companies, schools
Logging, timing
Invites, messaging
Tracking
Billions of events every day
Separate and independently created pipelines routed this data

A WHOLE LOT OF CUSTOM PIPELINES...


As LinkedIn needed to scale, each pipeline needed to scale.

KAFKA

Distributed pub-sub messaging platform as LinkedIn's universal data pipeline

[Diagram: frontend and backend services publish to Kafka; the DWH, Oracle, monitoring, analytics, and Hadoop all consume from it]

KAFKA AT LINKEDIN

BENEFITS
Enabled near-realtime access to any data source
Empowered Hadoop jobs
Allowed LinkedIn to build realtime analytics
Vastly improved site monitoring capability
Enabled devs to visualize and track call graphs
Over 1 trillion messages published per day, 10 million messages per second
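The core idea that makes one universal pipeline possible is the partitioned, append-only log with per-consumer offsets. A toy in-process sketch of that idea (not Kafka's API; the class is invented for illustration):

```python
from collections import defaultdict

class ToyLog:
    """In-process sketch of a Kafka-style topic: an append-only log
    that many independent consumers read at their own offsets."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, event):
        self.topics[topic].append(event)

    def consume(self, topic, offset):
        # Each consumer (DWH, monitoring, Hadoop...) tracks its own
        # offset, so adding a consumer never requires a new pipeline.
        return self.topics[topic][offset:]


log = ToyLog()
log.publish("pageviews", {"member": 1, "page": "/feed"})
log.publish("pageviews", {"member": 2, "page": "/jobs"})
print(len(log.consume("pageviews", 0)))  # -> 2
```

Decoupling producers from consumers this way is what collapsed the "whole lot of custom pipelines" into one: producers publish once, and any number of downstream systems replay the log independently.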

OVER 1 TRILLION PUBLISHED DAILY
Let's end with the modern years

REST.LI

Services extracted from Leo or created new were inconsistent and often tightly coupled
Rest.li was our move to a data-model-centric architecture
It ensured a consistent stateless RESTful API model across the company

REST.LI (cont.)

By using JSON over HTTP, our new APIs supported non-Java-based clients
By using Dynamic Discovery (D2), we got load balancing, discovery, and scalability of each service API
Today, LinkedIn has 1130+ Rest.li resources and over 100 billion Rest.li calls per day

REST.LI (cont.)

[Image: Rest.li automatic API documentation]

REST.LI (cont.)

[Image: Rest.li R2/D2 tech stack]

LinkedIn's success with data infrastructure like Kafka and Databus led to the development of more and more scalable data infrastructure solutions...

DATA INFRASTRUCTURE

It was clear LinkedIn could build data infrastructure that enables long-term growth
LinkedIn doubled down on infra solutions like:
Storage solutions - Espresso, Voldemort, Ambry (media)
Analytics solutions like Pinot
Streaming solutions - Kafka, Databus, and Samza
Cloud solutions like Helix and Nuage

DATABUS

LinkedIn is a global company and was continuing to see large growth. How else to scale?

MULTIPLE DATA CENTERS

Natural progression of horizontal scaling
Replicate data across many data centers using storage technology like Espresso
Pin users to a geographically close data center
Difficult but necessary

MULTIPLE DATA CENTERS

Multiple data centers are imperative to maintain high availability
You need to avoid any single point of failure - not just for each service, but for the entire site
LinkedIn runs out of three main data centers, with additional PoPs around the globe, and more coming online every day...

MULTIPLE DATA CENTERS

[Map: LinkedIn's operational setup as of 2015 - circles represent data centers, diamonds represent PoPs]

Of course LinkedIn's scaling story is never this simple, so what else have we done?

WHAT ELSE HAVE WE DONE?

Each of LinkedIn's critical systems has undergone its own rich history of scale (graph, search, analytics, profile backend, comms, feed)
LinkedIn uses Hadoop / Voldemort for insights like People You May Know, Similar Profiles, Notable Alumni, and profile browse maps

WHAT ELSE HAVE WE DONE? (cont.)

Re-architected the frontend approach using:
Client templates
BigPipe
Play Framework
LinkedIn added multiple tiers of proxies using Apache Traffic Server and HAProxy
We improved server performance with new hardware, advanced system tuning, and newer Java runtimes

Scaling sounds easy and quick to do, right?

Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.

Via Douglas Hofstadter, Gödel, Escher, Bach: An Eternal Golden Braid

THANKS!
Josh Clemm
www.linkedin.com/in/joshclemm

LEARN MORE

Blog version of this slide deck
https://engineering.linkedin.com/architecture/brief-history-scaling-linkedin

Visual story of LinkedIn's history
https://ourstory.linkedin.com/

LinkedIn Engineering blog
https://engineering.linkedin.com

LinkedIn Open-Source
https://engineering.linkedin.com/open-source

LinkedIn's communication system slides, which include the earliest LinkedIn architecture
http://www.slideshare.net/linkedin/linkedins-communication-architecture

Slides which include the earliest LinkedIn data infra work
http://www.slideshare.net/r39132/linkedin-data-infrastructure-qcon-london-2012

LEARN MORE (cont.)

Project Inversion - internal project to enable developer productivity (trunk-based model), faster deploys, unified services
http://www.bloomberg.com/bw/articles/2013-04-10/inside-operation-inversion-the-codefreeze-that-saved-linkedin

LinkedIn's use of Apache Traffic Server
http://www.slideshare.net/thenickberry/reflecting-a-year-after-migrating-to-apache-trafficserver

Multi data center - testing failovers
https://www.linkedin.com/pulse/armen-hamstra-how-he-broke-linkedin-got-promotedangel-au-yeung

LEARN MORE - KAFKA

History and motivation around Kafka
http://www.confluent.io/blog/stream-data-platform-1/

Thinking about streaming solutions as a commit log
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineershould-know-about-real-time-datas-unifying

Kafka enabling monitoring and alerting
http://engineering.linkedin.com/52/autometrics-self-service-metrics-collection

Kafka enabling real-time analytics (Pinot)
http://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot

Kafka's current use and future at LinkedIn
http://engineering.linkedin.com/kafka/kafka-linkedin-current-and-future

Kafka processing 1 trillion events per day
https://engineering.linkedin.com/apache-kafka/how-we_re-improving-and-advancingkafka-linkedin

LEARN MORE - DATA INFRASTRUCTURE

Open sourcing Databus
https://engineering.linkedin.com/data-replication/open-sourcing-databus-linkedins-lowlatency-change-data-capture-system

Samza streams to help LinkedIn view call graphs
https://engineering.linkedin.com/samza/real-time-insights-linkedins-performance-usingapache-samza

Real-time analytics (Pinot)
http://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot

Introducing the Espresso data store
http://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-newdistributed-document-store

LEARN MORE - FRONTEND TECH

LinkedIn's use of client templates
Dust.js
http://www.slideshare.net/brikis98/dustjs
Profile
http://engineering.linkedin.com/profile/engineering-new-linkedin-profile

BigPipe on LinkedIn's homepage
http://engineering.linkedin.com/frontend/new-technologies-new-linkedin-home-page

Play Framework
Introduction at LinkedIn
https://engineering.linkedin.com/play/composable-and-streamable-play-apps
Switching to a non-blocking asynchronous model
https://engineering.linkedin.com/play/play-framework-async-io-without-thread-pooland-callback-hell

LEARN MORE - REST.LI

Introduction to Rest.li and how it helps LinkedIn scale
http://engineering.linkedin.com/architecture/restli-restful-service-architecture-scale

How Rest.li expanded across the company
http://engineering.linkedin.com/restli/linkedins-restli-moment

LEARN MORE - SYSTEM TUNING

JVM memory tuning
http://engineering.linkedin.com/garbage-collection/garbage-collection-optimization-highthroughput-and-low-latency-java-applications

System tuning
http://engineering.linkedin.com/performance/optimizing-linux-memory-managementlow-latency-high-throughput-databases

Optimizing JVM tuning automatically
https://engineering.linkedin.com/java/optimizing-java-cms-garbage-collections-itsdifficulties-and-using-jtune-solution

WE'RE HIRING

LinkedIn continues to grow quickly and there's still a ton of work we can do to improve.
We're working on problems that very few ever get to solve - come join us!