
How Netflix Delivers Software

July 8th, 2014
Email: jedberg@{gmail,netflix}.com
Twitter: @jedberg
Web: www.jedberg.net
Facebook: facebook.com/jedberg
Linkedin: www.linkedin.com/in/jedberg
When your software fails...
will your system survive?
The Netflix way

Fully automated build tools to test and make packages

Fully automated machine image bakery

Fully automated image deployment

Everything is built for three

Independent teams responsible for both Dev and Ops

Redundancy through multi-region deployment
The Netflix way
Philosophy

We hire responsible adults and keep rules and policies to a minimum

Developers can change any code in production at any time

And things don't break (usually)
Freedom and
Responsibility
Automate all the things!
http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html

Application startup

Configuration

Code deployment

System deployment
Automate all the things!

Standard base image

Tools to manage all the systems

Reduce errors through reproducibility
Automation
Shared state should be stored in a shared service

Data on an instance should be replicated to other instances
Build for three
We hold a boot camp for new
engineers to teach them how to
build for a highly distributed
environment.
Build for three
We hold a boot camp for new
engineers to teach them how to
build for a highly distributed
environment.
>?< "@5A"@*6
.%B@%,5, C%. 6(D
5" :-3
6%C%*6%*E$%,
!"#$%
'()*+,
-%.,"*(/$0()"*
1*+$*%
2,%. 3*4"
!"#$%
!%5(6(5(
7$8$/(.
!"#$%,
'%#$%9,
:;< =%,5
1*+$*%
?< .%B@%,5, C%.
6(D
$*5" 5F% G%H/$I
:-3
Discovery
API
Streaming
API
Movie Ratings
Personalization Engine
User Info
Movie Metadata
Similar Movies
Reviews
A/B Test Engine
Discovery
API
Streaming
API
Content
Encoding
CDN
Management
QOS
Logging
DRM
OpenConnect
Edge Locations
Browse
Play
Watch

Services are built by different teams who work together to figure out what each service will provide.

The service owner publishes an API that anyone can use.
Highly aligned, loosely
coupled

Easier auto-scaling

Easier capacity planning

Identify problematic code paths more easily

Narrow down the effects of a change

More efficient local caching


Advantages to a Service
Oriented Architecture

Developers deploy when they want

They also manage their own capacity and autoscaling

And fix anything that breaks at 4am!
Freedom and
Responsibility
All system choices assume some part will fail at some point.

Simulate things that go wrong

Find things that are different
The Monkey Theory
Execution
AWS
Netflix OSS
Netflix Application Code
AWS
Netflix OSS
YOUR Application Code

Instances

Machine Images

Elastic IPs

Load Balancers

Security groups / Autoscaling


What AWS Provides
AWS
AWS
Netflix OSS
YOUR Application Code

Service Oriented
Architecture

HTTP/REST interfaces between services
Netflix built a global PaaS
Netflix OSS

Supports all regions and zones

Multiple accounts

Cross region/account replication

Internationalized, localized and GeoIP routed

Advanced key management

Autoscaling with 1000s of instances

Monitoring and alerting on millions of metrics


Netflix PaaS features
Netflix OSS
Open Source at Netflix
Netflix OSS
Be liberal in what you accept, strict in what you send
Circuit Breakers
(Hystrix)
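
A circuit breaker stops calling a dependency that keeps failing and serves a fallback instead. The sketch below is a minimal illustration of that pattern in Python, not Hystrix's actual API; the class name, the thresholds, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Fail fast once a dependency has failed too many times in a row."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # hypothetical threshold
        self.reset_after = reset_after    # hypothetical cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        # While the circuit is open, skip the dependency until the
        # cool-down has passed, then let a single trial call through.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
        try:
            result = func()
        except Exception:
            self.failures += 1
            # Open (or re-open) the circuit after too many failures
            # or after a failed trial call.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote request, for example `breaker.call(fetch_ratings, lambda: [])`, where `fetch_ratings` is a hypothetical function, so that a failing service degrades the response instead of blocking it.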

Simulate things
that go wrong

Find things that


are different
The Monkey Theory

Chaos -- Kills random instances

Chaos Gorilla -- Kills zones

Chaos Kong -- Kills regions

Latency -- Degrades network and injects faults

Conformity -- Looks for outliers
The simian army

Circus -- Kills and launches instances to maintain zone balance

Doctor -- Fixes unhealthy resources

Janitor -- Cleans up unused resources

Howler -- Yells about bad things like Amazon limit violations

Security -- Finds security issues and expiring certificates
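
The "kills random instances" idea can be sketched in a few lines. This is a toy selection routine and not the Chaos Monkey implementation; the function name, the group layout, and the rule of skipping single-instance groups are assumptions for illustration, and nothing here talks to AWS.

```python
import random


def pick_victims(instances_by_group, rng=random):
    """Choose at most one random instance to terminate in each group."""
    victims = {}
    for group, instances in instances_by_group.items():
        # Assumed safety rule: skip groups with no redundancy to exercise.
        if len(instances) < 2:
            continue
        victims[group] = rng.choice(sorted(instances))
    return victims


# Hypothetical instance IDs grouped by auto scaling group.
groups = {"api": ["i-01", "i-02", "i-03"], "batch": ["i-09"]}
print(pick_victims(groups, random.Random(7)))
```

Passing a seeded `random.Random` makes a drill repeatable, which helps when comparing how a service behaved across runs.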
Netflix OSS
Blueprint for the rest of
the platform libraries
Pluggable architecture

On instance software load balancer

Zone aware / Zone affinity

Handles retry logic
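
A client-side load balancer with zone affinity and retries might look roughly like the sketch below. It is an illustration of the idea only; the server dictionaries, the `send` callback, and the `max_attempts` parameter are assumptions and do not reflect the real library's interface.

```python
import random


def call_with_retry(servers, my_zone, send, max_attempts=3, rng=random):
    """Try same-zone servers first, then fall back to other zones."""
    local = [s for s in servers if s["zone"] == my_zone]
    remote = [s for s in servers if s["zone"] != my_zone]
    # Shuffle within each tier so load spreads across equivalent servers.
    rng.shuffle(local)
    rng.shuffle(remote)
    last_error = None
    for server in (local + remote)[:max_attempts]:
        try:
            return send(server)
        except ConnectionError as err:
            last_error = err  # remember the failure and try the next server
    raise ConnectionError("all attempts failed") from last_error
```

Keeping this logic in the caller's process avoids an extra network hop and lets each service tune its own retry budget.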

Global variables

Support for staged rollout

Feature flags
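
One common way to combine feature flags with a staged rollout is to hash each user into a stable bucket and enable the flag for a growing percentage of buckets. The sketch below assumes that approach; the function name and the 100-bucket scheme are illustrative and are not taken from the Netflix library.

```python
import hashlib


def is_enabled(flag, user_id, rollout_percent):
    """Deterministically place a user in one of 100 buckets for a flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    # Users in buckets below the rollout percentage see the feature.
    return bucket < rollout_percent
```

Because the bucket depends only on the flag and the user, raising `rollout_percent` from 10 to 50 keeps the feature on for everyone who already had it.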
Netflix OSS

Application to instance mapping

Heartbeat to keep track of health
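
The two bullets above describe a service registry: instances announce themselves with heartbeats, and clients look up the live instances for an application. A minimal in-memory sketch follows; the class, its method names, and the 90-second expiry are assumptions for illustration rather than the real service's behavior.

```python
import time


class Registry:
    """Track which instances of each application have checked in recently."""

    def __init__(self, ttl=90.0, clock=time.monotonic):
        self.ttl = ttl  # hypothetical expiry window in seconds
        self.clock = clock
        self.last_seen = {}  # (app, instance_id) -> time of last heartbeat

    def heartbeat(self, app, instance_id):
        self.last_seen[(app, instance_id)] = self.clock()

    def instances(self, app):
        # Only instances that sent a heartbeat within the TTL count as healthy.
        cutoff = self.clock() - self.ttl
        return sorted(
            instance
            for (name, instance), seen in self.last_seen.items()
            if name == app and seen >= cutoff
        )
```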


DQ Transport Routing
Suro
etc
Eventbus
Druid
Netflix OSS
Why Bake?
Generic AMI
Instance
Traditional:
launch OS
install packages
install app
Netflix:
launch OS + app
App AMI
Instance
Getting Baked
Perforce / Git
libraries
source
Ant targets
Ivy
Groovy all over
app bundles
Jenkins
sync
resolve
build compile report
publish test
Artifactory
snapshot / release
libraries / apps
Base
Image
Baking
Yum / Apt
Linux: CentOS, Fedora, Ubuntu
RPMs: Apache, Java...
ec2 slave instances
S3 / EBS
foundation
AMI
base
AMI
Bakery
mount
install
Ready
for
app
bake
snapshot
AWS
App
Image
Baking
Jenkins / Yum /
Artifactory
Linux, Apache, Java, Tomcat
AWS
app bundle
ec2 slave instances
S3 / EBS
base AMI
app
AMI
Bakery
mount
install
Ready
to launch!
snapshot
app
AMI
Linux Base AMI (CentOS or Ubuntu)
Java
Tomcat
Optional
Apache
Monitoring

Log Rotation
to S3
monitoring
GC and
thread dump
logging
Application war file, base servlet, platform, interface jars for dependent services
Healthcheck, status servlets, JMX interface, Servo autoscale
app
AMI
Application war file
Linux Base AMI (CentOS or Ubuntu)
Java
JBoss
Optional
Apache
Monitoring

Log Rotation
to S3
monitoring
GC and
thread dump
logging
Application war file, base servlet, platform, interface jars for dependent services
Healthcheck, status servlets, JMX interface, Servo autoscale
app
AMI
Linux Base AMI (CentOS or Ubuntu)
Python
Bottle
Optional
Apache
Monitoring

Log Rotation
to S3
monitoring
logging
Application file, base server, platform, interface libs for dependent services
app
AMI
Netflix OSS
Deploying Code; Step 1
Auto Scaling
Group
Launch
Configuration
Security
Group
Amazon Machine
Image
Instances
Load
Balancer
Netflix has moved the granularity from the instance to the cluster
Data is the most important asset Netflix has. It's what differentiates us from our competitors.
Netflix OSS
EVCache

Wrapper on top of memcached

Automatically replicates writes to multiple regions

Pulls cache data intelligently via zone affinity
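
The write-everywhere, read-nearby behavior described above can be sketched with plain dictionaries standing in for memcached nodes. This is a toy model under that assumption; the class and method names are hypothetical and do not mirror the EVCache client API.

```python
class ReplicatedCache:
    """Write to every location, read from the local one first."""

    def __init__(self, zones, local_zone):
        # One dict per zone stands in for that zone's memcached nodes.
        self.stores = {zone: {} for zone in zones}
        self.local_zone = local_zone

    def set(self, key, value):
        # Replicate every write to all locations.
        for store in self.stores.values():
            store[key] = value

    def get(self, key):
        # Zone affinity: prefer the copy closest to the caller.
        local = self.stores[self.local_zone]
        if key in local:
            return local[key]
        # Fall back to any other location that still has the key.
        for zone, store in self.stores.items():
            if zone != self.local_zone and key in store:
                return store[key]
        return None
```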
Cassandra

Availability over consistency

Writes over reads

We know Java

Open source + support


Why Cassandra?

Priam

Zero touch auto-config

State management

Token assignment

Node replacement

Backup/restore to/from S3
Using Cassandra at Netflix

Astyanax

OO abstraction
to Cassandra

Multi-region
support
Cassandra Architecture
Going Multi-region

100% uptime is theoretically


possible.

You have to replicate your data

This will cost money


Leveraging Multi-region
us-east-1 us-west-2 etc
eu-west-1
us-east-1 us-west-2 etc
eu-west-1
us-east-1 us-west-2 etc
eu-west-1
What's going on?!
Atlas

alerting
api
api
Central
Event
Gateway
Paging
Service
Amazon
SES
CORE
Agent
Other
Teams
Agent
CORE
Agent
Alert Systems
Central
Event
Gateway

Parse raw alerts, match application to owner

Add image captures and links to related graphs for easy mobile use

Send to the right service based on priority

Register the event in Chronos, the timeline application

Correlate low priority alerts and generate new high priority alerts
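
The last step, turning a cluster of low priority alerts into one high priority alert, could be sketched as below. The alert dictionary shape, the per-application grouping, and the threshold of three are assumptions for illustration; the slides do not say how the gateway correlates alerts.

```python
from collections import Counter


def correlate(alerts, threshold=3):
    """Escalate applications with at least `threshold` low priority alerts."""
    low_counts = Counter(a["app"] for a in alerts if a["priority"] == "low")
    return [
        {"app": app, "priority": "high", "reason": f"{count} low priority alerts"}
        for app, count in sorted(low_counts.items())
        if count >= threshold
    ]
```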
Metrics in Production

796B daily metric points

Peaks at 1.4B / min

50% daily metric churn
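
To put the two volume figures on the same scale, the daily total can be converted to a per-minute average and compared with the peak. The short calculation below uses only the numbers on the slide; the variable names are illustrative.

```python
daily_points = 796e9      # metric points per day, from the slide
peak_per_minute = 1.4e9   # peak points per minute, from the slide

average_per_minute = daily_points / (24 * 60)
peak_to_average = peak_per_minute / average_per_minute

print(f"average: {average_per_minute / 1e6:.0f}M points/min")
print(f"peak is {peak_to_average:.1f}x the average")
```

The average works out to roughly 553M points per minute, so the peak is about 2.5 times the average load.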
What is a metric?
com.netflix.eds.nccp.successful.requests.uiversion.nccprt-authorization.devtypid-101.clver-PHL_0AB.uiver-UI_169_mid.geo-US
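
One way to read a name like this is as a dotted metric name followed by `key-value` tags. The parser below assumes that any dot-separated segment containing a hyphen is a tag; that interpretation is a guess based on the example, not a documented format.

```python
def parse_metric(raw):
    """Split a dotted metric string into a base name and key/value tags."""
    name_parts, tags = [], {}
    for part in raw.split("."):
        key, sep, value = part.partition("-")
        if sep:
            tags[key] = value  # segments like "geo-US" become tags
        else:
            name_parts.append(part)
    return ".".join(name_parts), tags


name, tags = parse_metric(
    "com.netflix.eds.nccp.successful.requests.uiversion."
    "nccprt-authorization.devtypid-101.clver-PHL_0AB.uiver-UI_169_mid.geo-US"
)
print(name)
print(tags)
```

Each distinct combination of tag values is a separate time series, which helps explain how the number of metric points grows so large.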
How we built it

Built our own big data


system

Based on S3 and EMR

Fewer copies, lower resolution, and slower retrieval based on age of data
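
An age-based storage policy of this kind can be expressed as a small lookup table. Every number and label in the sketch below is hypothetical; the slides give no actual age cutoffs, copy counts, or resolutions.

```python
# (max_age_days, copies, resolution_seconds, retrieval) -- all values hypothetical
TIERS = [
    (7, 3, 60, "fast"),
    (30, 2, 300, "medium"),
    (None, 1, 3600, "slow"),  # None means "everything older"
]


def tier_for(age_days):
    """Return the storage policy that applies to data of the given age."""
    for max_age, copies, resolution_s, retrieval in TIERS:
        if max_age is None or age_days <= max_age:
            return {
                "copies": copies,
                "resolution_s": resolution_s,
                "retrieval": retrieval,
            }
```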
Self Serve is the Key

Developers choose
what metrics to
submit

What graphs they


put on their
dashboards

What to alert on
Example Alert Config
Atlas
When something breaks..
Breakdown of an outage
Is something wrong? Alerting
Where is the problem? Telemetry and Dashboards
What changed? ???
Breakdown of an outage
Is something wrong? Alerting
Where is the problem? Telemetry and Dashboards
What changed? Change control?
Change control, the good

Tells you what changed

Tells you what's about to change

Great for coordination


when one change gates
another change
Change control, the bad

It's manual

It expresses intent, not


reality

It forces you to
serialize your changes
to an extent
Breakdown of an outage
Is something wrong? Alerting
Where is the problem? Telemetry and Dashboards
What changed? Chronos
(Some of) Netflix is open source:
https://netflix.github.io
Just a quick reminder...
Netflix is hiring!
If you like what you see here,
feel free to reach out!
Questions?
Getting in touch
Email: jedberg@{gmail,netflix}.com
Twitter: @jedberg
Web: www.jedberg.net
Facebook: facebook.com/jedberg
Linkedin: www.linkedin.com/in/jedberg
