Chicago DevOps Meetup

How Netflix Delivers Software

July 8th, 2014
Email: jedberg@{gmail,netflix}.com
Twitter: @jedberg
Web: www.jedberg.net
Facebook: facebook.com/jedberg
Linkedin: www.linkedin.com/in/jedberg
When your software fails...
will your system survive?
The Netflix way
Fully automated build tools to test and make packages
Fully automated machine image bakery
Fully automated image deployment
Everything is built for three
Independent teams responsible for both Dev and Ops
Redundancy through multi-region deployment
The Netflix way
Philosophy
We hire responsible adults and keep rules and policies to a minimum
Developers can change any code in production at any time
And things don't break (usually)
Freedom and
Responsibility
Automate all the things!
http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html
Application startup
Configuration
Code deployment
System deployment
Automate all the things!
Standard base image
Tools to manage all the systems
Reduce errors through reproducibility
Automation
Shared state should be stored in a shared service
Data on an instance should be replicated to other instances
Build for three
We hold a boot camp for new
engineers to teach them how to
build for a highly distributed
environment.
[Diagram: requests per day into the Netflix API, and outbound requests per day to API dependencies; the request counts are not legible in this copy]
Movie Ratings
Personalization Engine
User Info
Movie Metadata
Similar Movies
Reviews
A/B Test Engine
Discovery API
Streaming API
Movie Ratings
Personalization Engine
User Info
Movie Metadata
Similar Movies
Reviews
A/B Test Engine
Discovery API
Streaming API
Content Encoding
CDN Management
QOS Logging
DRM
OpenConnect
Edge Locations
Browse
Play
Watch
Services are built by different teams who work together to figure out what each service will provide.
The service owner publishes an API that anyone can use.
Highly aligned, loosely coupled
Easier auto-scaling
Easier capacity planning
Identify problematic code-paths more easily
Narrow down the effects of a change
More efficient local caching
Advantages to a Service Oriented Architecture
Developers deploy when they want
They also manage their own capacity and autoscaling
And fix anything that breaks at 4am!
Freedom and
Responsibility
All system choices assume some part will fail at some point.
Simulate things that go wrong
Find things that are different
The Monkey Theory
Execution
AWS
Netflix OSS
Netflix Application Code
AWS
Netflix OSS
YOUR Application Code
Instances
Machine Images
Elastic IPs
Load Balancers
Security groups / Autoscaling

What AWS Provides
AWS
AWS
Netflix OSS
YOUR Application Code
Service Oriented Architecture
HTTP/REST interfaces between services
Netflix built a global PaaS
Netflix OSS
Supports all regions and zones
Multiple accounts
Cross region/account replication
Internationalized, localized and GeoIP routed
Advanced key management
Autoscaling with 1000s of instances
Monitoring and alerting on millions of metrics

Netflix PaaS features
Netflix OSS
Open Source at Netflix
Netflix OSS
Be liberal in what you accept, strict in what you send
Circuit Breakers
(Hystrix)
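
To make the circuit breaker idea concrete, here is a minimal sketch in Python. It is not Hystrix's API: the class, its parameters, and the fallback convention are all made up for illustration. After a configurable number of consecutive failures the breaker opens and calls fail fast (or use a fallback) until a reset timeout passes, at which point one trial call is allowed through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and no fallback was supplied."""


class CircuitBreaker:
    # Hypothetical illustration; names and defaults are assumptions.
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: skip the dependency entirely and fail fast.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial call.
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = self.clock()
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The point of the pattern is that a struggling dependency stops receiving traffic from its callers, and callers get a quick answer (possibly a degraded one) instead of waiting on timeouts.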
Simulate things that go wrong
Find things that are different
The Monkey Theory
Chaos -- Kills random instances
Chaos Gorilla -- Kills zones
Chaos Kong -- Kills regions
Latency -- Degrades network and injects faults
Conformity -- Looks for outliers
The simian army
Circus -- Kills and launches instances to maintain zone balance
Doctor -- Fixes unhealthy resources
Janitor -- Cleans up unused resources
Howler -- Yells about bad things like Amazon limit violations
Security -- Finds security issues and expiring certificates
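
The "Chaos" monkey above boils down to picking a random instance and killing it. A rough sketch of that selection logic follows; it is not the Simian Army code. The function names, the `groups` mapping, and the injected `terminate` callback are hypothetical stand-ins for whatever cloud API would do the real work.

```python
import random


def pick_victim(groups, rng=random):
    """Pick one (group, instance) pair at random.

    groups: dict mapping a group name to a list of instance ids
    (a hypothetical shape chosen for this sketch).
    """
    candidates = {name: ids for name, ids in groups.items() if ids}
    if not candidates:
        return None
    name = rng.choice(sorted(candidates))
    return name, rng.choice(candidates[name])


def unleash(groups, terminate, rng=random):
    """Terminate one random instance via the supplied callback."""
    victim = pick_victim(groups, rng)
    if victim is not None:
        terminate(*victim)
    return victim
```

Passing `terminate` in as a callback keeps the sketch runnable without any cloud credentials; in a dry run it can simply log the victim.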
Netflix OSS
Blueprint for the rest of the platform libraries
Pluggable architecture
On instance software load balancer
Zone aware / Zone affinity
Handles retry logic
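
A small sketch of what an on-instance, zone-aware balancer with retries might look like. This is an illustration only, not the Netflix library: the class name, the `(zone, address)` server tuples, and the `send` callback are assumptions. Servers in the caller's own zone are tried first in round-robin order, and a failed attempt moves on to the next candidate.

```python
class ZoneAwareBalancer:
    # Hypothetical illustration of client-side, zone-aware load balancing.
    def __init__(self, servers, local_zone):
        # servers: list of (zone, address) tuples
        self.local = [addr for zone, addr in servers if zone == local_zone]
        self.remote = [addr for zone, addr in servers if zone != local_zone]
        self._next = 0

    def candidates(self):
        """Same-zone servers first (rotated round-robin), then other zones."""
        if self.local:
            start = self._next % len(self.local)
            self._next += 1
            rotated = self.local[start:] + self.local[:start]
        else:
            rotated = []
        return rotated + self.remote

    def call(self, send, max_attempts=3):
        """Try up to max_attempts servers, retrying on ConnectionError."""
        last_error = None
        for address in self.candidates()[:max_attempts]:
            try:
                return send(address)
            except ConnectionError as exc:
                last_error = exc
        raise last_error or ConnectionError("no servers available")
```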
Global variables
Support for staged rollout
Feature flags
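
One common way to combine feature flags with a staged rollout is to hash each user into a stable bucket and enable the flag for buckets below the current rollout percentage. The sketch below assumes that approach; the slides do not say how Netflix implements it, and `in_rollout` and its parameters are hypothetical names.

```python
import hashlib


def in_rollout(flag, user_id, percent):
    """Return True if this user falls inside the flag's rollout percentage.

    Hashing flag and user together gives each user a stable bucket
    in [0, 100) that is independent from one flag to the next.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

Because the bucket is stable, raising the percentage from 10 to 50 only adds users; nobody who already had the feature loses it mid-rollout.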
Netflix OSS
Application to instance mapping
Heartbeat to keep track of health
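
The two bullets above (application-to-instance mapping plus heartbeats) can be sketched as a tiny in-memory registry. This is a toy under stated assumptions, not the actual Netflix service: the `Registry` class, the 90-second TTL, and the method names are invented for illustration.

```python
import time


class Registry:
    # Hypothetical in-memory service registry with heartbeat expiry.
    def __init__(self, ttl=90.0, clock=time.monotonic):
        self.ttl = ttl          # seconds an instance stays listed without a heartbeat
        self.clock = clock
        self._last_seen = {}    # (app, instance) -> time of last heartbeat

    def heartbeat(self, app, instance):
        """Register the instance, or refresh it if already known."""
        self._last_seen[(app, instance)] = self.clock()

    def instances(self, app):
        """Return instances of app whose last heartbeat is within the TTL."""
        now = self.clock()
        return sorted(
            inst
            for (name, inst), seen in self._last_seen.items()
            if name == app and now - seen <= self.ttl
        )
```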

DQ Transport Routing
Suro
etc
Eventbus
Druid
Netflix OSS
Why Bake?
Generic AMI
Instance
Traditional:
launch OS
install packages
install app
Netflix:
launch OS
+app
App AMI
Instance
Getting Baked
Perforce / Git
libraries
source
Ant targets
Ivy
Groovy all over
app bundles
Jenkins
sync
resolve
build compile report
publish test
Artifactory
snapshot / release
libraries / apps
Base Image Baking
Yum / Apt
Linux: CentOS, Fedora, Ubuntu
RPMs: Apache, Java...
ec2 slave instances
S3 / EBS
foundation
AMI
base
AMI
Bakery
mount
install
Ready
for
app
bake
snapshot
AWS
App Image Baking
Jenkins / Yum /
Artifactory
Linux, Apache, Java, Tomcat
AWS
app bundle
ec2 slave instances
S3 / EBS
base AMI
app
AMI
Bakery
mount
install
Ready
to launch!
snapshot
app
AMI
Linux Base AMI (CentOS or Ubuntu)
Java
Tomcat
Optional
Apache
Monitoring

Log Rotation
to S3
monitoring
GC and
thread dump
logging
Application war file, base
servlet, platform, interface
jars for dependent
services
Healthcheck, status
servlets, JMX interface,
Servo autoscale
Java
Tomcat
Optional
Apache
Monitoring

Log Rotation
to S3
monitoring
GC and
thread dump
logging
jars for dependent
services
Healthcheck, status
Servo autoscale
app
AMI
Application war file
Java
JBoss
Optional
Apache
Monitoring

Log Rotation
to S3
monitoring
GC and
thread dump
logging
jars for dependent
services
Healthcheck, status
Servo autoscale
app
AMI
Python
Bottle
Optional
Apache
Monitoring

Log Rotation
to S3
monitoring
logging
Application file, base
server, platform, interface
libs for dependent services
app
AMI
Netflix OSS
Deploying Code; Step 1
Auto Scaling Group
Launch Configuration
Security Group
Amazon Machine Image
Instances
Load Balancer
Netflix has moved the granularity from the instance to the cluster
Data is the most important asset Netflix has. It's what differentiates us from our competitors.
Netflix OSS
EVCache
Wrapper on top of memcached
Automatically replicates writes to multiple regions
Pulls cache data intelligently via zone affinity
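
The behavior described above can be sketched with plain dictionaries standing in for memcached clusters. This is a toy model, not EVCache's client API: `ReplicatedCache`, its `locations` argument, and the read-repair step are assumptions made for the example. Writes fan out to every location; reads prefer the local copy and only go remote on a miss.

```python
class ReplicatedCache:
    # Hypothetical model: one dict per location stands in for a cache cluster.
    def __init__(self, locations, local):
        self.stores = {name: {} for name in locations}
        self.local = local

    def set(self, key, value):
        """Replicate the write to every location."""
        for store in self.stores.values():
            store[key] = value

    def get(self, key):
        """Read locally first; on a miss, try other locations and refill."""
        local_store = self.stores[self.local]
        if key in local_store:
            return local_store[key]
        for name, store in self.stores.items():
            if name != self.local and key in store:
                local_store[key] = store[key]  # repopulate the local copy
                return store[key]
        return None
```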
Cassandra
Availability over consistency
Writes over reads
We know Java
Open source + support

Why Cassandra?
Priam
Zero touch auto-config
State management
Token assignment
Node replacement
Backup/restore to/from S3
Using Cassandra at Netflix
Astyanax
OO abstraction
to Cassandra
Multi-region
support
Cassandra Architecture
Going Multi-region
100% uptime is theoretically possible.
You have to replicate your data
This will cost money

Leveraging Multi-region
us-east-1 us-west-2 etc
eu-west-1
eu-west-1
eu-west-1
What's going on?!
Atlas

alerting
api
api
Central
Event
Gateway
Paging
Service
Amazon
SES
CORE
Agent
Other
Teams
Agent
CORE
Agent
Alert Systems
Central
Event
Gateway
Parse raw alerts, match application to owner
Add image captures and links to related graphs for easy mobile use
Send to the right service based on priority
Register the event in Chronos, the timeline application
Correlate low priority alerts and generate new high priority alerts
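
The last bullet, correlating low priority alerts into a high priority one, could be approached in many ways. The sketch below assumes the simplest rule: if an application raises a threshold number of low priority alerts within a time window, emit one high priority alert. The function name, the `(timestamp, app, priority)` tuple shape, and the default window and threshold are all hypothetical.

```python
from collections import defaultdict


def correlate(alerts, window=300, threshold=3):
    """Escalate apps with `threshold` low priority alerts inside `window` seconds.

    alerts: iterable of (timestamp_seconds, app, priority) tuples.
    Returns a list of (timestamp, app, "high") tuples, at most one per app.
    """
    by_app = defaultdict(list)
    for timestamp, app, priority in alerts:
        if priority == "low":
            by_app[app].append(timestamp)

    escalations = []
    for app, stamps in sorted(by_app.items()):
        stamps.sort()
        # Slide over each run of `threshold` consecutive alerts.
        for i in range(len(stamps) - threshold + 1):
            last = stamps[i + threshold - 1]
            if last - stamps[i] <= window:
                escalations.append((last, app, "high"))
                break
    return escalations
```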
Metrics in Production
796B daily metric points
Peaks at 1.4B / min
50% daily metric churn
What is a metric?
com.netflix.eds.nccp.successful.requests.uiversion.nccprt-authorization.devtypid-101.clver-PHL_0AB.uiver-UI_169_mid.geo-US
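
Reading the example above, the dot-separated segments that contain a hyphen look like key-value dimensions (`geo-US`, `devtypid-101`), and the rest form the base name. That interpretation is an assumption, not something the slide states. Under it, a metric string could be split like this (`parse_metric` is a hypothetical helper):

```python
def parse_metric(metric):
    """Split a dotted metric string into a base name and a dict of tags.

    Assumes segments of the form key-value are dimensions and every
    other segment belongs to the base name.
    """
    name_parts = []
    tags = {}
    for segment in metric.split("."):
        key, sep, value = segment.partition("-")
        if sep:
            tags[key] = value
        else:
            name_parts.append(segment)
    return ".".join(name_parts), tags
```

Applied to the metric above, this yields the base name `com.netflix.eds.nccp.successful.requests.uiversion` with tags such as `geo=US` and `devtypid=101`.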
How we built it
Built our own big data system
Based on S3 and EMR
Fewer copies, lower resolution, and slower retrieval based on age of data
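
"Lower resolution" for older data usually means rolling fine-grained points up into coarser buckets. The slides do not show how that is done here, so the following is only a sketch of one plausible approach, averaging points per bucket; `downsample` and its arguments are hypothetical.

```python
def downsample(points, step):
    """Average (timestamp_seconds, value) points into buckets `step` seconds wide.

    Each output timestamp is the start of its bucket.
    """
    buckets = {}
    for timestamp, value in points:
        start = timestamp - timestamp % step
        buckets.setdefault(start, []).append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]
```

For example, three one-minute points inside the same five-minute bucket collapse into a single averaged point, cutting storage for old data at the cost of detail.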
Self Serve is the Key
Developers choose what metrics to submit
What graphs they put on their dashboards
What to alert on
Example Alert Config
Atlas
When something breaks...
Breakdown of an outage
Is something wrong? Alerting
Where is the problem? Telemetry and Dashboards
What changed? ???
What changed? Change control?
Change control, the good
Tells you what changed
Tells you what's about to change
Great for coordination when one change gates another change
Change control, the bad
It's manual
It expresses intent, not reality
It forces you to serialize your changes to an extent
What changed? Chronos
(Some of) Netflix is open source:
https://netflix.github.io
Just a quick reminder...
Netflix is hiring!
If you like what you see here,
feel free to reach out!
Questions?
Getting in touch
Email: jedberg@{gmail,netflix}.com
Twitter: @jedberg
Web: www.jedberg.net
Facebook: facebook.com/jedberg
Linkedin: www.linkedin.com/in/jedberg
