Learning about NetflixOSS
For Oct 2013 @TriangleDevops

Andrew Spyker
@aspyker
Some content from @ma4jpb
Agenda
• How did I get here?
• Netflix and Netflix OSS platform overview
• Runtime components
• Management components
• Build components
• Automated test and cleanliness components
2
About me …
• IBM STSM of Performance Architect and Strategy
• Eleven years in performance in WebSphere
– Led the App Server Performance team for years
– Small sabbatical focused on IBM XML technology
– Work in Emerging Technology Institute and CTO Office
– Starting to look at cloud service operations
• Email: aspyker@us.ibm.com
– Blog: http://ispyker.blogspot.com/
– Linkedin: http://www.linkedin.com/in/aspyker
– Twitter: http://twitter.com/aspyker
– Github: http://www.github.com/aspyker
• Triangle dad who enjoys technology as well as running, wine and poker
3
Develop or maintain a service today?
• Develop – starting
• Maintain – starting
• More on this later ….
http://www.flickr.com/photos/stevendepolo/

4
What qualifies me to talk?
• My shirt?
• Of cloud prize ~ 25 nominees
– Personally
• Best example mash-up sample

– My IBM team
• Best portability enhancement

– More on this coming …
• http://techblog.netflix.com/2013/09/netflixoss-meetup-s1e4-cloud-prize.html
5
Seriously, how did I get here?
• Plenty of experience with performance and scale on
standardized benchmarks (SPEC/TPC)
– Not representative of how to (web) scale
• Pinning, biggest monolithic DB “wins”, hand tuned for fixed size

– Out of date on modern architecture for mobile/cloud

• Created Acme Air
– http://bit.ly/acmeairblog

• Demonstrated that we could achieve (web) scale runs
– 4B+ Mobile/Browser requests/day
– With modern mobile and cloud best practices

6
Demo

7
What was shown?
• Peak performance and scale – You betcha!
• Operational visibility – Only during the run via
nmon collection and post-run visualization
• True operational visibility - nope
• Devops – nope
• HA and DR – nope
• Manual and automatic elastic scaling - nope
8
What next?
• Went looking for the industry best practices that exist around
devops and high availability at web scale
– Many have documented via research papers and on
highscalability.com – Google, Twitter, Facebook, Linkedin,
etc.

• Why Netflix?
– Documented not only on their tech blog, but also have
released working OSS on github
– Also, given dependence on Amazon, they are a clear
bellwether of web scale public cloud availability
9
Steps to NetflixOSS understanding
• Recoded Acme Air application to make use of NetflixOSS
runtime components
• Worked to implement a NetflixOSS devops and high
availability setup around Acme Air (on EC2) run at previous
levels of scale and performance
• Worked to port NetflixOSS runtime and devops/high
availability servers to IBM Cloud (SoftLayer) and RightScale

• Through public collaboration with Netflix technical team
– Google groups, github and meetups
10
Why?
• To prove that an advanced cloud high availability
and devops platform wasn’t “tied” to Amazon
• To understand how we can advance IBM cloud
platforms for our customers
• To understand how we can host our IBM
public cloud services better
11
Agenda
• How did I get here?
• Netflix and Netflix OSS platform overview
• Runtime components
• Management components
• Build components
• Automated test and cleanliness components
12
My view of Netflix goals
• As a business
– Be the best streaming media provider in the world
– Make best content deals based on real data/analysis

• Technology wise
– Have the most availability possible
– Measure all things by “stream starts per unit of time”
• Any dip in that relates back to the business

– Do this at web scale
13
Standing on the shoulders of giants
• Public Cloud (Amazon)
– When adding streaming, Netflix decided they
• Shouldn’t invest in building data centers worldwide
• Had to plan for the streaming business to be very big

– Embraced cloud architecture paying only for what they need

• Open Source
– Many parts of runtime depend on open source
• Linux, Apache Tomcat, Apache Cassandra, etc.

– Realized that Amazon wasn’t enough
• Started a cloud platform on top that would
eventually be open sourced - NetflixOSS
http://en.wikipedia.org/wiki/File:Andre_in_the_late_%2780s.jpg

14
Failure
• What is failing?
– Underlying IaaS problems
• Instances, racks, availability zones, regions

– Software issues
• Operating system, servers, application code

Inspiration

– Surrounding services
• Other application services, DNS, user registries, etc.

• How is a component failing?
– Fails and disappears altogether
– Intermittently fails
– Works, but is responding slowly
– Works, but is causing users a poor experience
15
Overview of Amazon EC2
• Amazon launches instances into availability zones
– Instances of various sizes (compute, storage, etc.)
• Organized into regions and availability zones
– Regions independent of each other
– Regions only connected over the Internet
– Regions contain availability zones
– Availability zones are isolated from each other
– Availability zones are connected with low-latency links
• This gives a high level of resilience to outages
– Unlikely to affect multiple availability zones or regions
• Amazon requires customers to be aware of this
topology to take advantage of its benefits within
their application

[Diagram: EC2 Region (US East) and EC2 Region (US West), each containing three availability zones, connected over the Internet]

16
NetflixOSS
• “Technical
indigestion as a
service” - @adrianco
• netflix.github.io
• 30+ OSS projects
• Expanding every day

17
NetflixOSS – for today
• For today
– Focus on mid tier web
app and micro service
servers
– Devops servers and tools
– Skipping some just for
simplicity

• For another time
– Big data
– Data tier
– Caching

18
Agenda
• How did I get here?
• Netflix and Netflix OSS platform overview
• Runtime components
• Management components
• Build components
• Automated test and cleanliness components
19
Acme Air As A Sample

[Diagram: ELB, Web App Front End (REST services), App Service (Authentication), Data Tier]

Greatly simplified …

20
Micro-services architecture
• Decompose system into isolated services that can be developed
separately
• Why?
– They can fail independently vs. fail together monolithically
– They can be developed and released with different velocities by
different teams

• To show this we created separate “auth service” for Acme Air
• In a typical customer facing application any single front end
invocation could spawn 20-30 calls to services and data sources

21
How do services advertise themselves?
• Upon web app startup, Karyon server is started
– Karyon will configure (via Archaius) the application
– Karyon will register the location of the instance with Eureka
• Others can know of the existence of the service
• Lease based so instances continue to check in updating list of available instances

– Karyon will also expose a JMX console, healthcheck URL
• Devops can change things about the service via JMX
• The system can monitor the health of the instance
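
The lease idea above can be sketched in a few lines. This is a toy in-memory registry, not the Eureka API: the class name `LeaseRegistry`, its methods and the explicit `nowMillis` clock parameter are all assumptions made for illustration. An instance that stops sending heartbeats simply drops out of lookups once its lease runs out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy lease-based registry. Hypothetical names; this is not the Eureka API. */
final class LeaseRegistry {
    private final long leaseMillis;
    // service name -> (instance address -> time in ms at which its lease expires)
    private final Map<String, Map<String, Long>> leases = new HashMap<>();

    LeaseRegistry(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    /** Register or renew: either way the lease now runs until nowMillis + leaseMillis. */
    void heartbeat(String service, String address, long nowMillis) {
        leases.computeIfAbsent(service, s -> new TreeMap<>())
              .put(address, nowMillis + leaseMillis);
    }

    /** Addresses of the service whose lease has not expired, in sorted order. */
    List<String> lookup(String service, long nowMillis) {
        Map<String, Long> instances =
                leases.getOrDefault(service, Collections.<String, Long>emptyMap());
        List<String> live = new ArrayList<>();
        for (Map.Entry<String, Long> e : instances.entrySet()) {
            if (e.getValue() > nowMillis) {
                live.add(e.getKey());
            }
        }
        return live;
    }
}
```

Passing the clock in as a parameter keeps the sketch deterministic; a real registry would read the system clock and also replicate state between servers.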

[Diagram: App Service (Authentication), running Karyon on Tomcat, registers its name, port, IP address and healthcheck URL with the Eureka server(s). Configuration comes from config.properties, auth-service.properties or remote Archaius stores]
22
How do consumers find services?
• Service consumers query Eureka at startup and
periodically to determine location of dependencies
– Can query based on availability zone and cross
availability zone
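
One way a consumer might use zone information is sketched below. The policy shown (prefer instances in the caller's own availability zone, fall back to all zones when none are local) is an assumption for illustration, as are the class and method names; the slide only states that both kinds of query are possible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy zone-aware instance selection. Hypothetical names; not the Eureka client API. */
final class ZoneAwareLookup {
    /**
     * instanceZones maps instance address -> availability zone.
     * Returns same-zone addresses if there are any, otherwise every address.
     */
    static List<String> candidates(Map<String, String> instanceZones, String myZone) {
        List<String> sameZone = new ArrayList<>();
        List<String> all = new ArrayList<>();
        // Copy into a TreeMap so the result order is stable (sorted by address).
        for (Map.Entry<String, String> e : new TreeMap<>(instanceZones).entrySet()) {
            all.add(e.getKey());
            if (e.getValue().equals(myZone)) {
                sameZone.add(e.getKey());
            }
        }
        return sameZone.isEmpty() ? all : sameZone;
    }
}
```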
[Diagram: Web App Front End (REST services), with the Eureka client on Tomcat, asks the Eureka server(s): what “auth-service” instances exist?]

23
Demo

24
How does the consumer call the service?
• Protocol implementations have Eureka-aware load balancing support built in
– In client load balancing -- does not require separate LB tier

• Ribbon – REST client
– Pluggable load balancing scheme
– Built in failure recovery support (retry next server, mark instance as failing, etc.)

• Other Eureka-enabled clients – memcached (EVCache), Astyanax coming
(Priam and Cassandra)
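
A minimal sketch of in-client load balancing with "retry next server" and "mark instance as failing" follows. It is plain Java rather than the Ribbon API; `RoundRobinClient` and its `call` method are hypothetical names, and round-robin is just one scheme that could be plugged in.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Toy client-side load balancer. Hypothetical names; this is not the Ribbon API. */
final class RoundRobinClient {
    private final List<String> servers;
    private final Set<String> markedDown = new HashSet<>();
    private int next = 0;

    RoundRobinClient(List<String> servers) {
        this.servers = servers;
    }

    /** Tries each server at most once, in round-robin order, skipping servers marked down. */
    <T> T call(Function<String, T> request) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < servers.size(); attempt++) {
            String server = servers.get(next);
            next = (next + 1) % servers.size();
            if (markedDown.contains(server)) {
                continue;
            }
            try {
                return request.apply(server);
            } catch (RuntimeException e) {
                markedDown.add(server); // mark instance as failing
                last = e;               // then retry on the next server
            }
        }
        throw new IllegalStateException("no healthy server available", last);
    }
}
```

Because the server list lives in the client, no separate load balancer tier sits between the front end and the auth service. A fuller version would refresh the list from the registry and let marked-down servers recover.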
[Diagram: Web App Front End (REST services) calls “auth-service” through the Ribbon REST client and Eureka client, which balance across the App Service (Authentication) instances]


25
How to deploy this with HA?
Instances?
• Deploy across AZs
• Using AutoScalingGroups in
EC2 managed by Asgard
– ASG manages recovery

Eureka?
• DNS and Elastic IP trickery
• Deployed across AZs
• For clients to find eureka servers
– DNS TXT record for domain lists AZ TXT records
– AZ TXT records have list of Eureka servers
• For new eureka servers
– Look for list of eureka servers IP’s for the AZ it’s coming up in
– Look for unassigned elastic IP’s, grab one and assign it to itself
– Sync with other already assigned IP’s that likely are hosting Eureka server instances
• Simpler configurations with less HA are available
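
The two-level TXT lookup for clients can be sketched as below. The DNS layer is replaced by an in-memory map, and the record names (`txt.us-east-1.example.com` and so on) are made up for the example; the exact naming scheme Eureka expects is not shown on the slide, so treat this as an illustration of the idea only.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Toy two-level TXT record resolution. Hypothetical names and record layout. */
final class EurekaDnsBootstrap {
    /**
     * txt stands in for DNS: record name -> entries of that TXT record.
     * The domain record lists one record name per AZ; each AZ record lists Eureka servers.
     */
    static List<String> serversForZone(Map<String, List<String>> txt,
                                       String domainRecord, String zone) {
        List<String> zoneRecords =
                txt.getOrDefault(domainRecord, Collections.<String>emptyList());
        for (String zoneRecord : zoneRecords) {
            // Assumption: AZ record names start with the zone name, e.g. "us-east-1a."
            if (zoneRecord.startsWith(zone + ".")) {
                return txt.getOrDefault(zoneRecord, Collections.<String>emptyList());
            }
        }
        return Collections.emptyList();
    }
}
```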
26
Protect yourself from unhealthy services
• Wrap all calls to services with Hystrix command pattern
– Hystrix implements circuit breaker pattern
– Executes command using semaphore or separate thread
pool to guarantee return within finite time to caller
– If an unhealthy service is detected, start to call fallback
implementation (broken circuit) and periodically check if
main implementation works (reset circuit)
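
The broken circuit and reset behaviour can be sketched without the Hystrix library. `SimpleCircuitBreaker`, its threshold and retry-window parameters, and the explicit `nowMillis` clock are assumptions for illustration; the sketch also leaves out the semaphore or thread pool isolation that bounds how long a call may take.

```java
import java.util.function.Supplier;

/** Toy circuit breaker. Hypothetical names; this is not the Hystrix API. */
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long retryAfterMillis;
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAtMillis = 0;

    SimpleCircuitBreaker(int failureThreshold, long retryAfterMillis) {
        this.failureThreshold = failureThreshold;
        this.retryAfterMillis = retryAfterMillis;
    }

    <T> T execute(Supplier<T> primary, Supplier<T> fallback, long nowMillis) {
        // Broken circuit: skip the primary entirely until the retry window has elapsed.
        if (open && nowMillis - openedAtMillis < retryAfterMillis) {
            return fallback.get();
        }
        try {
            T result = primary.get(); // normal call, or a trial call after the window
            consecutiveFailures = 0;
            open = false;             // success resets the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (open || consecutiveFailures >= failureThreshold) {
                open = true;          // trip, or stay tripped after a failed trial
                openedAtMillis = nowMillis;
            }
            return fallback.get();
        }
    }

    boolean isOpen() {
        return open;
    }
}
```

While the circuit is open the unhealthy service receives no traffic from this caller, which gives it room to recover and keeps the caller's own threads free.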

[Diagram: Web App Front End (REST services) calls “auth-service” through Hystrix, which executes the auth-service call via the Ribbon REST client against the App Service (Authentication) instances, or uses the fallback implementation]
27
Does Hystrix do more?
• Main reason for Hystrix is to
protect yourself from
dependencies, but …
• Once you have a layer of
indirection take advantage of it,
Hystrix can provide
– Caching
– Visualization
• Aggregated via Turbine

– Request collapsing

• Programming models
– Sync, Async, Reactive (RxJava)
28
Agenda
• How did I get here?
• Netflix and Netflix OSS platform overview
• Runtime components
• Management components

• Build components
• Automated test and cleanliness components
29
Ability to reconfigure - Archaius
• Using dynamic properties, can
easily change properties across
cluster of applications, either
– NetflixOSS named props
• Hystrix timeouts for example
– Custom dynamic props
• High throughput achieved by
polling approach
• HA of configuration source
dependent on what source you
use
– HTTP server, database, etc.

[Diagram: application configuration hierarchy with labels Runtime, URL, JMX, Karyon Console, Persisted DB, Application Props, Libraries, Container]

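import com.netflix.config.DynamicIntProperty;
import com.netflix.config.DynamicPropertyFactory;

// DEFAULT_VALUE is an application-defined int used when "myProperty" is not set in any source
DynamicIntProperty prop =
DynamicPropertyFactory.getInstance().getIntProperty("myProperty", DEFAULT_VALUE);
int value = prop.get(); // value will change over time based on configuration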

30
ASGard
[Diagram: Asgard tells EC2 to start these instances and keep this many instances running. EC2 Region (US East) with three availability zones, each running Web App (REST Services) and App Service (Authentication) instances]

• Asgard is the missing EC2 console for AutoScalingGroup mgmt.
– EC2 only has CLI for ASG management
31
Asgard creates an “application”
• Enforces common practices for deploying code
– Common approach to linking auto scaling groups to launch configs,
ELB’s, security groups, scaling policies and AMIs

• Adds missing concept to the EC2 domain model – “application”
– Extends clustering to applications vs. AMI’s

• Example
– Application – app1
– Cluster – app1-env
– Autoscaling group version n – app1-env-v009
– Autoscaling group version n+1 – app1-env-v010
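
The naming convention in the example can be expressed as a small helper. This sketch is derived only from the `app1-env-v009` pattern shown above; `AsgNames` and its methods are hypothetical, and Asgard's own rules (unversioned first groups, wrap-around and so on) are not modelled.

```java
/** Toy helper for the cluster/version naming pattern. Hypothetical; not Asgard code. */
final class AsgNames {
    /** Cluster part of an ASG name: "app1-env-v009" -> "app1-env". */
    static String cluster(String asgName) {
        return asgName.substring(0, versionIndex(asgName));
    }

    /** Next ASG name in the same cluster: "app1-env-v009" -> "app1-env-v010". */
    static String nextVersion(String asgName) {
        int i = versionIndex(asgName);
        int version = Integer.parseInt(asgName.substring(i + 2));
        return String.format("%s-v%03d", asgName.substring(0, i), version + 1);
    }

    private static int versionIndex(String asgName) {
        int i = asgName.lastIndexOf("-v");
        if (i < 0) {
            throw new IllegalArgumentException("no -vNNN suffix in " + asgName);
        }
        return i;
    }
}
```

Keeping version n and version n+1 as separate groups in one cluster is what makes fast rollback possible: traffic can move back to the old group while it still exists.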

32
Asgard devops procedures
• Fast rollback
• Canary testing
• Red/Black pushes
• More through REST interfaces
– Adhoc processes but enforced through Asgard model

• More coming using Glisten and Amazon SWF

33
Demo

34
Augmenting the ELB tier - Zuul
• Zuul adds devops support in the front tier routing
– Stress testing (squeeze testing)
– Canary testing
– Dynamic routing
– Load Shedding
– Debugging
• And some common function
– Authentication
– Security
– Static response handling
– Multi-region resiliency (DR for ELB tier)
– Insight
• Through dynamically deployable filters (written in Groovy)
• Eureka aware using Ribbon and Archaius, as shown in the runtime section

[Diagram: Amazon ELB in front of Zuul instances running filters, which route to edge services]
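
The filter idea can be sketched as an ordered chain that accepts new filters at runtime. Real Zuul filters are Groovy classes loaded by Zuul itself; the Java `FilterChain` below, its `Filter` shape and the request-as-map representation are simplifications invented for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Toy ordered filter chain. Hypothetical names; this is not the Zuul API. */
final class FilterChain {
    /** A filter: an order, a condition for when it applies, and an action on the request. */
    static final class Filter {
        final int order;
        final Predicate<Map<String, String>> shouldFilter;
        final Consumer<Map<String, String>> run;

        Filter(int order,
               Predicate<Map<String, String>> shouldFilter,
               Consumer<Map<String, String>> run) {
            this.order = order;
            this.shouldFilter = shouldFilter;
            this.run = run;
        }
    }

    private final List<Filter> filters = new ArrayList<>();

    /** Filters can be added while the chain is in use; lower order runs first. */
    synchronized void deploy(Filter filter) {
        filters.add(filter);
        filters.sort(Comparator.comparingInt(f -> f.order));
    }

    /** Runs every applicable filter, in order, against the request attributes. */
    synchronized void process(Map<String, String> request) {
        for (Filter f : filters) {
            if (f.shouldFilter.test(request)) {
                f.run.accept(request);
            }
        }
    }
}
```

For example, a canary filter with a low order could set a route for flagged requests, and a default routing filter with a higher order could fill in the route only when none has been chosen yet.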
35
Monitoring - Servo
• Annotation based publishing through JMX of
application metrics
• Filters, Observers, and Pollers to publish metrics
– Can export metrics to CloudWatch and other monitors

• The entire Netflix monitoring infrastructure
hasn’t been open sourced due to complexity and
priority

36
A note on the next three projects
• I haven’t personally worked with the projects
• Given the audience, I included as I believe
they will be of interest

37
Edda
• Polls Amazon config and stores the data in a
queryable database
• Provides a searchable view of Amazon
deployments
– Searchable in ways not possible from Amazon API’s

• Provides a historical view
– For correlation of problems to changes
– Likely less of an issue in clouds that expose all changes
38
Ice
• Cloud spend and usage analytics
• Communicates with billing API to give
bird's-eye view of cloud spend with drill
down to region, availability zone, and
service team through application groups
• Watches on-demand, used and unused
reserved instances and instance sizes to
help optimize
• Not point in time
– Shows trends to help predict future
optimizations
39
Denominator
• Java Library and CLI for cross DNS configuration
• Allows for common, quicker (than using various
DNS provider UI) and automated DNS updates
• Plugins have been developed by various DNS
providers

40
Agenda
• How did I get here?
• Netflix and Netflix OSS platform overview
• Runtime components
• Management components

• Build components
• Automated test and cleanliness components
41
Get baked!
• Caution: Flame/troll bait ahead!!
• Netflix takes the approach of baking images as part of build such that
– Instance boot-up doesn’t depend on outside servers
– Instance boot-up only starts servers already set to run
– New code = new instances (never update instances in place)

• Why?
– Critical when launching hundreds of servers at a time
– Goal to reduce the failure points in places where dynamic system
configuration doesn’t provide value
– Speed of elastic scaling, boot and go
– Discourages ad hoc changes to server instances

• Criticism – “Netflix is ruining the cloud”
– Overhead of AMI’s for every code version
– Ties to Amazon AMI’s (would this work for containers – I think yes)

42
AMInator
• Starting image/volume
– Foundational image created (maybe via loopback),
base AMI with common software created/tested
independently

• Aminator running – Bakery
– Bakery obtains a known EBS volume of the base
image from a pool
– Bakery mounts volume and provisions the
application (apt/deb or yum/rpm)
– Bakery snapshots and registers snapshot

• Recent work to add other provisioning such as chef
as plugins
• I have used hand built AMI’s thus far, but blog
states developers can go through CI builds and
have running test instances within 15 minutes of
code being checked in

43
Agenda
• How did I get here?
• Netflix and Netflix OSS platform overview
• Runtime components
• Management components
• Build components

• Automated test and cleanliness components
44
The Simian Army
• A bunch of automated “monkeys” that
perform automated system administration
tasks
• Anything that is done by a human more than
once can and should be automated
• Absolutely necessary at web scale
45
Good Monkeys
• Janitor Monkey
– Somewhat a mitigation for baking approach
– Will mark and sweep unused resources
(instances, volumes, snapshots, ASG’s,
launch configs, images, etc.)
– Owners notified, then removed

• Conformity Monkey
– Check instances are conforming to rules
around security, ASG/ELB, age, status/health
check, etc.

http://www.flickr.com/photos/sonofgroucho/5852049290
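
Janitor Monkey's mark, notify, then sweep cycle can be sketched as follows. The `Janitor` class, its retention period and the way "unused" resources are passed in are assumptions for illustration; this is not Simian Army code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy mark-and-sweep janitor. Hypothetical names; this is not Simian Army code. */
final class Janitor {
    private final long retentionMillis;
    private final Map<String, Long> markedAt = new HashMap<>(); // resource id -> time marked
    final List<String> notifications = new ArrayList<>();

    Janitor(long retentionMillis) {
        this.retentionMillis = retentionMillis;
    }

    /**
     * One pass over the resources currently seen as unused.
     * Newly unused resources are marked and their owner notified; resources still
     * unused once the retention period has passed are returned for deletion.
     */
    List<String> run(Set<String> unusedNow, long nowMillis) {
        markedAt.keySet().retainAll(unusedNow); // back in use: clear the mark
        List<String> toDelete = new ArrayList<>();
        for (String id : new TreeSet<>(unusedNow)) {
            Long marked = markedAt.get(id);
            if (marked == null) {
                markedAt.put(id, nowMillis);
                notifications.add("owner of " + id + ": scheduled for cleanup");
            } else if (nowMillis - marked >= retentionMillis) {
                toDelete.add(id);
                markedAt.remove(id);
            }
        }
        return toDelete;
    }
}
```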

46
Back to high availability
• Failure is inevitable. Don’t try to avoid it!
• How do you know if your backup is good?
– Try to restore from your backup every so often
– Better to ensure backup works before you have a crashed
system and find out your backup is broken

• How do you know if your system is HA?
– Try to force failures every so often
– Better to force those failures during office hours
– Better to ensure HA before you have a down system and
angry users
– Best to learn from failures and add automated tests
47
Bad Monkeys
• Open Sourced – Chaos Monkey
– Used to randomly terminate instances
– Now block network, burn cpu, kill
processes, fail amazon api, fail dns, fail
dynamo, fail s3, introduce network
errors/latency, detach volumes, fill disk,
burn I/O
http://www.flickr.com/photos/27261720@N00/132750805

• Not yet open sourced
– Chaos Gorilla
• Kill all instances in an availability zone

– Chaos Kong
• Kill all instances in an entire region

– Latency Monkey
• Introduce latency into service calls directly
(ribbon server side)
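
The original Chaos Monkey behaviour, randomly terminating instances, comes down to a random pick per group. The sketch below only chooses a victim and leaves termination to the caller; `ChaosPicker` and its probability parameter are hypothetical and not taken from the Simian Army code.

```java
import java.util.List;
import java.util.Optional;
import java.util.Random;

/** Toy victim selection for one instance group. Hypothetical names. */
final class ChaosPicker {
    private final Random random;
    private final double probability; // chance that a run terminates anything in the group

    ChaosPicker(Random random, double probability) {
        this.random = random;
        this.probability = probability;
    }

    /** Returns the instance to terminate in this run, or empty if the group is spared. */
    Optional<String> pickVictim(List<String> instances) {
        if (instances.isEmpty() || random.nextDouble() >= probability) {
            return Optional.empty();
        }
        return Optional.of(instances.get(random.nextInt(instances.size())));
    }
}
```

Injecting the `Random` makes runs reproducible in tests, and a schedule restricted to office hours would sit around this call rather than inside it.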
48
Agenda
• Blah, blah, blah
• How can I learn more?
• How do I play with this?
• Let’s write some code!
49
Want to play?
• NetflixOSS blog and github
– http://techblog.netflix.com
– http://github.com/Netflix

• Acme Air, NetflixOSS AMI’s
– Try Asgard/Eureka with a real application
– http://bit.ly/aa-AMIs

• See what we ported to IBM Cloud (video)
– http://bit.ly/noss-sl-blog

• Fork and submit pull requests to Acme Air
– http://github.com/aspyker/acmeair-netflix

50

NetflixOSS for Triangle Devops Oct 2013

  • 1. Learning about NetflixOSS For Oct 2013 @TriangleDevops Andrew Spyker @aspyker Some content from @ma4jpb
  • 2. Agenda • How did I get here? • • • • • Netflix and Netflix OSS platform overview Runtime components Management components Build components Automated test and cleanliness components 2
  • 3. About me … • IBM STSM of Performance Architect and Strategy • Eleven years in performance in WebSphere – – – – Led the App Server Performance team for years Small sabbatical focused on IBM XML technology Work in Emerging Technology Institute and CTO Office Starting to look at cloud service operations • Email: aspyker@us.ibm.com – – – – Blog: http://ispyker.blogspot.com/ Linkedin: http://www.linkedin.com/in/aspyker Twitter: http://twitter.com/aspyker Github: http://www.github.com/aspyker • Triangle dad that enjoys technology as well as running, wine and poker 3
  • 4. Develop or maintain a service today? • Develop – starting • Maintain – starting • More on this later …. http://www.flickr.com/photos/stevendepolo/ 4
  • 5. What qualifies me to talk? • My shirt? • Of cloud prize ~ 25 nominees – Personally • Best example mash-up sample – My IBM team • Best portability enhancement – More on this coming … • http://techblog.netflix.com/2013/09/netflixoss-meetup-s1e4-cloud-prize.html 5
  • 6. Seriously, how did I get here? • Plenty of experience with performance and scale on standardized benchmarks (SPEC/TPC) – Non representative of how to (web) scale • Pinning, biggest monolithic DB “wins”, hand tuned for fixed size – Out of date on modern architecture for mobile/cloud • Created Acme Air – http://bit.ly/acmeairblog • Demonstrated that we could achieve (web) scale runs – 4B+ Mobile/Browser request/day – With modern mobile and cloud best practices 6
  • 8. What was shown? • Peak performance and scale – You betcha! • Operational visibility – Only during the run via nmon collection and post-run visualization • • • • True operational visibility - nope Devops – nope HA and DR – nope Manual and automatic elastic scaling - nope 8
  • 9. What next? • Went looking for existing industry best practices around devops and high availability at web scale – Many have documented theirs via research papers and on highscalability.com – Google, Twitter, Facebook, Linkedin, etc. • Why Netflix? – Documented not only on their tech blog, but also released as working OSS on github – Also, given their dependence on Amazon, they are a clear bellwether of web scale public cloud availability
  • 10. Steps to NetflixOSS understanding • Recoded Acme Air application to make use of NetflixOSS runtime components • Worked to implement a NetflixOSS devops and high availability setup around Acme Air (on EC2) run at previous levels of scale and performance • Worked to port NetflixOSS runtime and devops/high availability servers to IBM Cloud (SoftLayer) and RightScale • Through public collaboration with Netflix technical team – Google groups, github and meetups 10
  • 11. Why? • To prove that advanced cloud high availability and devops platform wasn’t “tied” to Amazon • To understand how we can advance IBM cloud platforms for our customers • To understand how we can host our IBM public cloud services better 11
  • 12. Agenda • How did I get here? • Netflix and Netflix OSS platform overview • Runtime components • Management components • Build components • Automated test and cleanliness components
  • 13. My view of Netflix goals • As a business – Be the best streaming media provider in the world – Make best content deals based on real data/analysis • Technology wise – Have the most availability possible – Measure all things by “stream starts per unit of time” • Any dip in that relates back to the business – Do this at web scale 13
  • 14. Standing on the shoulders of giants • Public Cloud (Amazon) – When adding streaming, Netflix decided they • Shouldn’t invest in building data centers worldwide • Had to plan for the streaming business to be very big – Embraced cloud architecture paying only for what they need • Open Source – Many parts of runtime depend on open source • Linux, Apache Tomcat, Apache Cassandra, etc. – Realized that Amazon wasn’t enough • Started a cloud platform on top that would eventually be open sourced - NetflixOSS Image: http://en.wikipedia.org/wiki/File:Andre_in_the_late_%2780s.jpg
  • 15. Failure • What is failing? – Underlying IaaS problems • Instances, racks, availability zones, regions – Software issues • Operating system, servers, application code – Surrounding services • Other application services, DNS, user registries, etc. • How is a component failing? – Fails and disappears altogether – Intermittently fails – Works, but is responding slowly – Works, but is causing users a poor experience
  • 16. Overview of Amazon EC2 • Amazon launches instances into availability zones – Instances of various sizes (compute, storage, etc.) • Organized into regions and availability zones – Regions independent of each other – Regions only connected over the Internet – Regions contain availability zones – Availability zones are isolated from each other – Availability zones are connected w/ low-latency links • This gives a high level of resilience to outages – Unlikely to affect multiple availability zones or regions • Amazon requires customers to be aware of this topology to take advantage of its benefits within their application [Diagram: EC2 Region (US East) and EC2 Region (US West), each with three availability zones, connected over the Internet]
  • 17. NetflixOSS • “Technical indigestion as a service” - @adrianco • netflix.github.io • 30+ OSS projects • Expanding every day 17
  • 18. NetflixOSS – for today • For today – Focus on mid tier web app and micro service servers – Devops servers and tools – Skipping some just for simplicity • For another time – Big data – Data tier – Caching 18
  • 19. Agenda • How did I get here? • Netflix and Netflix OSS platform overview • Runtime components • Management components • Build components • Automated test and cleanliness components 19
  • 20. Acme Air As A Sample ELB Web App Front End (REST services) App Service (Authentication) Data Tier Greatly simplified … 20
  • 21. Micro-services architecture • Decompose system into isolated services that can be developed separately • Why? – They can fail independently vs. fail together monolithically – They can be developed and released with different velocities by different teams • To show this we created a separate “auth service” for Acme Air • In a typical customer facing application any single front end invocation could spawn 20-30 calls to services and data sources
  • 22. How do services advertise themselves? • Upon web app startup, Karyon server is started – Karyon will configure (via Archaius) the application – Karyon will register the location of the instance with Eureka • Others can know of the existence of the service • Lease based so instances continue to check in updating list of available instances – Karyon will also expose a JMX console, healthcheck URL • Devops can change things about the service via JMX • The system can monitor the health of the instance [Diagram: App Service (Authentication) running Karyon on Tomcat registers its name, port, IP address and healthcheck URL with the Eureka server(s); configuration comes from config.properties, auth-service.properties or remote Archaius stores]
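The lease-based check-in described on this slide can be sketched in plain Java. This is only an illustration of the idea, not the Eureka or Karyon API: the class `LeaseRegistry`, its method names and the explicit `nowMillis` parameter are all hypothetical, chosen so the expiry behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Eureka API.
class LeaseRegistry {
    private final long leaseMillis;
    // instance id -> time of last registration or heartbeat
    private final Map<String, Long> lastSeen = new HashMap<>();

    LeaseRegistry(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    // Called once at startup (the role Karyon plays for the application).
    void register(String instanceId, long nowMillis) {
        lastSeen.put(instanceId, nowMillis);
    }

    // Called periodically by a live instance; returns false if the instance is unknown.
    boolean renew(String instanceId, long nowMillis) {
        if (!lastSeen.containsKey(instanceId)) {
            return false;
        }
        lastSeen.put(instanceId, nowMillis);
        return true;
    }

    // Instances whose lease has not expired at the given time, sorted by id.
    List<String> available(long nowMillis) {
        List<String> live = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() <= leaseMillis) {
                live.add(e.getKey());
            }
        }
        Collections.sort(live);
        return live;
    }
}
```

An instance that stops renewing simply drops out of `available(...)` once its lease runs out, which is why consumers never need an explicit "I am going away" message from a crashed server.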
  • 23. How do consumers find services? • Service consumers query Eureka at startup and periodically to determine location of dependencies – Can query based on availability zone and cross availability zone [Diagram: Web App Front End (REST services) on Tomcat uses the Eureka client to ask the Eureka server(s) which “auth-service” instances exist]
  • 25. How does the consumer call the service? • Protocol implementations have Eureka-aware load balancing support built in – In-client load balancing; does not require a separate LB tier • Ribbon – REST client – Pluggable load balancing scheme – Built in failure recovery support (retry next server, mark instance as failing, etc.) • Other Eureka enabled clients – memcached (EVCache), Astyanax coming (Priam and Cassandra) [Diagram: Web App Front End (REST services) calls “auth-service” through the Ribbon REST client and Eureka client, which choose among several App Service (Authentication) instances]
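A minimal sketch of in-client load balancing with "retry next server", assuming a plain round-robin scheme. This is not the Ribbon API: `RoundRobinClient` and its `call` method are hypothetical names, and the server list would in practice come from the Eureka client rather than a fixed list.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical illustration only; not the Ribbon API.
class RoundRobinClient {
    private final List<String> servers;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinClient(List<String> servers) {
        this.servers = servers;
    }

    // Tries each server at most once, starting from the current round-robin position.
    <T> T call(Function<String, T> request) {
        RuntimeException last = new IllegalStateException("no servers configured");
        for (int attempt = 0; attempt < servers.size(); attempt++) {
            int index = Math.floorMod(next.getAndIncrement(), servers.size());
            String server = servers.get(index);
            try {
                return request.apply(server);
            } catch (RuntimeException e) {
                last = e; // this server failed: retry on the next one
            }
        }
        throw last; // every server failed (or the list was empty)
    }
}
```

Because the balancing decision is made inside the caller, no separate load balancer tier sits between the web app and the auth service.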
  • 26. How to deploy this with HA? Instances? • Deploy across AZs • Using AutoScalingGroups in EC2 managed by Asgard Eureka? • DNS and Elastic IP trickery • Deployed across AZs – ASG manages recovery • For clients to find Eureka servers – DNS TXT record for domain lists AZ TXT records – AZ TXT records have list of Eureka servers • For new Eureka servers – Look for list of Eureka server IP’s for the AZ it’s coming up in – Look for unassigned elastic IP’s, grab one and assign it to itself – Sync with other already assigned IP’s that likely are hosting Eureka server instances • Simpler configurations with less HA are available
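The two-step TXT lookup for clients can be sketched as follows. DNS is simulated with an in-memory `Map`, and every record name is made up for illustration; `EurekaDnsDiscovery` is a hypothetical class, not part of the Eureka client. Listing the caller's own AZ first is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; DNS is simulated and record names are invented.
class EurekaDnsDiscovery {
    private final Map<String, List<String>> txtRecords;

    EurekaDnsDiscovery(Map<String, List<String>> txtRecords) {
        this.txtRecords = txtRecords;
    }

    // Stand-in for a DNS TXT query; unknown names yield an empty list.
    private List<String> txt(String name) {
        List<String> values = txtRecords.get(name);
        return values == null ? new ArrayList<String>() : values;
    }

    // Step 1: the domain record lists one TXT record name per AZ.
    // Step 2: each AZ record lists the Eureka servers in that AZ.
    List<String> serversFor(String domainRecord, String ownZoneRecord) {
        List<String> servers = new ArrayList<>();
        List<String> zoneRecords = txt(domainRecord);
        // Assumption: servers in the caller's own AZ come first, others are fallbacks.
        if (zoneRecords.contains(ownZoneRecord)) {
            servers.addAll(txt(ownZoneRecord));
        }
        for (String zone : zoneRecords) {
            if (!zone.equals(ownZoneRecord)) {
                servers.addAll(txt(zone));
            }
        }
        return servers;
    }
}
```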
  • 27. Protect yourself from unhealthy services • Wrap all calls to services with Hystrix command pattern – Hystrix implements circuit breaker pattern – Executes command using semaphore or separate thread pool to guarantee return within finite time to caller – If an unhealthy service is detected, start to call fallback implementation (broken circuit) and periodically check if main implementation works (reset circuit) [Diagram: Web App Front End (REST services) executes the auth-service call inside Hystrix, which either calls “auth-service” through the Ribbon REST client to one of several App Service (Authentication) instances or runs the fallback implementation]
  • 28. Does Hystrix do more? • Main reason for Hystrix is to protect yourself from dependencies, but … • Once you have a layer of indirection, take advantage of it; Hystrix can provide – Caching – Visualization • Aggregated via Turbine – Request collapsing • Programming models – Sync, Async, Reactive (RxJava)
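A minimal circuit breaker sketch in plain Java, to make the broken circuit / reset circuit cycle concrete. It is not the Hystrix API and deliberately leaves out the thread pool or semaphore isolation that bounds call time; `SimpleCircuitBreaker`, the failure threshold and the retry window are hypothetical choices for illustration.

```java
import java.util.function.Supplier;

// Hypothetical illustration only; not the Hystrix API, and without timeout isolation.
class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long retryAfterMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;
    private boolean open = false;

    SimpleCircuitBreaker(int failureThreshold, long retryAfterMillis) {
        this.failureThreshold = failureThreshold;
        this.retryAfterMillis = retryAfterMillis;
    }

    <T> T execute(Supplier<T> primary, Supplier<T> fallback, long nowMillis) {
        // Broken circuit: skip the primary entirely until the retry window has elapsed.
        if (open && nowMillis - openedAt < retryAfterMillis) {
            return fallback.get();
        }
        try {
            T result = primary.get(); // normal call, or a trial call after the window
            consecutiveFailures = 0;
            open = false;             // reset circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (open || consecutiveFailures >= failureThreshold) {
                open = true;          // break (or keep broken) the circuit
                openedAt = nowMillis;
            }
            return fallback.get();
        }
    }

    boolean isOpen() {
        return open;
    }
}
```

While the circuit is open the unhealthy service receives no traffic at all from this caller, which gives it room to recover instead of being hammered by retries.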
  • 29. Agenda • How did I get here? • Netflix and Netflix OSS platform overview • Runtime components • Management components • Build components • Automated test and cleanliness components 29
  • 30. Ability to reconfigure - Archaius • Using dynamic properties, can easily change properties across cluster of applications, either – NetflixOSS named props • Hystrix timeouts for example – Custom dynamic props • High throughput achieved by polling approach • HA of configuration source dependent on what source you use – HTTP server, database, etc. [Diagram: configuration hierarchy with the labels Application, Runtime, URL, JMX, Karyon Console, Persisted DB, Application Props, Libraries, Container] DynamicIntProperty prop = DynamicPropertyFactory.getInstance().getIntProperty("myProperty", DEFAULT_VALUE); int value = prop.get(); // value will change over time based on configuration
  • 31. ASGard • Asgard is the missing EC2 console for AutoScalingGroup mgmt. – EC2 only has CLI for ASG management [Diagram: Asgard tells EC2 to start these instances and keep this many instances running; inside EC2 Region (US East), each of three availability zones runs Web App (REST Services) and App Service (Authentication) instances]
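Why polling gives high throughput can be shown with a small sketch: reads hit an in-memory cached value, and only the poller touches the configuration source. This is not how Archaius is implemented internally; `PolledIntProperty` and its `poll()` method are hypothetical, and a real system would run the poll on a background schedule rather than by hand.

```java
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration only; not the Archaius implementation.
class PolledIntProperty {
    private final String name;
    private final int defaultValue;
    private final Supplier<Map<String, String>> source;
    private volatile int cached;

    PolledIntProperty(String name, int defaultValue, Supplier<Map<String, String>> source) {
        this.name = name;
        this.defaultValue = defaultValue;
        this.source = source;
        this.cached = defaultValue;
    }

    // The only place that reads the (possibly slow, remote) configuration source.
    void poll() {
        String raw = source.get().get(name);
        try {
            cached = (raw == null) ? defaultValue : Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            // Malformed entry: keep the previously cached value.
        }
    }

    // Hot path: a cheap in-memory read with no I/O.
    int get() {
        return cached;
    }
}
```

The trade-off is staleness: a changed value becomes visible only after the next poll, so the poll interval bounds how quickly a cluster-wide change takes effect.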
  • 32. Asgard creates an “application” • Enforces common practices for deploying code – Common approach to linking auto scaling groups to launch configs, ELB’s, security groups, scaling policies and AMIs • Adds missing concept to the EC2 domain model – “application” – Extends clustering to applications vs. AMI’s • Example – Application – app1 – Cluster – app1-env – Autoscaling group version n – app1-env-v009 – Autoscaling group version n+1 – app1-env-v010
  • 33. Asgard devops procedures • Fast rollback • Canary testing • Red/Black pushes • More through REST interfaces – Adhoc processes but enforced through Asgard model • More coming using Glisten and Amazon SWF
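The naming example on this slide can be turned into a small helper that derives the cluster name and the next group name. Only the `app1-env-v009` shape comes from the slide; the class `AsgNames`, the `-v000` starting point for unversioned names and the lack of wraparound after `v999` are assumptions of this sketch, not Asgard's documented behaviour.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; only the app1-env-v009 name shape is taken from the slide.
class AsgNames {
    // cluster name, then "-v" and a three-digit version
    private static final Pattern VERSIONED = Pattern.compile("^(.+)-v(\\d{3})$");

    // "app1-env-v009" -> "app1-env"; names without a version are returned unchanged.
    static String clusterOf(String asgName) {
        Matcher m = VERSIONED.matcher(asgName);
        return m.matches() ? m.group(1) : asgName;
    }

    // "app1-env-v009" -> "app1-env-v010". Wraparound after v999 is not handled here.
    static String nextVersion(String asgName) {
        Matcher m = VERSIONED.matcher(asgName);
        if (!m.matches()) {
            return asgName + "-v000"; // assumption: first versioned group starts at v000
        }
        int next = Integer.parseInt(m.group(2)) + 1;
        return String.format("%s-v%03d", m.group(1), next);
    }
}
```

Keeping version n and version n+1 side by side under one cluster name is what makes the fast rollback and red/black pushes on the next slide possible: traffic moves between two groups rather than instances being updated in place.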
  • 35. Augmenting the ELB tier - Zuul • Zuul adds devops support in the front tier routing – Stress testing (squeeze testing) – Canary testing – Dynamic routing – Load Shedding – Debugging • And some common function – Authentication – Security – Static response handling – Multi-region resiliency (DR for ELB tier) – Insight • Through dynamically deployable filters (written in Groovy) • Eureka aware using Ribbon and Archaius, as shown in the runtime section [Diagram: Amazon ELB in front of several Zuul instances, each running filters, routing to edge services]
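A rough sketch of an ordered chain of filters that can be added while the edge tier is running. Java is used here for consistency with the rest of the deck even though the slide says Zuul's filters are written in Groovy, and the `EdgeFilter` interface and `FilterChain` class are hypothetical stand-ins, not the Zuul filter API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Zuul filter API.
interface EdgeFilter {
    int order();                                        // lower values run first
    boolean shouldFilter(Map<String, String> request);  // does this filter apply?
    void run(Map<String, String> request);              // mutate the routing context
}

class FilterChain {
    private final List<EdgeFilter> filters = new ArrayList<>();

    // Filters can be deployed at any time; the chain is kept sorted by order().
    synchronized void deploy(EdgeFilter filter) {
        filters.add(filter);
        filters.sort(Comparator.comparingInt(EdgeFilter::order));
    }

    // Runs every applicable filter against a copy of the request and returns the result.
    synchronized Map<String, String> route(Map<String, String> request) {
        Map<String, String> context = new HashMap<>(request);
        for (EdgeFilter filter : filters) {
            if (filter.shouldFilter(context)) {
                filter.run(context);
            }
        }
        return context;
    }
}
```

A canary test then becomes one more filter: deploy a filter that rewrites the target for a small slice of requests, and remove it when the test ends, without redeploying the edge servers.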
  • 36. Monitoring - Servo • Annotation based publishing through JMX of application metrics • Filters, Observers, and Pollers to publish metrics – Can export metrics to CloudWatch and other monitors • The entire Netflix monitoring infrastructure hasn’t been open sourced due to complexity and priority 36
  • 37. A note on the next three projects • I haven’t personally worked with the projects • Given the audience, I included as I believe they will be of interest 37
  • 38. Edda • Polls Amazon config and stores the data in a queryable database • Provides a searchable view of Amazon deployments – Searchable in ways not possible from Amazon API’s • Provides a historical view – For correlation of problems to changes – Likely less of an issue in clouds that expose all changes
  • 39. Ice • Cloud spend and usage analytics • Communicates with billing API to give a bird’s-eye view of cloud spend with drill down to region, availability zone, and service team through application groups • Watches on-demand, used and unused reserved instances and instance sizes to help optimize • Not point in time – Shows trends to help predict future optimizations
  • 40. Denominator • Java Library and CLI for cross DNS configuration • Allows for common, quicker (than using various DNS provider UI) and automated DNS updates • Plugins have been developed by various DNS providers 40
  • 41. Agenda • How did I get here? • Netflix and Netflix OSS platform overview • Runtime components • Management components • Build components • Automated test and cleanliness components
  • 42. Get baked! • Caution: Flame/troll bait ahead!! • Netflix takes the approach of baking images as part of build such that – Instance boot-up doesn’t depend on outside servers – Instance boot-up only starts servers already set to run – New code = new instances (never update instances in place) • Why? – Critical when launching hundreds of servers at a time – Goal to reduce the failure points in places where dynamic system configuration doesn’t provide value – Speed of elastic scaling, boot and go – Discourages ad hoc changes to server instances • Criticism – “Netflix is ruining the cloud” – Overhead of AMI’s for every code version – Ties to Amazon AMI’s (would this work for containers – I think yes) 42
  • 43. AMInator • Starting image/volume – Foundational image created (maybe via loopback), base AMI with common software created/tested independently • Aminator running – Bakery – Bakery obtains a known EBS volume of the base image from a pool – Bakery mounts volume and provisions the application (apt/deb or yum/rpm) – Bakery snapshots and registers snapshot • Recent work to add other provisioning such as chef as plugins • I have used hand built AMI’s thus far, but blog states developers can go through CI builds and have running test instances within 15 minutes of code being checked in 43
  • 44. Agenda • How did I get here? • Netflix and Netflix OSS platform overview • Runtime components • Management components • Build components • Automated test and cleanliness components
  • 45. The Simian Army • A bunch of automated “monkeys” that perform automated system administration tasks • Anything that is done by a human more than once can and should be automated • Absolutely necessary at web scale 45
  • 46. Good Monkeys • Janitor Monkey – Somewhat a mitigation for baking approach – Will mark and sweep unused resources (instances, volumes, snapshots, ASG’s, launch configs, images, etc.) – Owners notified, then removed • Conformity Monkey – Check instances are conforming to rules around security, ASG/ELB, age, status/health check, etc. Image: http://www.flickr.com/photos/sonofgroucho/5852049290
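The mark, notify, then sweep cycle can be sketched as two passes over a set of unused resource ids. This is an illustration under assumptions, not the Simian Army code: `Janitor`, the grace period and the idea that a resource is unmarked as soon as it is in use again are all choices made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the Simian Army code.
class Janitor {
    private final long graceMillis;
    // resource id -> time it was first marked as unused
    private final Map<String, Long> marked = new HashMap<>();

    Janitor(long graceMillis) {
        this.graceMillis = graceMillis;
    }

    // Mark phase: returns the resources newly marked in this pass,
    // which is the point where owners would be notified.
    List<String> mark(Set<String> unusedNow, long nowMillis) {
        marked.keySet().retainAll(unusedNow); // anything in use again is unmarked
        List<String> newlyMarked = new ArrayList<>();
        for (String id : unusedNow) {
            if (!marked.containsKey(id)) {
                marked.put(id, nowMillis);
                newlyMarked.add(id);
            }
        }
        return newlyMarked;
    }

    // Sweep phase: returns resources whose grace period has passed and forgets them.
    List<String> sweep(long nowMillis) {
        List<String> removed = new ArrayList<>();
        for (Map.Entry<String, Long> e : marked.entrySet()) {
            if (nowMillis - e.getValue() >= graceMillis) {
                removed.add(e.getKey());
            }
        }
        marked.keySet().removeAll(removed);
        return removed;
    }
}
```

The gap between mark and sweep is what gives owners a chance to react, which matters with the baking approach because every code version leaves an AMI and snapshot behind.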
  • 47. Back to high availability • Failure is inevitable. Don’t try to avoid it! • How do you know if your backup is good? – Try to restore from your backup every so often – Better to ensure backup works before you have a crashed system and find out your backup is broken • How do you know if your system is HA? – Try to force failures every so often – Better to force those failures during office hours – Better to ensure HA before you have a down system and angry users – Best to learn from failures and add automated tests 47
  • 48. Bad Monkeys • Open Sourced – Chaos Monkey – Used to randomly terminate instances – Now block network, burn CPU, kill processes, fail Amazon API, fail DNS, fail Dynamo, fail S3, introduce network errors/latency, detach volumes, fill disk, burn I/O • Not yet open sourced – Chaos Gorilla • Kill all instances in an availability zone – Chaos Kong • Kill all instances in an entire region – Latency Monkey • Introduce latency into service calls directly (ribbon server side) Image: http://www.flickr.com/photos/27261720@N00/132750805
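The original "randomly terminate instances" behaviour can be sketched as a victim picker. This is not the Chaos Monkey code: `ChaosPicker`, the per-group probability and the limit of one victim per group per run are assumptions made for this illustration, and the sketch only selects instances, it terminates nothing.

```java
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical illustration only; selects victims but does not terminate anything.
class ChaosPicker {
    private final Random random;

    ChaosPicker(Random random) {
        this.random = random;
    }

    // For each group, with the given probability, pick one random instance.
    // At most one victim per group keeps most of each group's capacity intact.
    Map<String, String> pickVictims(Map<String, List<String>> instancesByGroup, double probability) {
        Map<String, String> victims = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : instancesByGroup.entrySet()) {
            List<String> instances = e.getValue();
            if (instances.isEmpty()) {
                continue; // nothing to terminate in an empty group
            }
            if (random.nextDouble() < probability) {
                victims.put(e.getKey(), instances.get(random.nextInt(instances.size())));
            }
        }
        return victims;
    }
}
```

Running something like this during office hours, as the previous slide suggests, turns instance loss from a rare surprise into a routine event that the auto scaling groups and client-side retries have to absorb every day.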
  • 49. Agenda • Blah, blah, blah • How can I learn more? • How do I play with this? • Let’s write some code! 49
  • 50. Want to play? • NetflixOSS blog and github – http://techblog.netflix.com – http://github.com/Netflix • Acme Air, NetflixOSS AMI’s – Try Asgard/Eureka with a real application – http://bit.ly/aa-AMIs • See what we ported to IBM Cloud (video) – http://bit.ly/noss-sl-blog • Fork and submit pull requests to Acme Air – http://github.com/aspyker/acmeair-netflix 50