Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud
Coburn Watson
Manager, Cloud Performance, Netflix
Surge '13
Netflix, Inc.
• World's leading internet television network
• ~ 38 Million subscribers in 40+ countries
• Over a billion hours streamed per month
• Approximately 33% of all US Internet traffic at night
• Recent Notables
• Increased Originals catalog
• Large open source contribution
• OpenConnect (homegrown CDN)
2
About Me
• Manage Cloud Performance Engineering Team
• Sub-team of Cloud Solutions Organization
• Focus on performance since 2000
• Large-scale billing applications, eCommerce, datacenter mgmt., etc.
• Genentech, McKesson, Amdocs, Mercury Int., HP, etc.
• Passion for tackling performance at cloud-scale
• Looking for great performance engineers
• cwatson@netflix.com
3
Freedom and Responsibility
• Culture deck: a great read
• Good performers: 2x, Top performers: 10x
• What engineers dislike
• cumbersome processes
• deployment inefficiency
• restricted access
• restricted technical freedom
• lack of trust
• If removed…maximize:
• Engineering velocity
• Engineer satisfaction
4
Maximizing: Engineering
Velocity
5
How
• Implementation freedom
• SCM, libraries, language
• that said, platform benefits exist
• Deployment freedom
• Service team owns
• push schedule, functionality, performance
• operational activities (being paged)
• On-demand cloud capacity
• Thousands of instances at the push of a button
6
Rapid Deployment?
Impossible..
3-6 Months?
7
Rapid (Cloud) Deployment
3-5 Minutes
8
BaseAMI
• Supply the foundation
• Monitoring, java, apache, tomcat, etc.
• Open source project: Aminator
9
Pushing Code: Red-Black
• Gracefully roll code in, or out, of production
• Asgard is our AWS configuration mgmt. tool
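The red-black idea can be sketched roughly as follows. This is a minimal illustration, assuming a hypothetical in-memory `LoadBalancer` and `Group` model (neither is Asgard's actual interface): the new group comes up alongside the old one, traffic is switched over, and the old group is kept idle so that rolling back is a single switch.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical stand-in for an auto scaling group (ASG)."""
    name: str
    healthy: bool = True


@dataclass
class LoadBalancer:
    """Hypothetical traffic switch; not Asgard's real interface."""
    active: Group
    standby: list = field(default_factory=list)

    def push(self, new_group: Group) -> bool:
        """Route traffic to new_group, keeping the old group for rollback."""
        if not new_group.healthy:
            return False  # never shift traffic to an unhealthy group
        self.standby.append(self.active)
        self.active = new_group
        return True

    def rollback(self) -> bool:
        """Route traffic back to the most recent previous group."""
        if not self.standby:
            return False
        self.active = self.standby.pop()
        return True


lb = LoadBalancer(active=Group("api-v001"))
pushed = lb.push(Group("api-v002"))
after_push = lb.active.name
rolled_back = lb.rollback()
after_rollback = lb.active.name
```

Keeping the previous group around costs extra capacity for a while, which is the trade made for a fast, low-risk rollback.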
10
Compounded risks with increased velocity
Risks: Decreased Reliability, Performance, and Scalability
Not all Roses
11
Goal: CI (Continuous
Improvement)
12
Maximizing: Reliability
13
Fear (Revere) the Monkeys
• Simulate
• Latency
• Errors
• Initiate
• Instance Termination
• Availability Zone Failure
• Identify
• Configuration Drift
… in Test and Production
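Latency and error simulation can be illustrated with a small in-process wrapper. This is only a sketch under stated assumptions: `with_chaos` and its parameters are hypothetical names invented here, and it says nothing about how the actual monkeys are implemented.

```python
import random
import time


def with_chaos(func, latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap func so that calls are delayed and randomly fail (hypothetical helper)."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # simulated latency
        if rng.random() < error_rate:
            raise RuntimeError("injected failure")  # simulated error
        return func(*args, **kwargs)

    return wrapper


def lookup(title_id):
    """Stand-in for a call to a downstream service."""
    return {"id": title_id}


delays = []  # records requested delays instead of really sleeping
always_ok = with_chaos(lookup, latency_s=0.2, error_rate=0.0, sleep=delays.append)
flaky = with_chaos(lookup, latency_s=0.2, error_rate=1.0, sleep=delays.append)
```

Passing `sleep` and `rng` in as arguments keeps the wrapper deterministic under test, while the defaults give real delays and real randomness.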
14
Tracking Change: Chronos
• Aggregate Significant Events *
• Current Sources:
• Pushes (Asgard)
• Production Change Requests (JIRA)
• AWS Notifications
• Dynamic Property Changes
• ASG Scaling Events
• Implementation
• Simple REST-service; customized adapters
* - “can disrupt production service”
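One way to picture "simple service plus customized adapters" is a common event record that each source is mapped onto. The sketch below is an assumption-laden illustration: the `Event` fields, the adapter functions, and the raw record shapes are all hypothetical, and the REST layer is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: int   # epoch seconds
    source: str      # e.g. "asgard", "jira"
    summary: str


def asgard_adapter(raw):
    """Hypothetical adapter: map an assumed push record onto Event."""
    return Event(raw["time"], "asgard", f"push {raw['app']} {raw['version']}")


def jira_adapter(raw):
    """Hypothetical adapter: map an assumed change request onto Event."""
    return Event(raw["created"], "jira", f"{raw['key']}: {raw['title']}")


class EventLog:
    """In-memory store standing in for the aggregation service."""

    def __init__(self):
        self._events = []

    def record(self, event):
        self._events.append(event)

    def between(self, start, end):
        """Events with start <= timestamp <= end, oldest first."""
        hits = [e for e in self._events if start <= e.timestamp <= end]
        return sorted(hits, key=lambda e: e.timestamp)


log = EventLog()
log.record(jira_adapter({"created": 1300, "key": "CHG-1", "title": "raise pool size"}))
log.record(asgard_adapter({"time": 1200, "app": "api", "version": "v002"}))
log.record(asgard_adapter({"time": 5000, "app": "edge", "version": "v010"}))
recent = log.between(1000, 2000)
```

Querying by time window is what makes such a log useful during an incident: it answers "what changed just before this started?" across all sources at once.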
15
Chronos, cont.
16
Automated Canary Analysis
• Identify regression between new and existing code
• Point ACA to baseline (prod) and canary ASG
• Typically analyze an hour's worth of time series data
• Compare ratio of averages between canary and baseline
• Evaluate range and noise; determine quality of signal
• Bucket: Hot, Cold, Noisy, or OK
• Multiple classifiers available
• Multiple metric collections (e.g. hand-picked by service, general)
• Rollup
• Constrained: along metric dimensions
• Final: Score the canary
• Implementation: R-based analysis
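The classification and final rollup steps above can be sketched as follows. The slides describe an R-based implementation; this Python version is only an illustration, and the thresholds (`tolerance`, `max_noise`), the noise measure, and the scoring rule are assumptions made here, not Netflix's classifiers. The constrained rollup along metric dimensions is omitted.

```python
from statistics import mean, pstdev


def classify(canary, baseline, tolerance=0.25, max_noise=0.5):
    """Bucket one metric as HOT, COLD, NOISY, or OK.

    tolerance and max_noise are illustrative thresholds, not Netflix's.
    """
    base_avg = mean(baseline)
    if base_avg == 0:
        return "NOISY"  # ratio undefined; treat as no usable signal
    # Spread of the noisier series relative to the baseline average.
    noise = max(pstdev(canary), pstdev(baseline)) / base_avg
    if noise > max_noise:
        return "NOISY"
    ratio = mean(canary) / base_avg  # ratio of averages
    if ratio > 1 + tolerance:
        return "HOT"
    if ratio < 1 - tolerance:
        return "COLD"
    return "OK"


def score(results):
    """Final rollup (assumed rule): percentage of metrics classified OK."""
    return 100.0 * sum(r == "OK" for r in results.values()) / len(results)


# Each metric maps to (canary samples, baseline samples); values are made up.
metrics = {
    "latency_ms": ([150, 160, 155], [100, 105, 95]),
    "cpu_pct": ([40, 41, 39], [40, 42, 38]),
    "errors": ([0, 30, 0], [1, 25, 2]),
    "rps": ([50, 52, 48], [100, 101, 99]),
}
results = {name: classify(c, b) for name, (c, b) in metrics.items()}
canary_score = score(results)
```

Checking noise before the ratio matters: a metric that swings wildly can produce a ratio near 1 by accident, so it is safer to report it as NOISY than as OK.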
17
ACA: in Action
[Figure: per-metric classifications (HOT, COLD, NOISY, OK), with the constrained rollup (dashed) and the final rollup]
18
Hystrix: Defend Your App
● Protection from downstream service failures
● Functional (unavailable) or performance in nature
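The kind of protection described here is commonly built around a circuit breaker with a fallback. Hystrix itself is a Java library; the toy Python class below does not use or mirror its API, and every name and default in it is an assumption for illustration.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker; not Hystrix's API."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: fail fast without calling downstream
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            return fallback()
        self.failures = 0
        return result


now = [0.0]  # fake clock so the example does not need to wait
breaker = CircuitBreaker(max_failures=2, reset_after_s=10.0, clock=lambda: now[0])
calls = []


def failing():
    calls.append("down")
    raise TimeoutError("downstream timed out")


def healthy():
    calls.append("up")
    return "fresh"


def fallback():
    return "cached"
```

A timeout counts as a failure just like an outright error, which covers both the functional and the performance cases named on the slide.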
19
Maximizing: Scalability and
Performance
20
Dynamic Scaling
EC2 footprint autoscales 2500-3500 instances per day
• order of tens of thousands of EC2 instances
• Larger ASG spans 200-900 m2.4xlarge daily
Why:
• Improved scalability during unexpected workloads
• Absorb variance in service performance profile
• Reactive chain of dependencies
• Creates "reserved instance troughs" for batch activity
21
Dynamic Scaling, cont.
Example covers 3 services
• 2 edge (A,B), 1 mid-tier (C)
• C has more upstream services than simply A and B
Multiple Autoscaling Policies
• (A) System Load Average
• (B,C) Request-Rate based
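A request-rate based policy boils down to sizing the group so that each instance stays at or below a target request rate. The function below is a simplified sketch: the per-instance target of 100 requests/sec is a made-up number, and real AWS auto scaling policies are driven by alarms and step adjustments rather than a direct calculation like this. The 200-900 bounds echo the ASG range quoted earlier in the deck.

```python
import math


def desired_instances(request_rate, target_per_instance, min_size, max_size):
    """Instances needed to keep each one at or below its target request rate."""
    needed = math.ceil(request_rate / target_per_instance)
    # Clamp to the group's configured bounds.
    return max(min_size, min(max_size, needed))


peak = desired_instances(45_000, 100, min_size=200, max_size=900)
trough = desired_instances(5_000, 100, min_size=200, max_size=900)
surge = desired_instances(200_000, 100, min_size=200, max_size=900)
```

Scaling on request rate tracks demand directly, whereas a load-average policy (as used for service A) reacts to how hard the instances are working.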
22
Dynamic Scaling, cont.
23
Dynamic Scaling, cont.
• Response time variability greatest during scaling events
• Average response time primarily between 75-150 msec
24
Dynamic Scaling, cont.
• Instance counts 3x, Aggregate requests 4.5x (not shown)
• Average CPU utilization per instance: ~25-55%
25
Cassandra Performance
Study performed:
• 24 node C* SSD-based cluster (hi1.4xlarge)
• mid-tier service load application
• Targeting 2x production rates
• Increase read ops from 30k to 70k in ~ 3 minutes
• Increase write ops from 750 to 1500 in ~ 3 minutes
Results:
• 95th pctl response time increase: ~ 17 msec to 45 msec
• 99th pctl response time increase: ~ 35 msec to 80 msec
26
EVCache (memcached) Scalability
Response times consistent during 4x increase in load *
* Due to upstream code change
27
Cloud-scale Load Testing
• Ad-Hoc or CI-based load test model
• (CI) Run-over-run comparison; email on rule violation
1. Jenkins initiates job
2. JMeter instances apply load
3. Results written to S3
4. Instance metrics published to Atlas
5. Raw data fetched and processed
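The run-over-run comparison in the CI model can be sketched as a rule check between two result sets. This is an illustration only: the metric names, the 10% threshold, and the `violations` helper are assumptions made here, and the email notification step is left out.

```python
def violations(previous, current, max_regression=0.10):
    """Return metrics whose value grew by more than max_regression run over run.

    Assumes higher is worse (e.g. latency); the threshold is illustrative.
    """
    flagged = {}
    for name, old in previous.items():
        new = current.get(name)
        if new is None or old <= 0:
            continue  # metric missing or no usable baseline
        change = (new - old) / old
        if change > max_regression:
            flagged[name] = round(change, 3)
    return flagged


# Made-up results from two consecutive load test runs.
previous_run = {"p95_ms": 40.0, "p99_ms": 80.0, "error_pct": 0.5}
current_run = {"p95_ms": 42.0, "p99_ms": 100.0, "error_pct": 0.5}
flagged = violations(previous_run, current_run)
```

Here only the 99th percentile is flagged (a 25% increase); a non-empty result is the point at which a CI job would notify the owning team.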
28
Conclusions
• Continually accelerate engineering velocity
• Evolve architecture and processes to mitigate risks
• Stateless micro-service architectures win!
• Remove barriers for engineers
• Last option should be to reduce rate of change
• Exercise failure and “thundering herd” scenarios
• Cloud native scaling and resiliency are key factors
• Leverage pre-existing OSS PaaS when possible
29
Netflix Open Source
Our Open Source Software simplifies mgmt at scale
Great projects, stunning colleagues:
jobs.netflix.com
30
Q&A
• cwatson@netflix.com
• Netflix Tech Blog: http://techblog.netflix.com
31

Editor's Notes

  • #9: Maximum engineering velocity can only be achieved when deployment velocity is a non-factor…thousands of systems in the time it takes to get a coffee.
  • #16: Chronos is the “go to” tool when something goes awry in production