The Case for Chaos – AWS Pop-up Loft
Bruce Wong – Engineering Manager – Chaos Engineering, Netflix
Who am I?
Bruce Wong
@bruce_m_wong
• Netflix since 2010
• Computer Science
• Builds Engineering Teams
  • 5 different teams so far
Agenda
• Why?
• Case Studies
• How you can start chaos testing
• Future chaos
Failure is Unavoidable
• Disks fail
• Power goes out, and then your generator fails
• Software bugs
• Human error
What about the cloud?
Cloud Case Study
• XSA-108 security vulnerability
• ~10% of EC2 instances rebooted
• Spread over 5 days
• One Availability Zone at a time
Chaos Validated + Public Cloud Validated
Netflix & Micro-Services
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
[Slides 13-16: image-only slides, no text]
Graceful Degradation
Product + Engineering Decision
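As an illustration of graceful degradation, here is a minimal sketch that assumes a hypothetical homepage built from independent row providers. The names `build_homepage`, `popular`, and `because_you_watched` are invented for this example and are not Netflix APIs. A failing optional row is left out instead of failing the whole page.

```python
from typing import Callable, Dict, List


def build_homepage(rows: Dict[str, Callable[[], List[str]]]) -> Dict[str, List[str]]:
    """Assemble a page from independent row providers, skipping any that fail."""
    page: Dict[str, List[str]] = {}
    for name, fetch in rows.items():
        try:
            page[name] = fetch()
        except Exception:
            # Degrade: omit this row rather than fail the whole page.
            continue
    return page


def popular() -> List[str]:
    # Hypothetical non-personalized row that is always available.
    return ["Title A", "Title B"]


def because_you_watched() -> List[str]:
    # Hypothetical personalized row whose backing service is down.
    raise TimeoutError("personalization service unavailable")


page = build_homepage({"popular": popular, "because_you_watched": because_you_watched})
print(page)
```

Which rows may be dropped, and what the member sees instead, is the product and engineering decision the slide refers to; the code only makes that choice explicit.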
Designing for Failure
• Infrastructure Failure
  • Instance terminations – single points of failure
  • Latency
  • Availability Zone
  • Regional
• Application Failure
  • Graceful degradation
  • Software bugs
Testing
• Unit testing
• Integration testing
• Functional testing
• Regression testing
• Chaos testing
Finding bugs earlier
Resilience needs to be tested
Testing is hard
• Large and growing data sets
• Internet-scale traffic
• Innovation and new features
• Change is constant
Resilience needs to be tested
• Validate resilience design
• Don’t wait for the next outage
  • Uncontrolled
  • Unpredictable
Hope is not a strategy
Types of Chaos
Instances Fail
Lessons
• Be as stateless as possible
• Autoscaling groups are good
• Invest in automation to rebuild state when necessary
• Running Chaos Monkey on Cassandra (C*)
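The idea of random instance termination can be sketched without touching any cloud API. The following is an in-memory simulation only, not how Chaos Monkey is implemented: the group names, instance IDs, and the helpers `pick_victim` and `terminate` are hypothetical. The rule that skips groups with fewer than two instances is an assumption added here to keep the sketch conservative.

```python
import random
from typing import Dict, List, Optional


def pick_victim(groups: Dict[str, List[str]], rng: random.Random) -> Optional[str]:
    """Pick one instance at random from a randomly chosen eligible group.

    Assumption for this sketch: groups with fewer than two instances are
    skipped, so the last instance of a service is never selected.
    """
    eligible = {name: ids for name, ids in groups.items() if len(ids) >= 2}
    if not eligible:
        return None
    group = rng.choice(sorted(eligible))
    return rng.choice(eligible[group])


def terminate(groups: Dict[str, List[str]], instance_id: str) -> bool:
    """Simulated termination: remove the instance from its in-memory group."""
    for ids in groups.values():
        if instance_id in ids:
            ids.remove(instance_id)
            return True
    return False


groups = {"api": ["i-1", "i-2", "i-3"], "batch": ["i-9"]}
victim = pick_victim(groups, random.Random())
if victim is not None:
    terminate(groups, victim)
print(victim, groups)
```

A stateless service behind an autoscaling group should absorb this kind of loss without intervention, which is what the first two lessons on the slide point at.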
Types of Chaos
Many Instances can Fail
Lessons
• Cassandra works as expected
• Moving traffic back to steady state is just as hard
• Infrastructure management tools can be a bottleneck
Types of Chaos
Natural Disasters Happen
Lessons
• Cassandra works as expected
• Moving traffic back to steady state is just as hard
• Infrastructure management can be a bottleneck
• Benefits of a smaller blast radius
• Traffic + capacity orchestration is hard
Types of Chaos
Latency
Still Learning
• Functional fallbacks don’t account for system limitations
  • Thread pools
  • Connection pools
• Slow can be hard to find
• Slow can be hard to contain
• Unbounded queues are BAD
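To make the point about fallbacks and system limits concrete, here is a minimal sketch using Python's standard `concurrent.futures`. The dependency functions and the helper `call_with_fallback` are hypothetical. A slow call is abandoned after a timeout and a default value is returned, but the worker thread stays occupied until the slow call finishes.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# A small, fixed pool bounds how many dependency calls run at the same time.
pool = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(fn, fallback, timeout_s):
    """Run fn on the bounded pool; return fallback if it is too slow or fails."""
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The caller is freed, but the worker stays busy until fn returns.
        return fallback
    except Exception:
        return fallback


def slow_dependency():
    # Hypothetical dependency suffering from injected latency.
    time.sleep(0.5)
    return "personalized"


def fast_dependency():
    # Hypothetical healthy dependency.
    return "personalized"


result_slow = call_with_fallback(slow_dependency, "default", timeout_s=0.05)
result_fast = call_with_fallback(fast_dependency, "default", timeout_s=0.5)
print(result_slow, result_fast)
pool.shutdown(wait=True)
```

The fallback is functionally correct, yet each timed-out call still holds one of the four threads. In CPython the executor's internal work queue is also unbounded, so sustained slowness can pile up pending work behind the pool, which is the kind of limit a purely functional fallback does not account for.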
Unbounded Queues
• Come in many forms, to name a few
  • Threads
  • Memory
  • Disk
• Bounded by physical limitations
• VERY difficult to find
• Elastic is not infinite
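One way to avoid an accidentally unbounded queue is to state the bound explicitly and decide what happens when it is reached. A minimal sketch with the standard `queue` module follows; the bound of 3 and the helper `try_enqueue` are illustrative choices, not values from the talk.

```python
import queue

# Explicit bound; maxsize=0 (the default) would make the queue unbounded.
work = queue.Queue(maxsize=3)


def try_enqueue(item) -> bool:
    """Add item without blocking; report rejection instead of growing forever."""
    try:
        work.put_nowait(item)
        return True
    except queue.Full:
        return False


accepted = [try_enqueue(i) for i in range(5)]
print(accepted)  # [True, True, True, False, False]
```

Rejecting at the edge turns a hidden physical limit (memory) into a visible, testable behavior.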
For Example: Memory and Data
• Data is important
• In-memory queue grows and shrinks
• Failure Mode #1: Out of memory
• NOT A MEMORY LEAK!
For Example: Memory and Data
• Data is important
• If queue gets to size X
  • Write to disk
  • Flush later
• Failure Mode #2
  • Disk full
  • File descriptors saturated
For Example: Memory and Data
• Data is important
…
But not as important as uptime
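If uptime matters more than keeping every record, one possible policy is to shed the oldest data once a fixed capacity is reached and to count what was dropped. The class below is a hypothetical sketch of that policy using `collections.deque`; the talk does not say which policy Netflix chose.

```python
from collections import deque


class SheddingBuffer:
    """Fixed-capacity buffer that evicts the oldest item instead of growing."""

    def __init__(self, capacity: int) -> None:
        self._items = deque(maxlen=capacity)
        self.dropped = 0  # number of items evicted or discarded so far

    def append(self, item) -> None:
        if len(self._items) == self._items.maxlen:
            # The deque is full, so this append discards one item.
            self.dropped += 1
        self._items.append(item)

    def drain(self) -> list:
        """Return buffered items in arrival order and empty the buffer."""
        items = list(self._items)
        self._items.clear()
        return items


buf = SheddingBuffer(capacity=3)
for event in range(5):
    buf.append(event)
drained = buf.drain()
print(drained, buf.dropped)  # [2, 3, 4] 2
```

Memory use stays constant under any load, and the `dropped` counter gives operators a signal that data is being shed.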
Starting Chaos
• Start small, very small.
• Start simple, stateless systems
• Start manually and coordinated
• Failure Injection Fridays
• Build confidence
• Outages are opportunities
Chaos takes time
2010
2012
2014
Aspirational Chaos
• Increase frequency & intensity
• Reduces chance of drift
• Infrastructure
  • Continuous latency injection
  • Chaos Gorilla random AZ weekly
  • Latency Gorilla
  • CPU, Memory, Disk
• Application
  • Continuous validation of fallbacks
  • Startup dependency failure injection
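Latency injection at the application level can be sketched as a decorator that delays a configurable fraction of calls. This is an illustration only: `inject_latency`, its parameters, and `get_ratings` are hypothetical names, and the `sleep` argument is made replaceable purely so the behavior can be observed without waiting.

```python
import functools
import random
import time


def inject_latency(probability, delay_s, rng=None, sleep=time.sleep):
    """Delay roughly `probability` of calls by `delay_s` seconds before running them."""
    rng = rng or random.Random()

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                sleep(delay_s)  # the injected latency
            return fn(*args, **kwargs)

        return wrapper

    return decorator


delays = []  # records injected delays instead of actually sleeping


@inject_latency(probability=1.0, delay_s=0.2, sleep=delays.append)
def get_ratings(title):
    # Hypothetical dependency call.
    return {"title": title, "stars": 4}


result = get_ratings("Title A")
print(result, delays)
```

Running such a wrapper continuously at a low probability, rather than only during planned exercises, is one way to keep timeouts and fallbacks from drifting out of validity.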
Questions


Editor's Notes

  • #11: Fantastic, because we didn’t have to do anything. If we were running our own hardware, we would have had to do the reboots ourselves.
  • #14: The Netflix experience. Services that make the experience possible: “Because you watched”, search, profiles, localization.
  • #15: Gracefully degraded Netflix experience. Miss it? “evidence”
  • #16: Ratings