PETER ALVARO
Orchestrated Chaos
With a prelude of vignettes and an appendix of fairy tales
Mythology
About me
Platitudes
“Managing complexity”
Easy: removing complexity
Much harder: moving complexity around
Nontrivial systems problems always require tradeoffs
Productivity / Convenience
Purity / Correctness
Vignettes
Vignette 1: teaching myself Docker
Vignette 2: a DBA tale
Vignette 3: selling lovely languages
Vignette 4: Microservices
The UNIX philosophy:
Do one thing and do it well.
> man ls
Orchestrated Chaos: Applying Failure Testing Research at Scale.
Ease of release wins
The profound solipsism of the microservice
Every microservice is a piece of the continent
What could possibly go wrong?
Consider a computation involving 100 services
Search Space: 2^100 executions
“Depth” of bugs
Single faults: search space of 100 executions
Combination of 4 faults: search space of 3M executions
Combination of 7 faults: search space of 16B executions
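The search-space figures above follow from counting subsets of the 100 services. A minimal sketch of that arithmetic, assuming each "depth" means choosing exactly that many services to fail at once (the function name `executions_at_depth` is illustrative, not from the talk):

```python
from math import comb

SERVICES = 100  # from the slide: a computation involving 100 services

# Any subset of the services may fail, so the full space has 2^100 members.
full_space = 2 ** SERVICES


def executions_at_depth(depth: int, services: int = SERVICES) -> int:
    """Number of distinct ways to fail exactly `depth` of the services."""
    return comb(services, depth)


for depth in (1, 4, 7):
    print(f"{depth} fault(s): {executions_at_depth(depth):,} executions")
print(f"all subsets: {full_space:.3e} executions")
```

The exact counts are 100, 3,921,225 and 16,007,560,800, which the slides round to 100, 3M and 16B.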
Reflections
1. Managing complexity can be a zero-sum game
2. Productivity trumps purity
3. Chaos results... and gives rise to a new order
Opportunity
What the hell is going on? (Observability)
Call graph tracing (e.g. Zipkin)
What could possibly go wrong? (Fault injection)
A fault injection framework (e.g. FIT)
Random search
A fault injection framework (e.g. FIT)
Call graph tracing (e.g. Zipkin)
Search Space: 2^100 executions
Engineer-guided search
A fault injection framework (e.g. FIT)
Call graph tracing (e.g. Zipkin)
Search Space: ???
…?
A fault injection framework (e.g. FIT)
Call graph tracing (e.g. Zipkin)
A cunning malevolent sentience?
A fault injection framework (e.g. FIT)
Call graph tracing (e.g. Zipkin)
Lineage-driven Fault Injection
A fault injection framework (e.g. FIT)
LDFI
Call graph tracing (e.g. Zipkin)
Fault-tolerance “is just” redundancy
But how do we know redundancy when we see it?
Hard question: “Could a bad thing ever happen?”
Easier: “Exactly why did a good thing happen?”
“What could have gone wrong?”
Lineage-driven fault injection
Why did a good thing happen?
Consider its lineage.
[Figure: lineage graph, top to bottom: "The write is stable"; "Stored on RepA", "Stored on RepB"; Bcast1, Bcast2; Client, Client]
Lineage-driven fault injection
Why did a good thing happen?
Consider its lineage.
What could have gone wrong?
Faults are cuts in the lineage graph.
Is there a cut that breaks all supports?
[Figure: lineage graph, top to bottom: "The write is stable"; "Stored on RepA", "Stored on RepB"; Bcast1, Bcast2; Client, Client]
What would have to go wrong?
(RepA OR Bcast1)
AND (RepA OR Bcast2)
AND (RepB OR Bcast2)
AND (RepB OR Bcast1)
[Figure: lineage graph, top to bottom: "The write is stable"; "Stored on RepA", "Stored on RepB"; Bcast1, Bcast2; Client, Client. Each build of the slide adds one clause to the formula.]
Lineage-driven fault injection
[Figure: lineage graph, top to bottom: "The write is stable"; "Stored on RepA", "Stored on RepB"; Bcast1, Bcast2; Client, Client]
Hypothesis: {Bcast1, Bcast2}
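One way to see where a hypothesis such as {Bcast1, Bcast2} comes from is to treat the four-clause formula as a CNF over "this component failed" variables and look for its smallest satisfying fault sets. The sketch below brute-forces the truth table, which is only an illustration: the names `CLAUSES`, `breaks_all_supports` and `minimal_cuts` are assumptions, and a real implementation would hand the formula to a solver.

```python
from itertools import product

# One clause per lineage path: a path breaks if ANY component on it fails.
CLAUSES = [
    {"RepA", "Bcast1"},
    {"RepA", "Bcast2"},
    {"RepB", "Bcast2"},
    {"RepB", "Bcast1"},
]
COMPONENTS = sorted(set().union(*CLAUSES))


def breaks_all_supports(failed) -> bool:
    """True if the failed components cut every path (the CNF is satisfied)."""
    return all(clause & failed for clause in CLAUSES)


def minimal_cuts() -> list:
    """Enumerate every assignment, keep the cuts that have no smaller cut inside them."""
    cuts = []
    for bits in product([False, True], repeat=len(COMPONENTS)):
        failed = frozenset(c for c, bit in zip(COMPONENTS, bits) if bit)
        if breaks_all_supports(failed):
            cuts.append(failed)
    minimal = [c for c in cuts if not any(other < c for other in cuts)]
    return sorted(minimal, key=sorted)


print([sorted(cut) for cut in minimal_cuts()])
# prints [['Bcast1', 'Bcast2'], ['RepA', 'RepB']]
```

Both minimal cuts are candidate experiments; the slide picks {Bcast1, Bcast2} as the hypothesis to inject.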
Lineage-driven fault injection
[Figure: lineage graph, top to bottom: "The write is stable"; "Stored on RepA", "Stored on RepB"; Bcast1, Bcast2, Bcast3; Client, Client, Client]
(RepA OR Bcast1)
AND (RepA OR Bcast2)
AND (RepB OR Bcast2)
AND (RepB OR Bcast1)
AND (RepA OR Bcast3)
AND (RepB OR Bcast3)
Search Space Reduction
Each experiment finds a bug, OR reduces the search space
Lineage-driven Fault Injection
Recipe:
1. Start with a successful outcome. Work backwards.
2. Ask why it happened: Lineage
3. Convert lineage to a boolean formula and solve
4. Lather, rinse, repeat
[Figure: 1. Success → (Why?) → 2. Lineage → (Encode) → 3. CNF → (Solve) → Fail → 4. REPEAT]
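The recipe can be sketched as a loop over a toy system. Everything here is an assumption made for illustration: `run` stands in for "inject these faults, replay the request, and collect lineage", the Bcast3 retry mimics the replica example from the earlier slides, and `next_hypothesis` is a brute-force stand-in for a real solver.

```python
from itertools import combinations


def run(faults: frozenset) -> set:
    """Hypothetical system under test. Returns the lineage of a stable write
    as a set of paths (frozensets of components); an empty set means failure."""
    bcasts = ["Bcast1", "Bcast2"]
    if all(b in faults for b in bcasts):
        bcasts.append("Bcast3")  # assumed retry once both broadcasts are lost
    return {
        frozenset({rep, b})
        for rep in ("RepA", "RepB")
        for b in bcasts
        if rep not in faults and b not in faults
    }


def next_hypothesis(clauses: set, tested: set):
    """Smallest untested fault set that breaks every known lineage path."""
    components = sorted(set().union(*clauses))
    for size in range(1, len(components) + 1):
        for combo in combinations(components, size):
            candidate = frozenset(combo)
            if candidate not in tested and all(candidate & c for c in clauses):
                return candidate
    return None


def ldfi(run_fn):
    """Steps 1-4: observe success, collect lineage, solve, inject, repeat."""
    clauses, tested, faults, experiments = set(), set(), frozenset(), 0
    while True:
        experiments += 1
        lineage = run_fn(faults)
        if not lineage:              # the good outcome did not happen
            return faults, experiments
        clauses |= lineage           # each path contributes one clause
        tested.add(faults)
        faults = next_hypothesis(clauses, tested)
        if faults is None:           # no remaining way to cut all supports
            return None, experiments


failing, experiments = ldfi(run)
print(sorted(failing), experiments)
# prints ['RepA', 'RepB'] 3
```

In this toy run the first hypothesis {Bcast1, Bcast2} is survived thanks to the retry, which adds the Bcast3 clauses, and the third experiment fails both replicas. Whether such a result counts as a bug depends on the fault model, which this sketch does not represent.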
Minimal requirements
1. Fault injection infrastructure
2. Mechanism for collecting lineage
3. Ability to replay interactions
Lineage
Request Tracing
Alternate Execution
Redundancy through History
Case study: “Netflix AppBoot”
Services: ~100
Search space (executions): 2^100 (~1,000,000,000,000,000,000,000,000,000,000)
Experiments performed: 200
Critical bugs found: 11
Fairy tale
Growing Research
Don’t:
“Throw it over the wall”
Do:
Deep embeddings
Trading shoes
Growing Research
Work with us
Search prioritization
Input generation
Richer lineage collection
Search prioritization: minimal hitting sets
(A ∨ B ∨ C) ∧ (C ∨ D ∨ E ∨ F) ∧ (D ∨ E ∨ F ∨ G) ∧ (H ∨ I)
(A, B, C), (C, D, E, F), (D, E, F, G), (H, I)
⇩
e.g. (C, E, H) ✔
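A small sketch of the prioritization idea, assuming "prioritize" means trying the smallest fault combinations first. It enumerates minimal hitting sets of the four clauses in order of size; the helper names are illustrative, and exhaustive enumeration like this only works for tiny formulas.

```python
from itertools import combinations

# The four clauses from the slide, each as the set of its literals.
CLAUSES = [set("ABC"), set("CDEF"), set("DEFG"), set("HI")]


def is_hitting_set(candidate, clauses=CLAUSES) -> bool:
    """A hitting set shares at least one element with every clause."""
    return all(candidate & clause for clause in clauses)


def minimal_hitting_sets(clauses=CLAUSES):
    """Yield minimal hitting sets, smallest first.

    Candidates are visited by increasing size, so a hitting set is minimal
    exactly when it contains none of the hitting sets already yielded."""
    universe = sorted(set().union(*clauses))
    found = []
    for size in range(1, len(universe) + 1):
        for combo in combinations(universe, size):
            candidate = frozenset(combo)
            if is_hitting_set(candidate, clauses) and not any(f <= candidate for f in found):
                found.append(candidate)
                yield candidate


smallest = [s for s in minimal_hitting_sets() if len(s) == 3]
print(frozenset("CEH") in smallest, len(smallest))
```

No hitting set can have fewer than three elements here, because (A, B, C), (D, E, F, G) and (H, I) share no literals. The slide's example (C, E, H) is one of the size-three solutions.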
Measuring fault tolerance by counting alternatives
Most likely combination of faults
Input generation
The importance of being inputs
Using lightweight modeling to understand Chord
Pamela Zave
Richer lineage collection
Where we are
A fault injection framework (e.g. FIT)
Call graph tracing (e.g. Zipkin)
Where we’re headed
A fault injection framework (e.g. FIT)
Lineage-driven fault injection
Call graph tracing (e.g. Zipkin)
Thanks to our hosts, benefactors and collaborators!
References
● ‘Automating Failure Testing Research at Internet Scale’ [ACM SoCC’16]
https://people.ucsc.edu/~palvaro/fit-ldfi.pdf
● ‘Lineage-driven Fault Injection’ [ACM SIGMOD’15]
http://people.ucsc.edu/~palvaro/molly.pdf
● Netflix Tech Blog on ‘Automated Failure Testing’
http://techblog.netflix.com/2016/01/automated-failure-testing.html
FOLD
UGLY
GOOD RAW
True Silicon Valley Stories
1. Crazy legwork
2. The “what the hell does our site do” project
3. Offsite => online
Replay
Bins and Balls
[Figure: Request; bins Class 1, Class 2, Class 3, [...], Class n; r’ and r shown with Class n]
Predicting Request Graphs
Some function f: Requests → Classes
f(Request) = Class n