The Hurricane’s Butterfly
Debugging pathologically performing systems
Bryan Cantrill
CTO
bryan@joyent.com
@bcantrill
Debugging system failure
• Failures are easiest to debug when they are explicit and fatal
• A system that fails fatally stops: it ceases to make forward
progress, leaving behind a snapshot of its state — a core dump
• Unfortunately, not all problems are like this…
• A broad class of problems are non-fatal: the system continues
to operate despite having failed, often destroying evidence
• Worst of all are those non-fatal failures that are also implicit
Implicit, non-fatal failure
• The most difficult, time-consuming bugs to debug are those in
which the system failure is unbeknownst to the system itself
• The system does the wrong thing or returns the wrong result or
has pathological side effects (e.g., resource leaks)
• Of these, the gnarliest class are those failures that are not
strictly speaking failure at all: the system is operating correctly,
but is failing to operate in a timely or efficient fashion
• That is, it just… sucks
The stack of abstraction
• Our software systems are built as stacks of abstraction
• These stacks allow us to stand on the shoulders of history — to
reuse components without rebuilding them
• We can do this because of the software paradox: software is
both information and machine, exhibiting properties of both
• Our stacks are higher and run deeper than we can see or know:
software is silent and opaque; the nature of abstraction is to
seal us from what runs beneath!
• They run so deep as to challenge our definition of software…
The Butterflies
• When the stack of abstraction performs pathologically, its power
transmogrifies to peril: layering amplifies performance
pathologies but hinders insight
• Work amplifies as we go down the stack
• Latency amplifies as we go up the stack
• Seemingly minor issues in one layer can cascade into systemic
pathological performance
• These are the butterflies that cause hurricanes
Butterfly I: ARC-induced black hole
Butterfly II: Disk reader starvation
Butterfly III: Kernel page-table isolation
Data courtesy Scaleway, running a PHP workload with KPTI patches for Linux. Thank you Edouard Bonlieu and team!
The Hurricane
• With pathologically performing systems, we are faced with
Leventhal’s Conundrum: given a hurricane, find the butterflies!
• This is excruciatingly difficult:
• Symptoms are often far removed from root cause
• There may not be a single root cause but several
• The system is dynamic and may change without warning
• Improvements to the system are hard to model and verify
• Emphatically, this is not “tuning” — it is debugging
Performance debugging
• When we think of it as debugging, we can stop pretending that
understanding (and rectifying) pathological system performance
is rote or mechanical — or easy
• We can resist the temptation to be guided by folklore: just
because someone heard about something causing a problem
once doesn’t mean it’s the problem now!
• We can resist the temptation to change the system before
understanding it: just as you wouldn’t (or shouldn’t!) debug by
just changing code, you shouldn’t debug a pathologically
performing system by randomly altering it!
How do we debug?
• To debug methodically, we must resist the temptation to leap to
quick hypotheses, focusing instead on questions and observations
• Iterating between questions and observations gathers the facts
that will constrain future hypotheses
• These facts can be used to disconfirm hypotheses!
• How do we ask questions?
• How do we make observations?
Asking questions
• For performance debugging, the initial question formulation is
particularly challenging: where does one start?
• Resource-centric methodologies like the USE Method
(Utilization/Saturation/Errors) can be excellent starting points…
• But keep these methodologies in their context: they provide
initial questions to ask — they are not recipes for debugging
arbitrary performance pathologies!
Making observations
• Questions are answered through observation
• The observability of the system is paramount
• If the system cannot be observed, one is reduced to guessing,
making changes, and drawing inferences
• If it must be said, drawing inferences based only on change is
highly flawed: correlation does not imply causation!
• To be observable, systems must be instrumentable: they must
be able to be altered to emit a datum in the desired condition
Observability through instrumentation
• Static instrumentation modifies source to provide semantically
relevant information, e.g., via logging or counters
• Dynamic instrumentation allows the system to be changed while
running to emit data, e.g., DTrace or OpenTracing (see the sketch below)
• Both mechanisms of instrumentation are essential!
• Static instrumentation provides the observations necessary for
early question formulation…
• Dynamic instrumentation answers deeper, ad hoc questions
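To make the distinction concrete, here is a minimal sketch of dynamic instrumentation with DTrace: a D script that can be enabled against a running system — no source changes, no restart — to report the latency of every read(2) system call. (This assumes the standard syscall provider is available; probe names can vary by platform.)

```
#!/usr/sbin/dtrace -s

/* On entry to read(2), note the time in a thread-local variable */
syscall::read:entry
{
        self->ts = timestamp;
}

/* On return, report how long the call took, then clear the variable */
syscall::read:return
/self->ts/
{
        printf("%s: read(2) took %d ns\n", execname, timestamp - self->ts);
        self->ts = 0;
}
```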
Aside: Monitoring vs. observability
• Monitoring is an essential operational activity that can indicate a
pathologically performing system and provide initial questions
• But monitoring alone is often insufficient to completely debug a
pathologically performing system, because the questions that it
can answer are limited to that which is monitored
• As we increasingly deploy developed systems rather than
received ones, it is a welcome (and unsurprising!) development
to see the focus of monitoring expand to observability!
Aggregation
• When a system is instrumented, it can become overwhelmed by the
overhead of that instrumentation
• Aggregation is essential for scalable, non-invasive
instrumentation — and is a first-class primitive in (e.g.) DTrace,
as sketched below
• But aggregation also eliminates important dimensions of data,
especially with respect to time; some questions may only be
answered with disaggregated data!
• Use aggregation for performance debugging — but also
understand its limits!
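As a minimal sketch of why aggregation scales: the D script below summarizes read(2) latency into per-program power-of-two histograms in the kernel, so only the summary — not every event — is reported, at the cost of collapsing the time dimension noted above.

```
#!/usr/sbin/dtrace -s

syscall::read:entry
{
        self->ts = timestamp;
}

/* Aggregate latency into a power-of-two histogram keyed by program
   name; data is summarized in situ rather than emitted per event */
syscall::read:return
/self->ts/
{
        @latency[execname] = quantize(timestamp - self->ts);
        self->ts = 0;
}
```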
Visualization
• The visual cortex is unparalleled at detecting patterns
• The value of visualizing data is not merely providing answers,
but also (and especially) provoking new questions
• Our systems are so large, complicated and abstract that there is
not one way to visualize them, but many
• The visualization of systems and their representations is an
essential skill for performance debugging!
Visualization: Gnuplot
• Graphs are terrific — so much so that we should not restrict
ourselves to the captive graphs found in bundled software!
• An ad hoc plotting tool is essential for performance debugging,
and Gnuplot is an excellent (if idiosyncratic) one
• Gnuplot is easily combined with workhorses like awk or perl (see
the sketch below)
• That Gnuplot is an essential tool helps to set expectations
around performance debugging tools: they are not magicians!
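As an illustrative sketch (not a prescription) of the kind of raw data such a pipeline consumes: the D script below emits one timestamped latency sample per line, ready to be massaged with awk and plotted ad hoc with Gnuplot. The file name and plot command in the trailing comment are hypothetical.

```
#!/usr/sbin/dtrace -s

#pragma D option quiet

syscall::read:entry
{
        self->ts = timestamp;
}

/* Emit raw, per-event data: seconds since the epoch and latency in
   nanoseconds -- one line per read(2), grist for awk and Gnuplot */
syscall::read:return
/self->ts/
{
        printf("%d %d\n", walltimestamp / 1000000000,
            timestamp - self->ts);
        self->ts = 0;
}

/* Hypothetical usage:  ./read-latency.d > lat.dat
   then in gnuplot:     plot "lat.dat" using 1:2 with points */
```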
Visualization: Heatmaps
Visualization: Flamegraphs
Visualization: Statemaps
• Especially when trying to understand interplay between different
entities, it can be useful to visualize their state over time
• Time is the critical element here!
• We are experimenting with statemaps, whereby state transitions
are instrumented (e.g., with DTrace) and then visualized (see the
sketch below)
• This is not necessarily a new way of visualizing the system
(e.g., early thread debuggers often showed thread state over
time), but with a new focus on post hoc visualization
• Primordial implementation: https://github.com/joyent/statemap
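A hedged sketch of the kind of state-transition instrumentation a statemap can be built from: the D script below uses the sched provider to note each time a thread of a given process goes on or off CPU. The process name "myservice" is hypothetical, and the output format here is purely illustrative — it is not the statemap tool's actual input format.

```
#!/usr/sbin/dtrace -s

#pragma D option quiet

/* Note each transition onto CPU for threads of the (hypothetical)
   process "myservice" */
sched:::on-cpu
/execname == "myservice"/
{
        printf("%d %d on-cpu\n", walltimestamp, tid);
}

/* ...and each transition off CPU */
sched:::off-cpu
/execname == "myservice"/
{
        printf("%d %d off-cpu\n", walltimestamp, tid);
}
```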
The hurricane’s butterfly
• Finding the source(s) of pathologically performing systems must
be thought of as debugging — albeit the hardest kind
• Debugging isn’t about making guesses; it’s about asking
questions and answering them with observations
• We must enshrine observability to assure debuggability!
• Debugging rewards persistence, grit, and resilience more than
intuition or insight — it is more perspiration than inspiration!
• We must have the faith that our systems are — in the end —
purely synthetic; we can find the hurricane’s butterfly!
