5 Essential Techniques for
Building Fault-tolerant Systems
DIEGO BERRUETA | ENGINEERING PRINCIPAL | ATLASSIAN
Preston Rhea; Flickr (www.flickr.com/photos/prestonrhea/), CC-by
ALERT!
Incident response
REVERSE PROXY
CASCADING FAILURE
aotaro; Flickr (www.flickr.com/photos/aotaro/), CC-by
Fault-tolerant system
Continues to operate in the event of faults in some
of its components
Fault
Deviation from the normal state, either due to
internal or external causes
Failure
Observable impact of a fault in a system, usually
manifests as reduced availability
Bob Yeats (OSU); Flickr (www.flickr.com/photos/oregonstateuniversity/), CC-by
Faults happen
Preventing them may be technically or
economically impractical
Systems can be designed and
built to tolerate faults
FAULT TOLERANCE
STABLE SYSTEM | UNSTABLE SYSTEM
Essential
techniques
Contain
Fail fast
Escape
Adjust
Learn
Contain the fault
Using physical or logical barriers
Norlando Pobre; Flickr (www.flickr.com/photos/npobre/), CC-by
Shawn O’Neil; Flickr (www.flickr.com/photos/oneilsh/), CC-by
Pool separation
Different pools for each task or dependency
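As a rough illustration of pool separation, the sketch below gives each dependency its own small thread pool, so a slow dependency can only exhaust its own workers. The dependency names ("payments", "search"), the pool sizes and the helper functions are assumptions made for this example, not part of the talk.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical dependencies: names and pool sizes are illustrative only.
POOLS = {
    "payments": ThreadPoolExecutor(max_workers=2, thread_name_prefix="payments"),
    "search": ThreadPoolExecutor(max_workers=2, thread_name_prefix="search"),
}


def submit(dependency, fn, *args):
    """Run fn on the pool reserved for the given dependency."""
    return POOLS[dependency].submit(fn, *args)


def slow_call():
    time.sleep(0.3)  # simulates a dependency that has become slow
    return "slow"


def fast_call():
    return "fast"


def demo():
    # Saturate the payments pool: 4 slow tasks compete for 2 workers.
    slow = [submit("payments", slow_call) for _ in range(4)]
    # The search pool has its own workers, so it still answers promptly.
    start = time.monotonic()
    result = submit("search", fast_call).result(timeout=0.25)
    elapsed = time.monotonic() - start
    for future in slow:
        future.result()  # let the slow work drain before returning
    return result, elapsed
```

With a single shared pool, the fast call would have queued behind the slow ones; with separate pools the fault stays inside the "payments" compartment.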
Brian Cantoni; Flickr (www.flickr.com/photos/cantoni/), CC-by
Asynchronous communication
Sender and receiver fail independently,
receiver can catch up later
Synchronous communication
Propagates failures, context may be lost
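A minimal sketch of the asynchronous case, assuming an in-memory queue.Queue stands in for a durable message broker: the sender keeps publishing while the receiver is down, and the receiver catches up once it starts. The function name and the event strings are made up for the example.

```python
import queue
import threading


def run_demo():
    mailbox = queue.Queue()  # stand-in for a message broker (assumption)
    processed = []

    # The sender publishes while no receiver is running; nothing fails.
    for i in range(3):
        mailbox.put(f"event-{i}")

    def receiver():
        while True:
            message = mailbox.get()
            if message is None:  # sentinel: no more messages
                break
            processed.append(message)

    # The receiver comes back later and works through the backlog.
    worker = threading.Thread(target=receiver)
    worker.start()
    mailbox.put(None)
    worker.join()
    return processed
```

In the synchronous equivalent the sender would call the receiver directly, so a receiver outage would surface as an error in the sender.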
Contain: in practice
CLIENT ISOLATION
NEW CLIENT
CLIENT A
CLIENT B
Avoid single points of failure (SPOF)
Find the components which
compromise the system
How to contain faults
Invest in redundancy
Improve availability by having
more than one of everything
Build bulkheads
Set up logical walls to reduce the blast radius
A quick error is better than a slow response
FAIL FAST PRINCIPLE
Validate early
Anticipate problems and change course
Carl Wycoff; Wikimedia Commons, CC-by
Reject politely
Fast refusal is better than slow error
Taber A.B.; Flickr (www.flickr.com/photos/andrewbain/), CC-by
Never wait long
Protect long tasks with timeouts
Robert C-B.; Flickr (www.flickr.com/photos/29233640@N07/), CC-by
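One possible way to put a time bound on a blocking call in Python is to run it on an executor and wait for the result with a timeout. This is only a sketch: call_with_timeout and slow_operation are hypothetical names, and the abandoned task keeps running in its worker thread, so the underlying call (socket, database driver, HTTP client) should ideally carry its own timeout as well.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, *args):
    """Return fn(*args), or raise TimeoutError after timeout_s seconds."""
    future = _executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only helps if the task has not started yet; a task that
        # is already running carries on in the background.
        future.cancel()
        raise TimeoutError(f"gave up after {timeout_s}s") from None


def slow_operation(duration_s):
    time.sleep(duration_s)  # simulates a slow remote API or query
    return "done"
```

The caller gets a quick, explicit error instead of waiting for as long as the slow operation happens to take.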
Watch out
for slowness
In-process locks
Database queries
Sockets
Remote APIs
Albert Herring, Wikimedia Commons, CC-by
Fail fast: in practice
DATABASE FAILOVER
SYNC
Decline service
When overloaded, ask clients
to come back later
How to fail fast
Never wait long
Set a timeout for blocking
calls and slow operations
Validate early
Avoid starting something that
cannot be completed
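The sketch below combines two of these ideas under simple assumptions: requests are validated before any work starts, and when all capacity is in use the service refuses immediately instead of queueing. LoadShedder, Overloaded and the required "user" field are invented for illustration.

```python
import threading


class Overloaded(Exception):
    """Raised immediately when the service has no spare capacity."""


class LoadShedder:
    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, request, handler):
        # Validate early: reject work that cannot be completed.
        if not isinstance(request, dict) or "user" not in request:
            raise ValueError("request must include a 'user' field")
        # Decline service: never block waiting for a free slot.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("busy, please retry later")
        try:
            return handler(request)
        finally:
            self._slots.release()
```

In an HTTP service, Overloaded would typically map to a 503 or 429 response, optionally with a Retry-After header, so clients know to come back later.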
Circuit breakers
Separate the failing parts
REST CIRCUIT-BREAKER
CLIENT SERVER
Fault-tolerance libraries
Circuit breaking
Avoid cascading failures during
periods of turbulence
Monitoring and alerting
Observe the behaviour of all your
dependencies
Timeouts
Time-bound any operation
Fall-back
Recover using an alternative path
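To make the circuit-breaking idea concrete, here is a deliberately small, single-threaded sketch. It is not the API of any particular fault-tolerance library; the class name, the thresholds and the injectable clock are assumptions for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                # Open: if it is not going to work, do not even try.
                raise CircuitOpenError("circuit open, call skipped")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open)
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A production breaker would also need locking for concurrent callers and usually tracks a failure rate over a window rather than a simple consecutive count.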
Escape: in practice
SERVICE EVOLUTION
TENANT A
TENANT B
TENANT C
Anticipate failure
If it is not going to work, do not even try
How to escape
Degrade gracefully
A cached result or a default
value may be an alternative
Detect problems
Compare all interactions
against error thresholds
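Graceful degradation can be sketched as a fallback chain: try the live dependency, then the last cached result, then a default value. The function below is a hypothetical helper, and a plain dict stands in for whatever cache the system actually uses.

```python
def fetch_with_fallback(key, fetch, cache, default=None):
    """Return fresh data if possible, else a cached or default value."""
    try:
        value = fetch(key)
    except Exception:
        # Dependency failed: degrade to the last known value or a default.
        return cache.get(key, default)
    cache[key] = value  # remember the last good answer
    return value
```

The caller receives a possibly stale answer instead of an error, which is often acceptable for data such as recommendations or display names.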
Communication
Share initial expectations and status updates
Adjust
Page size
Growth
Flow
Retry
Paginate queries
Avoid unbounded database and
remote queries
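A bounded query loop might look like the sketch below, assuming the data source accepts an offset and a limit. The names iter_pages and fetch, and the max_pages safety cap, are assumptions for the example.

```python
def iter_pages(fetch, page_size=100, max_pages=1000):
    """Yield pages from fetch(offset, limit), never more than max_pages."""
    offset = 0
    for _ in range(max_pages):
        page = fetch(offset, page_size)
        if not page:
            return
        yield page
        if len(page) < page_size:
            return  # a short page means the data is exhausted
        offset += len(page)
```

Each request has a known maximum cost, and the cap on pages keeps a single caller from walking an arbitrarily large result set.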
Beware of things that grow
Clean up old data and set size limits
Negotiate speed
Adjust dynamically using rate limits,
request throttling, buffering and
autoscaling
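Rate limiting is commonly implemented with a token bucket; the version below is a minimal, single-threaded sketch under that assumption. The class name, parameters and injectable clock are illustrative, not taken from the talk.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate_per_s` tokens/s."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self, cost=1.0):
        now = self._clock()
        # Refill in proportion to elapsed time, capped at capacity.
        refill = (now - self._last) * self.rate_per_s
        self._tokens = min(self.capacity, self._tokens + refill)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # caller should throttle, buffer or reject
```

When try_acquire returns False the caller can delay, buffer or reject the request, which is where throttling and back pressure come in.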
Insist smartly
Retry transient errors with timeouts
and back-off
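A retry loop with exponential back-off and jitter could be sketched as follows. TransientError is a hypothetical exception type used to mark errors worth retrying; the delays and attempt count are arbitrary defaults, and each attempt is assumed to carry its own timeout.

```python
import random
import time


class TransientError(Exception):
    """Marks an error that may succeed if tried again later."""


def retry(fn, attempts=4, base_delay_s=0.1, max_delay_s=2.0,
          sleep=time.sleep, rng=random.random):
    """Call fn, retrying TransientError with capped exponential back-off."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(delay * rng())  # random jitter spreads out retries
```

Errors other than TransientError propagate immediately, since retrying a permanent failure only adds load to a system that is already struggling.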
Adjust: in practice
KILLER HEALTH CHECK
Insist smartly
Transient errors can be
retried with back-off
How to adjust
Report availability
Apply back pressure to
prevent congestion
Negotiate size
Limit the cost of the job
Monitor the behaviour
Define metrics, observe trends, set up
alerts, capture logs
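As one small example of turning a metric into an alert, the sketch below tracks the error rate over the most recent calls and flags when it crosses a threshold. The class name, window size and threshold are assumptions; a real system would normally export such metrics to a monitoring service instead of evaluating them in process.

```python
from collections import deque


class ErrorRateMonitor:
    def __init__(self, window=100, threshold=0.5, min_samples=10):
        self._outcomes = deque(maxlen=window)  # True = success
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, ok):
        self._outcomes.append(bool(ok))

    def error_rate(self):
        if not self._outcomes:
            return 0.0
        return self._outcomes.count(False) / len(self._outcomes)

    def should_alert(self):
        # Require a minimum sample size so one early failure is not noise.
        enough = len(self._outcomes) >= self.min_samples
        return enough and self.error_rate() >= self.threshold
```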
Learn from
experience
Observe
Test
Reflect
Test with chaos
Verify your hypothesis about
fault-tolerance by continuously
introducing chaos
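A very small form of fault injection is a wrapper that makes a dependency fail at a configurable rate, so the hypothesis "callers survive this fault" can be checked directly. Everything named below (with_chaos, InjectedFault, read_profile, the sample data) is hypothetical and only illustrates the idea; it is not how any specific chaos tool works.

```python
import random


class InjectedFault(Exception):
    """Marks a failure introduced on purpose by the experiment."""


def with_chaos(fn, failure_rate, rng=None):
    """Wrap fn so that a fraction of calls fail with InjectedFault."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("fault injected by chaos wrapper")
        return fn(*args, **kwargs)

    return wrapper


def read_profile(fetch, user_id):
    """Code under test: expected to degrade instead of failing."""
    try:
        return fetch(user_id)
    except Exception:
        return {"id": user_id, "name": "unknown"}  # degraded default


def experiment(runs=100, failure_rate=0.3, seed=1):
    fetch = with_chaos(lambda uid: {"id": uid, "name": "Ada"},
                       failure_rate, random.Random(seed))
    results = [read_profile(fetch, 7) for _ in range(runs)]
    degraded = sum(1 for r in results if r["name"] == "unknown")
    return len(results), degraded
```

If read_profile ever raised under injected faults, the experiment would fail and disprove the hypothesis before a real outage does.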
Learn from mistakes
Each incident is an opportunity to
make the system more robust
Learn: in practice
PARTIAL FAILURE
Monitor and alert
Understand the behaviour of
the system in production
How to learn
Reflect on incidents
Analyse the root cause and
prevent recurrences
Test what if…?
Deliberately introduce chaos
to assess fault-tolerance
Hope is not a strategy
Test, observe, reflect
Life starts after releasing
“Code complete” is not “production ready”
Be cynical
Do not trust anybody
Robustness is an attitude
Some disasters can be prevented
Build and test with failure in mind
Faults are unavoidable
Any possible fault will eventually happen
Leo Hidalgo; Flickr (www.flickr.com/photos/ileohidalgo/), CC-by
References
Release It! (Michael Nygard)
Site Reliability Engineering (Google)
Thank you!
