Chaos Engineering
Injecting failure for building
resilience in systems
Nice to meet you
YURY NIÑO
Software Engineer and Chaos
Engineer Advocate.
Loves building software applications, solving
resilience issues and teaching. Passionate about
reading, writing and cycling.
Agenda
● Resilience vs Reliability
● Why does the world need Resilience and
Reliability?
● Chaos Engineering
● Principles of Chaos
● Chaos in Practice
● Game Days
How many of you
have encountered a crash of
your systems in production?
A recognition for ...
This talk is dedicated to
the well-caffeinated #SystemAdministrators
who get woken up in the
middle of the night when “things go
bump”.
#EngineeringTeam #DigitalFactory
@jnhernandz @
What is a
Resilient System?
A resilient system can maintain an acceptable level
of service in the face of failure.
A resilient system can weather a storm, such as a
large-scale natural disaster or a controlled chaos
engineering experiment.
Tammy Butow, Principal SRE at Gremlin
https://securethegrid.com
A distributed system in production needs to be
resilient in order to be reliable, and this is precisely
the target that we Software Engineers, Systems
Engineers, Site Reliability Engineers and Chaos
Engineers always aim for.
Mine :)
Why does the world need
Resilient Systems?
Because ...
We are surrounded by
distributed systems.
When we read the news on our
cellphones, send an email or buy our
lunch ...
We do not tolerate
their failure!
February 28th, 2017 will be remembered
● Amazon Simple Storage Service (S3) went down in US-EAST-1.
● The outage lasted about 4 hours.
● More than 100,000 websites across the world were impacted.
Me :(
The World is Chaotic!
● Distributed systems contain many moving
parts.
● Many things can go wrong.
○ Hard disks can fail.
○ The network can go down.
○ Customer traffic can overload the system.
How many of you know
what Chaos
Engineering is?
Chaos Engineering
It is the discipline of experimenting in
production on a distributed system in
order to reveal its weaknesses and to
build confidence in its capacity for
resilience.
https://principlesofchaos.org/
Chaos Engineering
It is deliberately inducing stress or
fault into software and/or hardware as
a way of learning/verifying things
about systems.
https://www.gremlin.com
Chaos Engineering is about
● Simulating the failure of a datacenter.
● Injecting latency between services.
● Randomly causing exceptions.
● Changing the system clock (time travel).
● Emulating I/O errors.
http://principlesofchaos.org/
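Two of the items above, injecting latency and randomly causing exceptions, can be sketched in a few lines of Python. This is only an illustration, not how any particular chaos tool works: the decorator name `inject_faults`, its parameters and the `InjectedFault` exception are all hypothetical names chosen for this example.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a failing dependency (hypothetical name)."""


def inject_faults(latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a function so every call is delayed and may fail at random.

    latency_s  -- extra delay added before each call
    error_rate -- probability in [0, 1] that a call raises InjectedFault
    rng        -- random.Random instance; pass a seeded one for repeatable runs
    sleep      -- sleep function, injectable so tests do not have to wait
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # simulated network or service latency
            if rng.random() < error_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Record the requested delays instead of really sleeping.
delays = []


@inject_faults(latency_s=0.5, sleep=delays.append)
def get_dashboard():
    return "dashboard data"


@inject_faults(error_rate=1.0)
def always_failing():
    return "never reached"
```

Keeping the random source and the sleep function injectable makes an experiment repeatable, which matters when the same fault has to be replayed after a fix.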
History of Chaos Engineering
2008: Chaos Engineering began at Netflix.
2010: Chaos Monkey was launched.
2014: The role of Chaos Engineer was created.
2018: A lot of resources for Chaos Engineering.
Kolton Andrus
Chaos in Practice
Principles of
Chaos
https://principlesofchaos.org/
1. Steady State
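One way to make a steady state concrete is to pick a measurable output of the system and a tolerance around its normal value. The sketch below assumes the metric is the share of responses that are not server errors, and that a 5-point drop is acceptable; both choices, and the function names, are assumptions made for illustration.

```python
def success_rate(status_codes):
    """Fraction of responses that are not server errors (HTTP 5xx)."""
    if not status_codes:
        raise ValueError("need at least one observation")
    ok = sum(1 for code in status_codes if code < 500)
    return ok / len(status_codes)


def within_steady_state(baseline, observed, tolerance=0.05):
    """True when the observed success rate drops by at most `tolerance`
    compared with the baseline measured before the experiment."""
    return success_rate(baseline) - success_rate(observed) <= tolerance


# Baseline taken before injecting any failure: 99 of 100 requests succeed.
baseline = [200] * 99 + [500]
# During the experiment: a small dip (96%) and a large one (80%).
small_dip = [200] * 96 + [503] * 4
large_dip = [200] * 80 + [500] * 20
```

With these samples, `within_steady_state(baseline, small_dip)` holds while `within_steady_state(baseline, large_dip)` does not, so only the second run would count as a deviation from steady state.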
2. Hypothesis: Circuit Breaker builds Resilience
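The hypothesis relies on the circuit breaker pattern: after repeated failures the breaker opens and calls fail fast to a fallback, and after a pause it lets one trial call through. The following is a minimal sketch of that pattern in plain Python. It is not the Hystrix implementation, and the class name, thresholds and injectable clock are assumptions for illustration.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout_s:
            return "half-open"      # allow one trial call
        return "open"

    def call(self, func, fallback):
        if self.state == "open":
            return fallback()       # fail fast, do not touch the dependency
        try:
            result = func()
        except Exception:
            self.failures += 1
            if (self.state == "half-open"
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()   # open (or re-open) the circuit
            return fallback()
        # Success: reset the counters and close the circuit.
        self.failures = 0
        self.opened_at = None
        return result
```

A chaos experiment can then check the hypothesis directly: inject failures into the wrapped call and verify that the breaker reports `open` and that the dependency stops receiving traffic.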
4. Run the Experiment
Application Name: Finer
Observability: DataDog
Hypothesis: Circuit Breaker works
Environment: My Home
Duration: 5 - 10 seconds
Load: 1 request
Results:
Actions:
4. Run the Experiment
Application Name: Finer
Observability: DataDog
Hypothesis: Circuit Breaker works.
Facing latencies > 5 seconds between
dashboard_api and smart_api, the circuit
should open.
Environment: My Home
Duration: 20 milliseconds
Load: 1 request
Results: Issue #4356
Actions:
Configure the proper Hystrix parameters
according to the results.
Implement a fallback.
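The two actions, tuning a timeout and implementing a fallback, can be illustrated with a small wrapper that gives up on a slow call and returns a fallback value instead. This is a plain Python sketch and not the Hystrix API; `call_with_timeout` and `slow_smart_api` are hypothetical names, and the 0.3 second sleep merely stands in for the latency injected between dashboard_api and smart_api.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_with_timeout(func, timeout_s, fallback):
    """Run func in a worker thread; return fallback() if it takes too long."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback()
    finally:
        # Do not block on the slow call; the worker thread finishes on its own.
        executor.shutdown(wait=False)


def slow_smart_api():
    time.sleep(0.3)  # stands in for injected latency on the dependency
    return "fresh data"


def cached_answer():
    return "cached data"
```

Calling `call_with_timeout(slow_smart_api, 0.05, cached_answer)` returns the cached value because the call exceeds the timeout, while a generous timeout lets the fresh result through. The timeout chosen in practice should come from the latencies observed during the experiment.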
Game Days
Game Day: Roles
Master of Disaster First on-call Team
https://www.pinterest.es/pin/824299538021645731/
Game Days can Transform our Teams
Even though Game Days are not real incidents, they
help engineers gain confidence.
Since we engineers experience failure as part
of our job, we should start designing for failure.
Me :)
The best time to learn about fire
is when you’re on fire.
—Jen Hammond, New Relic engineering manager
How to begin ...
https://chaosengineering.slack.com
https://github.com/dastergon/awesome-chaos-engineering
https://www.infoq.com/chaos-engineering
@yurynino
