Chaos Testing with MongoDB
Chaos Monkey for MongoDB
Justin LaBreck
Sr. Consulting Engineer,
MongoDB
Timo Geusch
Sr. Consulting Engineer,
MongoDB
Chaos Monkey
Havoc in production since 2011
• MongoDB has built-in high availability
• Chaos Monkey is a tool
• Test your application
Test your infrastructure
Test your response to disaster
+
Swipe right!
=
A Rigorous
Testing
Methodology
Anti-goals: A Way Not To Chaos Test
• Chaos Monkey is a tool to help identify problems
• Chaos Testing is NOT:
• A topology recommendation
• A test framework
• A benchmarking system
• The answer to all life’s woes
Terminology
Removing the chaos from chaos testing
Term        Definition
Scenario    A concern you have about your software or infrastructure.
            “Will my application stay online if MongoDB shuts down?”
Action      An operation performed on a resource (server, network, disk, etc.).
            “Shut down mongod”
Dimension   A factor that needs to be considered when creating a test.
            “Which mongod gets shut down?”
Matrix      An exhaustive comparison of actions and dimensions.
Where Do You Begin?
Using a non-chaotic methodology
1. Develop a list of scenarios
   Why are we performing tests? What concerns do we have?
2. Create a testing matrix and determine dimensions
   What are we going to test? What in our software might trigger problems based on our concerns?
3. Define actions to simulate all dimensions
   How are we performing the tests? How are we quantifying the results?
4. Automate the operations and deploy
   Where are the tests being performed? Who is monitoring the impact of the tests?
1. Define Scenario
Establishing a baseline
• Primary goes down, secondary takes over
• Upgrade version of MongoDB
• Secondary comes up behind primary
• MongoDB is CPU bound
• DR site has a network problem
• Spike in connections
• Change configuration values
Each is a concern but ambiguous in practice.
Why are these improper actions?
2. Create A Testing Matrix
Defining dimensions
• Replication
  • System failure
  • Network errors/failure
  • Primary read failover
  • Secondary read failover
  • No primary (read-only cluster)
  • Failure duration
  • Resource constraints and contention
• Sharding
  • System failure
  • Network errors/failure
  • Replication failover
  • Mongos failure
  • Configuration server failure
  • Resource constraints and contention
These are all big concerns, but they can be simplified greatly into fewer actions (more on this later).
2. Create A Testing Matrix
An example to develop “stop a mongod”
PrimaryOnly          Primary   Secondary   Secondary
Primary              Up        Down        Up
Secondary            Up        Up          Down
Secondary            Down      Down        Down

SecondaryPreferred   Primary   Secondary   Secondary
Primary              Up        Down        Up
Secondary            Up        Up          Down
Secondary            Down      Down        Down

Dimension 1   Dimension 2
Note: these matrixes are incomplete for brevity!
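
As a rough sketch of what a complete matrix could look like, the script below crosses two dimensions: read preference, and the up/down state of each member of a three-node replica set. The function name `build_matrix` and the node labels are illustrative assumptions, not part of the original tutorial, and only two read preferences are listed to mirror the slide.

```bash
#!/bin/bash
# Hypothetical helper: print every cell of a test matrix, one per line.
# One dimension is the read preference, the other is the state of each node.
build_matrix() {
    for pref in primary secondaryPreferred; do
        for p in up down; do
            for s1 in up down; do
                for s2 in up down; do
                    echo "$pref primary=$p secondary1=$s1 secondary2=$s2"
                done
            done
        done
    done
}

# Two read preferences x eight node-state combinations.
build_matrix
```

Each printed line is one candidate test case. Adding another dimension (failure duration, for example) means adding another loop, which is why the matrix grows quickly and is worth generating rather than writing by hand.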
Reminder!
We’re not testing MongoDB.
Bringing down a mongod or mongos as part of a test happens to test
your software. MongoDB has already been vetted.
3. Define Actions
Move from dimensions to performable actions
PrimaryOnly          Primary   Secondary   Secondary
Primary              Up        Down        Up
Secondary            Up        Up          Down
Secondary            Down      Down        Down

Dimension 1
All those boxes, just three actions!
● Shut down 1 secondary
● Shut down primary and 1 secondary
● Shut down all nodes
Now iterate!
• Find missing dimensions
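
One way to see why many matrix cells collapse into a few actions: the action only depends on which nodes are up or down, not on the read preference the application uses. The sketch below assumes matrix cells arrive one per line with the read preference as the first space-separated field; that input format and the name `distinct_actions` are assumptions made for illustration.

```bash
#!/bin/bash
# Hypothetical helper: reduce matrix cells (read on stdin) to distinct actions.
# The first field (read preference) is dropped, then duplicates are removed.
distinct_actions() {
    cut -d' ' -f2- | sort -u
}

# Four cells, but only two different node states to act on.
printf '%s\n' \
    "primary primary=up secondary1=down" \
    "secondaryPreferred primary=up secondary1=down" \
    "primary primary=down secondary1=down" \
    "secondaryPreferred primary=down secondary1=down" \
    | distinct_actions
```

The application-side dimension (read preference) still has to be exercised for every action, but the infrastructure only needs one script per distinct node state.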
4. Automate the Actions and Deploy!
Write code
Test code
Deploy code
Repeat.
Identify weak points in your infrastructure
Release the chaos monkey
More on the how, up next.
Questions & Answers
(halfway through)
Implementation considerations
• Netflix runs Chaos Monkey in production. Should you, too?
• How do you integrate with your existing system and test suite?
• How do you capture your application behaviour, desired and
actual?
• Who is responsible for continuous monitoring and maintenance of
the test suite?
• Does your test system mirror production enough to be meaningful?
Start with building blocks
• Break down your test cases into common components
• Take your components and use them to build this:
Scenario:
“I want to simulate an unreliable secondary”
A deep dive
Deep dive - simulating an “unreliable” mongod
What does “unreliable” mean in the context of your application?
Could it be:
• Intermittent connection failures?
• Variable response times that may breach the SLA?
• Completely overloaded server?
• Write errors?
• Random crashes?
Example 1 - overloaded server
What causes server overload?
• I/O overloaded?
• CPU overloaded?
How do we simulate it?
• Throttle the I/O or move to a VM with less provisioned I/O
• Throttle the CPU?
Wait, isn’t that the same observable behaviour?
Overloaded server, continued
We’re not testing MongoDB. We’re testing your application instead!
What matters is the behaviour your application experiences, not the minute details of the exact MongoDB behaviour affecting your application.
Overloaded server, continued
• Pick what is easier to simulate in your environment
• If it’s restricting CPU, do that
• If it’s throttling I/O, do that
• Are there tools?
• On Linux, we have:
• cpulimit to throttle the CPU
• User mode file systems to simulate I/O restrictions
• Network throttling
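
A minimal sketch of two of these Linux tools, assuming `cpulimit` and `tc` (with the netem queueing discipline) are installed. The wrapper names `throttle_cpu`, `add_latency` and `clear_latency`, and the `DRY_RUN` switch, are hypothetical. By default the commands are only printed, because the real ones need root and affect the whole host.

```bash
#!/bin/bash
# Sketch only: commands are printed unless DRY_RUN=0 is set (root required then).
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Cap a running process at a percentage of one CPU core, in the background.
throttle_cpu() {    # usage: throttle_cpu PID PERCENT
    run cpulimit -p "$1" -l "$2" -b
}

# Add artificial latency to all traffic leaving a network interface.
add_latency() {     # usage: add_latency INTERFACE DELAY   (e.g. eth0 200ms)
    run tc qdisc add dev "$1" root netem delay "$2"
}

# Remove the root queueing discipline again, restoring normal latency.
clear_latency() {   # usage: clear_latency INTERFACE
    run tc qdisc del dev "$1" root
}
```

For example, `throttle_cpu "$(pidof mongod)" 25` followed by `add_latency eth0 200ms` would approximate an overloaded secondary as seen from the application; the interface name and the numbers are placeholders to adapt to your environment.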
Example 2 - “bouncy mongod”
What is our scenario?
• Unreliable machine running mongod?
• Intermittent network connection?
Bouncy mongod - continued
So what we really want is a mongod that disconnects and reconnects
to a replica set or sharded cluster. Consider:
• How long before it reconnects? Does it fall off the oplog?
• Does its behaviour trigger elections? Does it have to?
• Are we doing secondary reads?
Bouncy mongod - bash script
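#!/bin/bash
# Number of stop/start cycles. The original slide leaves $max undefined;
# a default of 10 is assumed here.
max=${max:-10}
for i in $(seq 1 "$max"); do
    service mongod stop     # service syntax is: service <name> <command>
    sleep "$RANDOM"         # 0-32767 seconds; scale down for shorter outages
    service mongod start
    sleep "$RANDOM"
done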
3. Define Your Operations
Category   Operation
Server     Mongod shutdown; graceful or bound
Server     Mongod shutdown; forced
Server     Mongod slow down or halt; CPU bound operation
Server     Mongod halt; disk fault/failure
Server     Mongod secondary goes into recovery mode
Server     Mongod primary step-down
Server     Multiple mongod instances shut down gracefully across multiple servers
Server     Multiple mongod instances shut down forcefully across multiple servers
Server     Mongos shutdown; graceful
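
A few of the server operations above could be scripted roughly as follows. This is a sketch under assumptions: the function names are hypothetical, `mongosh` is assumed to be the installed shell (older deployments would use the legacy `mongo` shell instead), and commands are printed rather than executed unless `DRY_RUN=0` is set.

```bash
#!/bin/bash
# Sketch only: commands are printed unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Graceful shutdown: mongod performs a clean shutdown on SIGTERM.
stop_graceful() {   # usage: stop_graceful PID
    run kill -TERM "$1"
}

# Forced shutdown: SIGKILL gives mongod no chance to clean up (simulates a crash).
stop_forced() {     # usage: stop_forced PID
    run kill -KILL "$1"
}

# Primary step-down: ask the primary to step down for 60 seconds.
step_down() {       # usage: step_down HOST:PORT
    run mongosh --host "$1" --quiet --eval 'rs.stepDown(60)'
}
```

Keeping the graceful and forced variants as separate functions makes it explicit in the test log which of the two operations was actually exercised.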
3. Define Your Operations
Category   Operation
Network    Network partition between primary and one secondary
Network    Network partition between primary and one secondary without chaining
Network    Short network partition between primary and one secondary
Network    Network partition between mongod and the application
Network    Short network partition between mongod and the application
Network    Network partition between primary and secondary with RECOVERY mode
Network    Network partition between mongos and the application
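
A network partition between two hosts can be approximated with `iptables` by dropping packets that arrive from the peer. The sketch below is one possible approach, not the tutorial's implementation: the function names and the `DRY_RUN` switch are hypothetical, and only inbound traffic is dropped (which is enough to stall a TCP connection, though a stricter partition would also add a matching OUTPUT rule).

```bash
#!/bin/bash
# Sketch only: commands are printed unless DRY_RUN=0 is set (root required then).
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Start a partition: drop every packet arriving from the given peer address.
partition_from() {    # usage: partition_from PEER_IP
    run iptables -A INPUT -s "$1" -j DROP
}

# Heal the partition: delete the rule added by partition_from.
heal_from() {         # usage: heal_from PEER_IP
    run iptables -D INPUT -s "$1" -j DROP
}

# Short partition: block the peer, wait, then heal.
short_partition() {   # usage: short_partition PEER_IP SECONDS
    partition_from "$1"
    run sleep "$2"
    heal_from "$1"
}
```

Run on a secondary with the primary's address to isolate the two from each other, or on a mongod host with the application server's address to cut off the application.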
Our experience on-site ...
• Agent/server architecture
• Deploy to multiple systems
• Standardized verbiage and lexicon, both in your communications
and implementation
• “restart mongos”
• “firewall block port 27017”
• Combine servers and verbiage
• { localhost: [ “stop mongod”, “sleep 15” ] }
• { host1.local: [ “firewall unblock 27017” ] }
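
The standardized verbiage above lends itself to a small translation layer on each agent. The following is a minimal sketch of that idea, not the tool used on-site: the function names, the `host|action` input format, the `mongos` service name and the choice of `iptables` for the firewall verbs are all assumptions. Commands are printed, not executed.

```bash
#!/bin/bash
# Hypothetical agent-side helper: translate one standardized phrase
# into the command line that would implement it on this host.
translate() {
    case "$1" in
        "stop mongod")    echo "service mongod stop" ;;
        "start mongod")   echo "service mongod start" ;;
        "restart mongos") echo "service mongos restart" ;;
        "sleep "*)        echo "sleep ${1#sleep }" ;;
        "firewall block port "*)
            echo "iptables -A INPUT -p tcp --dport ${1#firewall block port } -j DROP" ;;
        "firewall unblock "*)
            echo "iptables -D INPUT -p tcp --dport ${1#firewall unblock } -j DROP" ;;
        *)  echo "unknown action: $1" >&2
            return 1 ;;
    esac
}

# Read a plan as "host|action" lines on stdin and print what each host would run.
run_plan() {
    while IFS='|' read -r host action; do
        printf '%s: %s\n' "$host" "$(translate "$action")"
    done
}

printf '%s\n' "localhost|stop mongod" "localhost|sleep 15" \
    "host1.local|firewall unblock 27017" | run_plan
```

Rejecting unknown phrases with a non-zero status keeps the lexicon closed: a typo in a test plan fails loudly instead of silently doing nothing.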
Questions & Answers
MongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
