MongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey

Chaos Testing with MongoDB
Chaos Monkey for MongoDB

Justin LaBreck, Sr. Consulting Engineer, MongoDB
Timo Geusch, Sr. Consulting Engineer, MongoDB

Chaos Monkey
Havoc in production since 2011

• MongoDB has built-in high availability
• Chaos Monkey is a tool
• Test your application
• Test your infrastructure
• Test your response to disaster
+
Swipe right!
=
A Rigorous Testing Methodology

Anti-goals: A Way Not To Chaos Test
• Chaos Monkey is a tool to help identify problems
• Chaos Testing is NOT:
• A topology recommendation
• A test framework
• A benchmarking system
• The answer to all life’s woes

Terminology
Removing the chaos from chaos testing
Term: Definition
Scenario: A concern you have about your software or infrastructure.
    “Will my application stay online if MongoDB shuts down?”
Action: An operation performed on a resource (server, network, disk, etc.).
    “Shut down mongod”
Dimension: A factor that needs to be considered when creating a test.
    “Which mongod gets shut down?”
Matrix: An exhaustive comparison of actions and dimensions.

Where Do You Begin?
Using a non-chaotic methodology
1. Develop a list of scenarios
   Why are we performing tests? What concerns do we have?
2. Create a testing matrix and determine dimensions
   What are we going to test? What in our software might trigger problems based on our concerns?
3. Define actions to simulate all dimensions
   How are we performing the tests? How are we quantifying the results?
4. Automate the operations and deploy
   Where are the tests being performed? Who is monitoring the impact of the tests?

1. Define Scenario
Establishing a baseline
• Primary goes down, secondary takes over
• Upgrade version of MongoDB
• Secondary comes up behind primary
• MongoDB is CPU bound
• DR site has a network problem
• Spike in connections
• Change configuration values
Each is a concern but ambiguous in practice.
Why are these improper actions?

2. Create A Testing Matrix
Defining dimensions
• Replication
    • System failure
    • Network errors/failure
    • Primary read failover
    • Secondary read failover
    • No primary (read-only cluster)
    • Failure duration
    • Resource constraints and contention
• Sharding
    • System failure
    • Network errors/failure
    • Replication failover
    • Mongos failure
    • Configuration server failure
    • Resource constraints and contention
These are all big concerns, but can be simplified greatly into fewer actions.
(more on this later)

2. Create A Testing Matrix
An example to develop “stop a mongod”
PrimaryOnly         Primary  Secondary  Secondary
Primary             Up       Down       Up
Secondary           Up       Up         Down
Secondary           Down     Down       Down

SecondaryPreferred  Primary  Secondary  Secondary
Primary             Up       Down       Up

Dimension 1, Dimension 2
Note: these matrixes are incomplete for brevity!
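
A complete matrix can be generated rather than written by hand. The sketch below assumes the two dimensions are the read preference and the up/down state of each node in a three-node replica set. The labels are taken from the slide; the function name and the three-node layout are illustrative assumptions.

```bash
#!/bin/bash
# Print one row per combination: read preference, then the state of the
# primary and of each of the two secondaries.
# Assumption: a three-node replica set and the two read preferences
# named on the slide.
generate_matrix() {
  local pref primary sec1 sec2
  for pref in PrimaryOnly SecondaryPreferred; do
    for primary in Up Down; do
      for sec1 in Up Down; do
        for sec2 in Up Down; do
          printf '%s %s %s %s\n' "$pref" "$primary" "$sec1" "$sec2"
        done
      done
    done
  done
}

generate_matrix
```

Two read preferences and three nodes with two states each give 2 × 2 × 2 × 2 = 16 rows. Adding another dimension, such as failure duration, is one more nested loop.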

Reminder!
We’re not testing MongoDB.
Bringing down a mongod or mongos as part of a test happens to test
your software. MongoDB has already been vetted.

3. Define Actions
Move from dimensions to performable actions
PrimaryOnly  Primary  Secondary  Secondary
Primary      Up       Down       Up
Dimension 1
All those boxes, just three actions!
● Shut down 1 secondary
● Shut down primary and 1 secondary
● Shut down all nodes
Now iterate!
• Find missing dimensions
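
The collapse from many matrix cells to a few actions can be sketched in code. The example below assumes a three-node replica set and maps each combination of node states to the action that produces it. Only three of the action labels appear on the slide; the others, and both function names, are illustrative assumptions.

```bash
#!/bin/bash
# Map one row of node states (primary, secondary, secondary) to an action.
action_for_row() {
  local primary=$1 sec_down=0 state
  for state in "$2" "$3"; do
    if [ "$state" = "Down" ]; then
      sec_down=$((sec_down + 1))
    fi
  done
  if [ "$primary" = "Down" ]; then
    case $sec_down in
      0) echo "shut down primary" ;;
      1) echo "shut down primary and 1 secondary" ;;
      2) echo "shut down all nodes" ;;
    esac
  else
    case $sec_down in
      0) echo "no action (baseline)" ;;
      1) echo "shut down 1 secondary" ;;
      2) echo "shut down 2 secondaries" ;;
    esac
  fi
}

# Walk all 8 node-state combinations and keep each action only once.
distinct_actions() {
  local primary sec1 sec2
  for primary in Up Down; do
    for sec1 in Up Down; do
      for sec2 in Up Down; do
        action_for_row "$primary" "$sec1" "$sec2"
      done
    done
  done | sort -u
}

distinct_actions
```

The read preference does not appear here at all: it changes what the application observes, not which action is performed, which is why the number of actions stays small as the matrix grows.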

4. Automate the Actions and Deploy!
Write code
Test code
Deploy code
Repeat.
Identify weak points in your infrastructure
Release the chaos monkey
More on the how, up next.

Questions & Answers
(halfway through)

Implementation considerations
• Netflix runs Chaos Monkey in production. Should you, too?
• How do you integrate with your existing system and test suite?
• How do you capture your application behaviour, desired and
actual?
• Who is responsible for continuous monitoring and maintenance of
the test suite?
• Does your test system mirror production enough to be meaningful?

Start with building blocks
• Break down your test cases into common components
• Take your components and use them to build this:

Scenario:
“I want to simulate an unreliable secondary”
A deep dive

Deep dive - simulating an “unreliable” mongod
What does “unreliable” mean in the context of your application?
Could it be:
• Intermittent connection failures?
• Variable response times that may breach the SLA?
• Completely overloaded server?
• Write errors?
• Random crashes?

Example 1 - overloaded server
What causes server overload?
• I/O overloaded?
• CPU overloaded?
How do we simulate it?
• Throttle the I/O or move to a VM with less provisioned I/O
• Throttle the CPU?
Wait, isn’t that the same observable behaviour?

Overloaded server, continued
We’re not testing MongoDB. We’re testing your application instead!
What matters is the behaviour your application experiences, not
the minute details of the exact MongoDB behaviour affecting
your application.

Overloaded server, continued
• Pick what is easier to simulate in your environment
• If it’s restricting CPU, do that
• If it’s throttling I/O, do that
• Are there tools?
• On Linux, we have:
• cpulimit to throttle the CPU
• User mode file systems to simulate I/O restrictions
• Network throttling
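
As a rough sketch of the Linux tools listed above, the wrappers below build a cpulimit command to cap a process and tc/netem commands to add network latency. The PID 1234, the interface eth0, the 200ms delay, and all function names are placeholder assumptions. Commands are only printed unless DRY_RUN is set to 0, and the real commands need root privileges.

```bash
#!/bin/bash
# Dry-run by default: print each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Cap a running process at a CPU percentage.
# $1 = PID of the process (e.g. mongod), $2 = limit in percent.
throttle_cpu() {
  run cpulimit -p "$1" -l "$2"
}

# Add a fixed delay to all traffic leaving an interface.
# $1 = interface name, $2 = delay such as 200ms.
add_latency() {
  run tc qdisc add dev "$1" root netem delay "$2"
}

# Remove the root qdisc again, restoring normal latency.
clear_latency() {
  run tc qdisc del dev "$1" root
}

throttle_cpu 1234 20
add_latency eth0 200ms
clear_latency eth0
```

When executed for real, cpulimit stays in the foreground until the target process exits or cpulimit is stopped, so a test harness would typically start it in the background and stop it when the test ends.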

Example 2 - “bouncy mongod”
What is our scenario?
• Unreliable machine running mongod?
• Intermittent network connection?

Bouncy mongod - continued
So what we really want is a mongod that disconnects and reconnects
to a replica set or sharded cluster. Consider:
• How long before it reconnects? Does it fall off the oplog?
• Does its behaviour trigger elections? Does it have to?
• Are we doing secondary reads?

Bouncy mongod - bash script
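#!/bin/bash
# Repeatedly stop and start mongod with a random pause in between.
max=${1:-10}   # number of bounce cycles; the default of 10 is an assumption
for i in $(seq 1 "$max"); do
  service mongod stop
  sleep $((RANDOM % 60))   # capped at 59s (assumption); bare $RANDOM can reach ~9 hours
  service mongod start
  sleep $((RANDOM % 60))
done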

3. Define Your Operations
Category Operation
Server Mongod shutdown; graceful or bound
Server Mongod shutdown; forced
Server Mongod slow down or halt; CPU bound operation
Server Mongod halt; disk fault/failure
Server Mongod secondary goes into recovery mode
Server Mongod primary step-down
Server Multiple mongod instances shutdown gracefully across multiple servers
Server Multiple mongod instances shutdown forcefully across multiple servers
Server Mongos shutdown; graceful

3. Define Your Operations
Category Operation
Network Network partition between primary and one secondary
Network Network partition between primary and one secondary without chaining
Network Short network partition between primary and one secondary
Network Network partition between mongod and the application
Network Short network partition between mongod and the application
Network Network partition between primary and secondary with RECOVERY mode
Network Network partition between mongos and the application
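
One possible way to implement the partition operations on Linux is with iptables rules on one of the two hosts. The sketch below is an assumption about how that might look, not the tooling used by the presenters. The peer address 10.0.0.12 and the function names are placeholders, and commands are only printed unless DRY_RUN is set to 0.

```bash
#!/bin/bash
# Dry-run by default: print each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Drop traffic to and from a peer, as seen from the host running this script.
# $1 = address of the peer host.
partition_from() {
  run iptables -A INPUT -s "$1" -j DROP
  run iptables -A OUTPUT -d "$1" -j DROP
}

# Delete the same two rules to heal the partition.
heal_partition() {
  run iptables -D INPUT -s "$1" -j DROP
  run iptables -D OUTPUT -d "$1" -j DROP
}

# A "short" partition: block, wait, unblock.
# $1 = address of the peer host, $2 = duration in seconds.
short_partition() {
  partition_from "$1"
  run sleep "$2"
  heal_partition "$1"
}

short_partition 10.0.0.12 15
```

Run on the primary with a secondary's address, this approximates "short network partition between primary and one secondary". Run on a mongod host with the application server's address, it approximates the mongod-to-application cases.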

Our experience on-site ...
• Agent/server architecture
• Deploy to multiple systems
• Standardized verbiage and lexicon, both in your communications
and implementation
• “restart mongos”
• “firewall block port 27017”
• Combine servers and verbiage
• { localhost: [ “stop mongod”, “sleep 15” ] }
• { host1.local: [ “firewall unblock 27017” ] }
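
A standardized vocabulary like this can be translated into commands by a small dispatcher on each agent. The sketch below is a guess at the shape of such a dispatcher, not the implementation used on-site. The mapping of each phrase to service, sleep, and iptables commands is an assumption, as are the function names, and commands are only printed unless DRY_RUN is set to 0.

```bash
#!/bin/bash
# Dry-run by default: print each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Translate one phrase such as "stop mongod" into a command.
perform() {
  local phrase=$1 port
  # Split the phrase into words: $1 becomes the verb, the rest its arguments.
  set -- $phrase
  case "$1" in
    stop|start|restart)
      run service "$2" "$1" ;;
    sleep)
      run sleep "$2" ;;
    firewall)
      # The port is the last word, so both "block port 27017" and
      # "unblock 27017" are accepted.
      for port in "$@"; do :; done
      case "$2" in
        block)   run iptables -A INPUT -p tcp --dport "$port" -j DROP ;;
        unblock) run iptables -D INPUT -p tcp --dport "$port" -j DROP ;;
        *) echo "unknown firewall verb: $phrase" >&2; return 1 ;;
      esac ;;
    *)
      echo "unknown verb: $phrase" >&2; return 1 ;;
  esac
}

# Run a host's list of phrases in order, stopping at the first failure.
perform_all() {
  local phrase
  for phrase in "$@"; do
    perform "$phrase" || return 1
  done
}

perform_all "stop mongod" "sleep 15" "firewall block port 27017"
```

Keeping the phrases identical in conversation and in code means a test plan written in plain language can be handed to the agents without rewording.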
