Chaos Engineering at Jet.com
Rachel Reese | @rachelreese | rachelree.se
Jet Technology | @JetTechnology | tech.jet.com
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/jet-microservices-testing
Presented at QCon London
www.qconlondon.com
Why do you need chaos testing?
The world is naturally chaotic
But do we need more testing?
Unit, Sanity, Random, Continuous
Usability, A/B, Localization, Acceptance
Regression, Performance, Integration, Security
You’ve already tested all your
components in multiple ways.
Microservices Chaos Testing at Jet
It’s super important to test the interactions in your
environment
Jet? Jet who?
Taking on Amazon!
Launched July 22
• Both Apple & Android named our app one of their top apps of 2015
• Over 20k orders per day
• Over 10.5 million SKUs
• #4 marketplace worldwide
• 700 microservices
We’re hiring!
http://jet.com/about-us/working-at-jet
Azure, Web sites, Cloud services, VMs, Service bus queues, Service bus topics,
Blob storage, Table storage, Queues, Hadoop, DNS, Active directory, SQL Azure,
R, F#, Paket, FSharp.Data, Chessie, Unquote, SQLProvider, Python, Deedle, FAKE,
FSharp.Async, React, Node, Angular, SAS, Storm, Elasticsearch, Xamarin,
Microservices, Consul, Kafka, PDW, Splunk, Redis, SQL, Puppet, Jenkins,
Apache Hive, Apache Tez
Microservices at Jet
Microservices
• An application of the single responsibility principle at the service level.
“A class should have one, and only one, reason to change.”
• Has an input, produces an output.
Benefits
• Easy scalability
• Independent releasability
• More even distribution of complexity
What is chaos engineering?
It’s just wreaking havoc with your code
for fun, right?
Chaos Engineering is…
Controlled experiments on a distributed system
that help you build confidence in the system’s
ability to tolerate the inevitable failures.
Principles of Chaos Engineering
1. Define “normal”
2. Assume “normal” will continue in both a control group
and an experimental group.
3. Introduce chaos: servers that crash, hard drives that
malfunction, network connections that are severed, etc.
4. Look for a difference in behavior between the control
group and the experimental group.
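The four steps above can be sketched in F# as a comparison between a control group and an experimental group. This is a minimal sketch under stated assumptions: the steady-state metric (orders per minute), the Observation type, the 10% tolerance, and the sample numbers are all illustrative and do not come from the talk.

```fsharp
// Assumption: "normal" is measured as orders per minute over an observation window.
type Observation = { Group: string; OrdersPerMinute: float list }

let mean (xs: float list) = List.average xs

// Relative difference between the experimental mean and the control mean.
let relativeDeviation (control: Observation) (experiment: Observation) =
    let c = mean control.OrdersPerMinute
    let e = mean experiment.OrdersPerMinute
    abs (e - c) / c

// The hypothesis "normal continues" holds if the deviation stays within the tolerance.
let hypothesisHolds (tolerance: float) control experiment =
    relativeDeviation control experiment <= tolerance

// Made-up sample data: the experimental group had chaos introduced, the control did not.
let control = { Group = "control"; OrdersPerMinute = [ 14.0; 15.0; 13.0; 14.0 ] }
let experiment = { Group = "experiment"; OrdersPerMinute = [ 13.0; 14.0; 14.0; 13.0 ] }

printfn "Steady state held: %b" (hypothesisHolds 0.10 control experiment)
```

A real experiment would pull these observations from monitoring rather than from literals; the point is only that step 4 reduces to a measurable comparison.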
Going farther
Build a Hypothesis around Normal Behavior
Vary Real-world Events
Run Experiments in Production
Automate Experiments to Run Continuously
From http://principlesofchaos.org/
Benefits of chaos engineering
• You're awake
• Design for failure
• Healthy systems
• Self service
Current examples of chaos engineering
Maybe you meant Netflix’s Chaos Monkey?
How is Jet different?
We’re not testing in prod (yet).
SQL restarts & geo-replication
Start
- Checks the source db for write access
- Renames db on destination server (to create a new one)
- Creates a geo-replication in the destination region
Stop
- Shuts down cloud services writing to source db
- Sets source db as read-only
- Ends continuous copy
- Allows writes to secondary db
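The Stop sequence above only makes sense if each step runs after the previous one succeeds. A minimal sketch of that ordering follows, assuming a hypothetical Step type and stub actions in place of the real Azure calls (none of these names come from the talk):

```fsharp
// Hypothetical step type: a name plus an action that reports success or failure.
type Step = { Name: string; Run: unit -> Result<unit, string> }

// Run steps in order, stopping at the first failure.
// Returns the completed step names, or the failing step with its reason.
let runInOrder (steps: Step list) =
    let rec loop completed remaining =
        match remaining with
        | [] -> Ok (List.rev completed)
        | step :: rest ->
            match step.Run () with
            | Ok () -> loop (step.Name :: completed) rest
            | Error reason -> Error (step.Name, reason)
    loop [] steps

// Stub action standing in for a real operation; it always succeeds.
let ok name = { Name = name; Run = fun () -> Ok () }

let stopSteps =
    [ ok "Shut down cloud services writing to source db"
      ok "Set source db as read-only"
      ok "End continuous copy"
      ok "Allow writes to secondary db" ]

match runInOrder stopSteps with
| Ok names -> printfn "Completed %d steps" (List.length names)
| Error (name, reason) -> printfn "Stopped at '%s': %s" name reason
```

Stopping at the first failure matters here: allowing writes to the secondary db while services still write to the source would split the data.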
Azure & F#
Why F#?
What FP means to us
Prefer immutability
- Avoid state changes, side effects, and mutable data
Use data in → data out transformations
- Think about mapping inputs to outputs.
Look at problems recursively
- Consider successively smaller chunks of the same problem
Treat functions as unit of work
- Higher-order functions
“The F# solution offers us an order of magnitude increase in productivity and
allows one developer to perform the work [of] a team of dedicated developers…”
Yan Cui
Lead Server Engineer, Gamesys
Concise and powerful code
C#
public abstract class Transport { }

// Car and Bus are concrete here (like Bicycle) so they can be instantiated.
public class Car : Transport {
    public string Make { get; private set; }
    public string Model { get; private set; }
    public Car (string make, string model) {
        this.Make = make;
        this.Model = model;
    }
}

public class Bus : Transport {
    public int Route { get; private set; }
    public Bus (int route) {
        this.Route = route;
    }
}

public class Bicycle : Transport {
    public Bicycle() {
    }
}
F#
type Transport =
    | Car of Make:string * Model:string
    | Bus of Route:int
    | Bicycle
Trivial to pattern match on!
F# pattern matching
C#
Concise and powerful code
F#
type Transport =
    | Car of Make:string * Model:string
    | Bus of Route:int
    | Bicycle
    | Train of Line:int

let getThereVia (transport:Transport) =
    match transport with
    | Car (make,model) -> ...
    | Bus route -> ...
    | Bicycle -> ...
Warning FS0025: Incomplete pattern matches on this expression. For example,
the value 'Train' may indicate a case not covered by the pattern(s)
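One way to clear the FS0025 warning is to handle the new case. A minimal sketch: the slide elides the branch bodies with `...`, so the strings returned here are illustrative only.

```fsharp
type Transport =
    | Car of Make: string * Model: string
    | Bus of Route: int
    | Bicycle
    | Train of Line: int

// Every case is covered, so the compiler emits no incomplete-match warning.
let getThereVia (transport: Transport) =
    match transport with
    | Car (make, model) -> sprintf "Drive the %s %s" make model
    | Bus route -> sprintf "Take bus route %d" route
    | Bicycle -> "Ride the bicycle"
    | Train line -> sprintf "Take train line %d" line

printfn "%s" (getThereVia (Train 7))
```

Adding a fifth case to Transport later would bring the warning back at every match that does not handle it, which is how the compiler points at the code that needs updating.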
Units of Measure
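The slide gives only the title, so the following is a small illustration of the F# feature rather than anything shown in the talk. The USD and item measures and the quantities are assumptions made up for this sketch.

```fsharp
// Hypothetical measures for illustration.
[<Measure>] type USD
[<Measure>] type item

let costPer = 1.96M<USD/item>
let quantity = 3M<item>

// The units cancel at compile time: (USD/item) * item = USD.
let total : decimal<USD> = costPer * quantity

// let nonsense = costPer + quantity   // would not compile: USD/item vs item

// Dividing by one unit of USD yields a plain decimal for printing.
printfn "Total: %M USD" (total / 1M<USD>)
```

Measures exist only at compile time, so the check on mixed units adds no runtime cost.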
TickSpec – an F# project
Thanks to Scott Wlaschin for his post, Cycles and modularity in the wild
SpecFlow – a comparable C# project
Chaos code!
What do our services look like?
// Define inputs & outputs
type Input =
    | Product of Product

type Output =
    | ProductPriceNile of Product * decimal
    | ProductPriceCheckFailed of PriceCheckFailed

// Define how input transforms to output
let handle (input:Input) =
    async {
        return Some(ProductPriceNile({Sku="343434"; ProductId = 17; ProductDescription = "My amazing product"; CostPer=1.96M}, 3.96M))
    }

// Define what to do with output
let interpret id output =
    match output with
    | Some (Output.ProductPriceNile (e, price)) -> async {()} // write to event store
    | Some (Output.ProductPriceCheckFailed e) -> async {()} // log failure
    | None -> async.Return ()

// Read events, handle, & interpret
let consume = EventStoreQueue.consume (decodeT Input.Product) handle interpret
Our code!
let selectRandomInstance compute hostedService = async {
    try
        let! details = getHostedServiceDetails compute hostedService.ServiceName
        let deployment = getProductionDeployment details
        let instance =
            deployment.RoleInstances
            |> Seq.toArray
            |> randomPick
        return details.ServiceName, deployment.Name, instance
    with e ->
        log.error "Failed selecting random instance\n%A" e
        reraise e
}
Our code!
let restartRandomInstance compute hostedService = async {
    try
        let! serviceName, deploymentId, roleInstance =
            selectRandomInstance compute hostedService
        match roleInstance.PowerState with
        | RoleInstancePowerState.Stopped ->
            log.info "Service=%s Instance=%s is stopped...ignoring..."
                serviceName roleInstance.InstanceName
        | _ ->
            do! restartInstance compute serviceName deploymentId roleInstance.InstanceName
    with e ->
        log.error "%s" e.Message
}
Our code!
compute
|> getHostedServices
|> Seq.filter ignoreList
|> knuthShuffle
|> Seq.distinctBy (fun a -> a.ServiceName)
|> Seq.map (fun hostedService -> async {
    try
        return! restartRandomInstance compute hostedService
    with e ->
        log.warn "failed: service=%s . %A" hostedService.ServiceName e
        return ()
})
|> Async.ParallelIgnore 1
|> Async.RunSynchronously
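The slides call knuthShuffle and randomPick without showing their definitions. The sketch below is one plausible way to write such helpers, using the standard Knuth (Fisher-Yates) shuffle. It is not Jet's implementation: the names knuthShuffleWith and randomPickWith, and the explicit Random parameter, are assumptions made so the example is self-contained and repeatable.

```fsharp
open System

// Knuth (Fisher-Yates) shuffle: walk from the end, swapping each element
// with a uniformly chosen element at or before it. Returns a new array.
let knuthShuffleWith (rng: Random) (items: seq<'a>) : 'a[] =
    let arr = Seq.toArray items
    for i in arr.Length - 1 .. -1 .. 1 do
        let j = rng.Next(i + 1)   // 0 <= j <= i
        let tmp = arr.[i]
        arr.[i] <- arr.[j]
        arr.[j] <- tmp
    arr

// Pick one element uniformly at random; an empty array is an error.
let randomPickWith (rng: Random) (items: 'a[]) : 'a =
    if Array.isEmpty items then invalidArg "items" "cannot pick from an empty array"
    items.[rng.Next(items.Length)]

let rng = Random 42
let services = [ "cart"; "pricing"; "inventory"; "checkout" ]
let shuffled = knuthShuffleWith rng services
printfn "Restart order: %A" shuffled
printfn "Picked: %s" (randomPickWith rng shuffled)
```

Shuffling before restarting means no service is consistently hit first, so each run exercises a different failure order.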
Has it helped?
Elasticsearch restart
Additional chaos finds
- Redis
- Checkpointing
If availability matters, you should be
testing for it.
Azure + F# + Chaos = <3
Chaos Engineering at Jet.com
Rachel Reese | @rachelreese | rachelree.se
Jet Technology | @JetTechnology | tech.jet.com
Nora Jones | @nora_js
