Chaos Engineering on Cloud Foundry
Chaos Engineering for
Cloud Foundry
Karun
Chennuri
Sr Engineer
Ramesh Krishnaram
Sr Manager
Who are we?
“Provide simple, secure
and scalable platform
services that is platform
and infrastructure-
agnostic.”
Platform Engineering
Agility (Re)-defined
36K+ Containers
100+ Project teams
700M Transactions/day
13+ Foundations
3000+ Applications
PCF
Faster Apps
More Frequent
Changes
Fewer Incidents
Zero Downtime
Deployments
Fail Fast & Fix-Fast
Elasticity
43% reduction
in application
response time
10x increase in planned
changes
83% fewer
incidents
Daytime changes
became the norm
67% faster
DevOps Culture Shift (Code it, Own it)
On-demand
infrastructure scaling
“The only thing that is constant is change. Learn to embrace it.
Failure is inevitable”
failure
Problem Statement
•How is this huge infrastructure holding together?
•What happens if something is broken?
•What if there is real chaos?
Let’s begin with a story…
Once there was a
king…
…who received a gift of two
magnificent falcons…
…the 2 beautiful
falcons were put
into training…
…months passed;
one of the
falcons was
flying…
But the other
remained on the branch…
…the King got upset…
The king
asked all
the wise
men, but
no one
could make
the bird
fly…
A few days
later the King
saw the
2nd falcon
flying…
King - “Bring me the
doer of this
miracle.”
Minister - “It’s a
local Farmer who
solved this
problem”
King to the
farmer -
“How did you
make the
falcon fly?”
Farmer - “It was
very easy, your
highness. I simply
cut the branch
where the bird was
sitting.”
Moral of the story…
Simple change made the bird
fly…
Simple change can disrupt our
systems…
Not all problems need a complex
solution…
Are our systems prepared for Chaos?
App Resiliency is a Key Tenet of Security
In app world, we conform to the familiar,
the comfortable, and the mundane. Let us
learn to destroy the branch of network
connections we cling to and free ourselves
to the glory of app resiliency!
Re-iterating Problem Statement
Chaos Engineering is the discipline of
experimenting on a distributed system in order
to build confidence in the system’s capability
to withstand turbulent conditions in
production.
Principles of Chaos Engineering, Netflix
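The definition above can be read as a loop: confirm the system looks healthy, inject a fault, check whether it still looks healthy, and always undo the fault. The following is a minimal sketch of that loop under stated assumptions; the `Experiment` class, the `run` function and the toy `state` dictionary are hypothetical names used only for illustration and are not part of Monarch or Turbulence++.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """One chaos experiment: a health check, a fault, and its undo step."""
    title: str
    steady_state: Callable[[], bool]  # True while the system looks healthy
    inject: Callable[[], None]        # turns the fault on
    rollback: Callable[[], None]      # turns the fault off again


def run(experiment: Experiment) -> str:
    """Run one experiment and report what it showed."""
    if not experiment.steady_state():
        return "aborted: system unhealthy before injection"
    experiment.inject()
    try:
        held = experiment.steady_state()
    finally:
        experiment.rollback()  # always undo the fault, even on errors
    return "hypothesis held" if held else "weakness found"


# Toy system (hypothetical): a service that can fall back when its database is down.
state = {"db_up": True, "has_fallback": True}


def healthy() -> bool:
    return state["db_up"] or state["has_fallback"]


def block_db() -> None:
    state["db_up"] = False


def unblock_db() -> None:
    state["db_up"] = True


result = run(Experiment("block database", healthy, block_db, unblock_db))
print(result)
```

Setting `has_fallback` to `False` in the toy state makes the same experiment report a weakness, which is the kind of finding a real experiment is meant to surface before production does.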
Re-iterating Problem Statement
Infrastructure
Monarch
Turbulence++
Application
Quest for Existing solutions
We didn’t want to re-invent
the wheel!
Journey & Tools explored…
Comparison matrix (support marks shown graphically on the slide)
Tools: Chaos Lemur, Gremlin, Turbulence, T-Mo CTK
Features compared: Kill VMs, Kill Process, Latency, CPU/Memory, App Knowledge
CTK – Chaos Toolkit
Monarch
Turbulence++
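As a rough illustration of how a Chaos Toolkit experiment can delegate a step to a driver, the sketch below builds an experiment document as a Python dictionary and serializes it to JSON. It loosely follows the experiment layout described in the Chaos Toolkit documentation (steady-state hypothesis, method, rollbacks); field names should be checked against that documentation. The URL, module path, function names and arguments are hypothetical placeholders, not the actual Monarch driver API.

```python
import json

# Hypothetical placeholders: URL, module, func and arguments are for
# illustration only and do not describe the real Monarch driver.
experiment = {
    "version": "1.0.0",
    "title": "Fortune UI stays available when its database is blocked",
    "description": "Block traffic from the app to MySQL and check that the UI still answers.",
    "steady-state-hypothesis": {
        "title": "UI responds with HTTP 200",
        "probes": [
            {
                "type": "probe",
                "name": "ui-responds",
                "tolerance": 200,
                "provider": {"type": "http", "url": "https://fortune-ui.example.com/"},
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "block-database-traffic",
            "provider": {
                "type": "python",
                "module": "example_driver.actions",
                "func": "block_service",
                "arguments": {"app": "fortune-service", "service": "mysql"},
            },
        }
    ],
    "rollbacks": [
        {
            "type": "action",
            "name": "unblock-database-traffic",
            "provider": {
                "type": "python",
                "module": "example_driver.actions",
                "func": "unblock_service",
                "arguments": {"app": "fortune-service", "service": "mysql"},
            },
        }
    ],
}

text = json.dumps(experiment, indent=2)
print(text)
```

Keeping the experiment as data means the same document could point at a different driver module without changing the surrounding tooling.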
Introducing Monarch & Turbulence++
Enables sophisticated failure
injection tests on BOSH-deployed
infrastructure and on the apps deployed
on that infrastructure.
Turbulence++
• Turbulence – API server, Agent
• Features:
• Kill VM
• Kill Process
• Pause Process
• Stress
• Disk Corrupt
• Control Network Delay
• Limit Bandwidth
• Re-ordering Packets
• Firewall
• Targeted Blocking
• Shutdown
• Block DNS
• Duplication
T-Mobile OSS Contribution
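Several of the network features listed above (delay, duplication) are commonly produced on Linux with the `tc` traffic-control tool and its `netem` queueing discipline. The sketch below only builds the command lines and does not run them; whether Turbulence++ uses exactly these commands is an assumption, and the helper names and the `eth0` device are hypothetical.

```python
from typing import List


def netem_delay(dev: str, delay_ms: int, jitter_ms: int = 0) -> List[str]:
    """Command that adds a fixed delay (with optional jitter) to a device."""
    cmd = ["tc", "qdisc", "add", "dev", dev, "root", "netem", "delay", f"{delay_ms}ms"]
    if jitter_ms:
        cmd.append(f"{jitter_ms}ms")
    return cmd


def netem_duplicate(dev: str, percent: int) -> List[str]:
    """Command that duplicates a percentage of outgoing packets."""
    return ["tc", "qdisc", "add", "dev", dev, "root", "netem", "duplicate", f"{percent}%"]


def netem_clear(dev: str) -> List[str]:
    """Command that removes the root qdisc, undoing the fault."""
    return ["tc", "qdisc", "del", "dev", dev, "root"]


# Print the commands instead of executing them; running them needs root on a Linux VM.
for command in (netem_delay("eth0", 100, 10), netem_duplicate("eth0", 5), netem_clear("eth0")):
    print(" ".join(command))
```

Pairing every fault command with its matching clear command keeps the rollback step explicit.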
Monarch
•Block general network traffic
•Block dependencies traffic
•Latency
•Bandwidth restriction
•Speed-test within hosting containers
•Crash random AIs
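The "crash random AIs" item can be pictured as choosing a random subset of an app's instance indexes. This is a minimal sketch under stated assumptions: the function name, the `fraction` parameter and the rule of always leaving one instance alive are hypothetical choices for illustration, not Monarch's documented behavior.

```python
import random
from typing import List


def pick_instances_to_crash(instance_count: int, fraction: float, rng: random.Random) -> List[int]:
    """Pick a random subset of instance indexes, always leaving at least one alive."""
    if instance_count <= 1:
        return []  # nothing to crash without taking the whole app down
    wanted = max(1, round(instance_count * fraction))
    wanted = min(wanted, instance_count - 1)  # assumption: keep one survivor
    return sorted(rng.sample(range(instance_count), wanted))


# A seeded generator makes the choice repeatable between runs of the same experiment.
picks = pick_instances_to_crash(6, 0.5, random.Random(42))
print(picks)
```

Passing the random generator in as a parameter lets a Game Day replay the exact same selection when an interesting failure needs to be reproduced.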
Infrastructure Level Chaos
Engineering
CELL CELL CELL CELL
REP REP REP REP
Go Router Go Router Go Router
Kill
Process
Kill VM
Latency
TURBULENCE ++
Demos – Infrastructure Level Chaos Attacks
1. Kill VM
2. Block SSH traffic to Diego Cell
3. Manipulate network traffic
Application Level Chaos Engineering
CELL CELL CELL CELL
REP REP REP REP
Go Router Go Router Go Router
Executor Executor Executor Executor
MySQL
Block
Traffic
Kill AI
MONARCH
Latency
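Blocking traffic from one app instance to a dependency such as MySQL can be expressed on Linux as a packet-filter rule scoped to that dependency's address and port. The sketch below builds `iptables` command lines without executing them; that Monarch applies rules in exactly this form is an assumption, and the helper name and the example address are hypothetical.

```python
from typing import List


def block_rule(dest_ip: str, dest_port: int, remove: bool = False) -> List[str]:
    """Build an iptables command that drops outgoing TCP traffic to one dependency.

    With remove=True the same rule is deleted, which restores the traffic.
    """
    flag = "-D" if remove else "-A"
    return [
        "iptables", flag, "OUTPUT",
        "-p", "tcp",
        "-d", dest_ip,
        "--dport", str(dest_port),
        "-j", "DROP",
    ]


# 192.0.2.10 is a documentation-only address standing in for the MySQL host.
print(" ".join(block_rule("192.0.2.10", 3306)))
print(" ".join(block_rule("192.0.2.10", 3306, remove=True)))
```

Because the rule names a single destination, other traffic from the same container is left alone, which matches the idea of a targeted, single-app attack.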
Cascading effect
Ref: https://github.com/michaelgruczel/microservice-architecture-by-example
Service 2
Service 1
3rd Party
Web App
Client
Database
Timeout
Demo Setup
MySQL
Fortune
UI
Fortune
Service
Eureka
Service Discovery
Hystrix
Circuit Breaker
Latency
Config Server
Kill
Service
Block
Service
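The demo setup relies on Hystrix for circuit breaking. To show the general idea in a few lines, here is a minimal circuit-breaker sketch in Python; it is not Hystrix and omits most of what Hystrix does (thread pools, metrics, rolling windows). The class, parameter names and the fortune strings are hypothetical.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calling a failing dependency for a while and serves a fallback instead."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], str], fallback: Callable[[], str]) -> str:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()      # open: skip the dependency entirely
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0              # success closes the circuit again
        return result


# Simulated blocked service with a fake clock so the example runs instantly.
now = [0.0]
attempts = []


def blocked_service() -> str:
    attempts.append(1)
    raise TimeoutError("traffic blocked")


breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])
answers = [breaker.call(blocked_service, lambda: "cached fortune") for _ in range(5)]
now[0] = 11.0  # the reset window has passed and the block has been lifted
recovered = breaker.call(lambda: "fresh fortune", lambda: "cached fortune")
print(answers, len(attempts), recovered)
```

In this run the blocked service is only attempted twice out of five calls; the remaining calls are answered from the fallback without waiting on the failing dependency.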
Demos – App Level Chaos Engineering
•Attack on Spring Boot + Spring Cloud Services +
Database Services
•Block incoming traffic to app (introduce app
latency)
•Crash random app instances
Limitations/Improvements/Future
•Monarch – limited to attacking one cluster at a time
•Turbulence++ and Monarch are 2 different tools –
possibility of merging them into one solution
•Istio/Envoy
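One way to work within the one-cluster-at-a-time limitation is to group the targeted apps by cluster and then run the attacks cluster by cluster. The sketch below shows only that grouping step; the function name and the cluster and app names are hypothetical, and this is not a description of how Monarch is or will be implemented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def plan_by_cluster(targets: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (cluster, app) pairs so each cluster can be attacked in its own pass."""
    plan: Dict[str, List[str]] = defaultdict(list)
    for cluster, app in targets:
        plan[cluster].append(app)
    return dict(plan)


# Hypothetical selection: 3 of 5 apps, spread over different clusters.
targets = [("cluster-a", "billing"), ("cluster-c", "orders"), ("cluster-a", "catalog")]
plan = plan_by_cluster(targets)
for cluster, apps in plan.items():
    print(f"pass on {cluster}: {', '.join(apps)}")
```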
Game
Days!
Ref: http://funnypicture.org/funny-cat-games-27-cool-hd-wallpaper.html#.W6uZ_2hKiUk
Then
Now
Then & Now
Thank You
Karun.Chennuri1@T-Mobile.com
Ramesh.Vaithiyamkrishnaram1@T-Mobile.com


Editor's Notes

  • #4: Ramesh: So let’s talk about us. Who are we? We are a group of engineers who fondly call ourselves agents of chaos. As agents of chaos, we like to radically transform the complexities involved in deploying software to the cloud; we have done this by delivering a platform that is simple, secure and scalable to use. Our goal is for our application workloads to be able to run from anywhere, anyhow. IT is now all about as-a-service, which means customers expect agility. And this varies broadly with the level of abstraction you choose. We are a team focused predominantly on delivering services for CaaS, PaaS and FaaS (future).
  • #5: Ramesh: PCF was launched at T-Mobile in early 2016, and you can quickly see how we have graduated over the last 24 months. A number of T-Mo business-critical applications (customer-facing or middleware) run on PCF. Still not convinced? In that case, let me tell you that as of this minute we have roughly 30K+ containers and 900 active users in the PCF community at T-Mo. And just in FY 2018, we have scaled out our PCF foundations from 2 to 10+. If that does not cut it, let me tell you that since we moved a number of apps to a microservice SOA, we have shorter and fewer incidents and faster apps! And guess what, on top of this we have seen an increase in the number of changes made to these services, a vast majority of these being daytime changes.
  • #6: Ramesh: We are engineers, we write services. A simple web app has a client making a connection to a server; the server talks to a backend dependency, determines what needs to be rendered to the client and responds. But that’s one app, and an SOA has thousands of these microservices. Just like how we share the world, they share an ecosystem that is complex and vulnerable to attacks. Ramesh: I like to call this death star diagram the microservice explosion, a common theme. In summary, when we design services, we make assumptions. Assumptions go wrong or are never validated. A few common fallacies in distributed systems: the network is reliable; latency is zero; bandwidth is infinite; compute resources are infinite; the network is secure; topology doesn't change; there is one administrator; transport cost is zero; the network is homogeneous. Chaos Engineering focuses on building confidence in your system by validating known recovery paths. When a recovery path fails, you get an opportunity to look at the results and fix why it failed. Ramesh: Before we move on I want to highlight that T-Mobile is not one of the mom-and-pop telecom companies out there. We are the Un-carrier. We care about our customers, so we want to build stuff that’s simple, secure and scalable. And this is not possible until you acknowledge that the only thing that is constant is failure. Learn to embrace it; failure is inevitable.
  • #7: Ramesh: I just wonder how all this stuff is holding together and what would happen with chaos?
  • #8: Karun: Ramesh, before I jump into the solution design, let me tell you a story… Ramesh: Sure https://coaching-journey.com/coaching-story-falcon-branch/ http://www.srikumar.com/family/moral_inspirational_stories/story-of-two-birds.htm Story: Once there was a king who received a gift of two magnificent falcons (birds). They were peregrine falcons, the most beautiful birds he had ever seen. He gave the precious birds to his head falconer to be trained. Months passed, and one day the head falconer informed the king that though one of the falcons was flying majestically, soaring high in the sky, the other bird had not moved from its branch since the day it had arrived. The king summoned healers and sorcerers from all the land to tend to the falcon, but no one could make the bird fly. He presented the task to a member of his court, but the next day, the king saw through the palace window that the bird had still not moved from its perch. Having tried everything else, the king thought to himself, “Maybe I need someone more familiar with the countryside to understand the nature of this problem.” So he cried out to his court, “Go and get a farmer.” In the morning, the king was thrilled to see the falcon soaring high above the palace gardens. He said to his court, “Bring me the doer of this miracle.” The court quickly located the farmer, who came and stood before the king. The king asked him, “How did you make the falcon fly?” With his head bowed, the farmer said to the king, “It was very easy, your highness. I simply cut the branch where the bird was sitting.” We are all made to fly, to realize our incredible potential as human beings. But at times we sit on our branches, clinging to the things that are familiar to us. The possibilities are endless, but for most of us, they remain undiscovered. We conform to the familiar, the comfortable, and the mundane. So for the most part, our lives are mediocre instead of exciting, thrilling and fulfilling. Let us learn to destroy the branch of fear we cling to and free ourselves to the glory of flight! Image Sources: https://www.whats-your-sign.com/falcon-animal-totem.html http://www.animatedimages.org/cat-falcons-1180.htm http://www.animatedimages.org/data/media/1180/animated-falcon-image-0002.gif
  • #9: Karun: Once there was a king…
  • #10: Karun to say…
  • #11: Karun to say…
  • #12: Karun to say…
  • #13: Karun to say…
  • #14: Karun to say…
  • #15: Karun to say…
  • #16: Karun to say…
  • #17: Karun to say…
  • #18: Karun to say…
  • #19: Karun to say…
  • #22: Karun: So Ramesh, what do you think? Do you have anything to add? Ramesh: Yes. In the app world, we conform to the familiar, the comfortable, and the mundane. Let us learn to destroy the branch of network connections we cling to and free ourselves to the glory of app resiliency!
  • #23: Karun: You are right, Ramesh. Just to re-iterate the problem statement, here is the well-known definition of Chaos Engineering.
  • #24: Ramesh: I like the way we are progressing. Can you please help me understand the best way to tackle chaos? Karun: Sure, Ramesh. T-Mobile is not a single-application company. We have 3000+ applications belonging to multiple internal customers running on a shared foundation. Performing a chaos attack at the infrastructure level impacts multiple customers, and I know how customer-obsessed you are based on the initial slides; hence we came up with application-level chaos attacks, which target a specific app and its dependencies running in a cluster. Ramesh: So what you are saying is that running Turbulence on infrastructure may impact multiple customers, whereas Monarch impacts only one customer! Karun: Exactly!
  • #25: Ramesh: How did we arrive at Turbulence & Monarch? Karun: We definitely didn’t want to reinvent the wheel, so we started looking at existing solutions in the open-source and commercial world.
  • #26: Only certain features were taken for comparison for now. Karun: We started with Chaos Lemur, but it only addresses killing VMs. Gremlin is a commercial offering with its control plane offered as SaaS, which means one less piece of software for the Ops team to manage. It’s a good option that comes at a cost. Gremlin can run as a process as well as in a container. We deployed Gremlin as a runtime config on one of our test foundations. Enter Chaos Toolkit, a nice little framework that orchestrates solutions like Gremlin, Turbulence and AWS, all at the same time. Its driver-based architecture helped us build a new capability that knows how to interact with an app instance running in the cluster.
  • #27: Karun to say…
  • #28: Karun to say…
  • #29: Karun to say… Ramesh: This is great. I now understand the capabilities of Turbulence++ and Monarch. Can I see a demo? Karun: Sure. Before that, let’s get into some more details…
  • #31: Karun: Cloud Foundry is a cluster of multiple VMs. All the application instances run as containers in these VMs. Imagine: what if a VM goes down? “Rep” is a process in a Diego cell responsible for managing the lifecycle of the containers running on that Diego cell. What if the ‘rep’ process goes down? The system is designed around timeouts; what if latency is injected between the GoRouter and a Diego cell, simulating timeout failures? Turbulence++ helps to simulate all these failures; it’s built on top of open-source Turbulence.
  • #32: Karun: To understand this we will go through 3 demos.
  • #34: Karun: Let’s look at application-level chaos attacks. Like I said earlier, this is a single targeted attack on an app and its dependencies, which won’t impact other apps running in the same Diego cell or foundation. Here you have an AI running in a Diego cell, and it depends on MySQL. What if traffic between the AI and MySQL is blocked? What if the AI crashes?
  • #35: Karun: There is a famous metaphor that explains Chaos Theory: a butterfly flying in Brazil could lead to hurricanes in the Texas area! Karun: No matter how well we design, no matter if we follow 12-factor design patterns, in the real world, as in this case, the Weather service depends on a 3rd party which, if it goes down, would cause the Concert app to fail, and thus eventually the web app fails. A couple of questions to keep in mind: How do we verify the app’s behavior if the 3rd party goes offline? What if the Concert database goes offline? What if the Weather microservice misbehaves? Ramesh: Why can’t we use a Hystrix circuit breaker for the Weather service? Karun: Yes we can… and should, in fact. Having something like the Monarch blocker programmed to run at regular intervals will simulate cascading failures at every interval and thus generate work for the circuit breaker…
  • #38: Ramesh: This is all great, Karun. Are there any known limitations in our solution design? Karun: Yes, Ramesh. What if 5 apps are deployed on 5 clusters and we want to attack 3 out of the 5 apps?
  • #39: Karun: Would you like to share our experience with Game Days and how Monarch helped? Ramesh: We are still taking baby steps and putting Monarch under stress with real-time projects. We did a 2-hour Game Day with one of the production applications and made about 6 recommendations based on our experiments.
  • #40: Karun: Clearly, we believe we’ve moved from the Joker’s analogy on chaos (2008) to Game of Thrones’ chaos analogy (2017). According to the latter, chaos isn't a pit. Chaos is a ladder. Many who try to climb it fail, never to try again. The fall breaks them. And some, given a chance to climb, refuse. They cling to the realm, or the gods, or love, the illusions. Only the ladder is real. The climb is all there is… Leaving it here to let you decide…