Chaos Engineering for PCF
Ramesh Krishnaram: Sr. Engineering Manager
Karun Chennuri: Sr. Software Engineer
PLATFORM ENGINEERING
Unless otherwise indicated, these slides are © 2013-2018 Pivotal Software, Inc. and licensed under a Creative Commons
Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
WHO ARE WE?
PLATFORM ENGINEERING
“Provide simple, secure and scalable platform services that are platform- and infrastructure-agnostic.”
Abstraction levels: FaaS, PaaS, CaaS, IaaS
Trade-offs shown: greater flexibility and less conformance to standards versus lower dev complexity and greater operational efficiency
Here is the BIG Deal…
“The only thing that is constant is failure. Learn to embrace it. Failure is inevitable.”
Problem Statement
• Platform Failures
• Application Failures
Ref: https://twitter.com/fiberstore/status/549826256338825216
After a few busy weeks…
Journey & Tools we explored…
Tools compared: Chaos Lemur, Gremlin, Turbulence, T-Mo CTK
Features compared for each tool: Kill VMs, Kill Process, Latency, CPU/Memory, App Knowledge
Note: App knowledge in Gremlin seems to be on the roadmap and may be available in future versions.
CTK – Chaos Toolkit
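The speaker notes for this slide mention that Chaos Toolkit experiments are JSON documents that must follow a specific grammar. As a rough sketch of what that can look like, the Python below assembles one experiment as a dict and serializes it. The overall layout (a steady-state hypothesis with probes, a method made of actions, and rollbacks) follows the publicly documented Chaos Toolkit experiment format; the set of keys treated as required, the module `example_cf_blocker.actions`, the function `block_services`, the org, space and app names, and the URL are all hypothetical placeholders rather than the names used by T-Mobile's driver.

```python
import json

# Top-level keys this sketch treats as mandatory (an assumption for illustration).
REQUIRED_KEYS = ("version", "title", "description", "method")


def build_experiment(app_url: str) -> dict:
    """Return a Chaos Toolkit style experiment as a plain dict."""
    return {
        "version": "1.0.0",
        "title": "Concert app tolerates losing its bound services",
        "description": "Block traffic to bound services and check the app still answers.",
        "steady-state-hypothesis": {
            "title": "App responds with HTTP 200",
            "probes": [
                {
                    "type": "probe",
                    "name": "app-is-healthy",
                    "tolerance": 200,
                    "provider": {"type": "http", "url": app_url},
                }
            ],
        },
        "method": [
            {
                "type": "action",
                "name": "block-bound-services",
                "provider": {
                    "type": "python",
                    # Hypothetical module and function, for illustration only.
                    "module": "example_cf_blocker.actions",
                    "func": "block_services",
                    "arguments": {
                        "org": "demo-org",
                        "space": "demo-space",
                        "app": "concert",
                    },
                },
            }
        ],
        "rollbacks": [],
    }


def missing_keys(experiment: dict) -> list:
    """List the required top-level keys that are absent from the experiment."""
    return [key for key in REQUIRED_KEYS if key not in experiment]


experiment = build_experiment("https://concert.example.com/health")
experiment_json = json.dumps(experiment, indent=2)
```

Keeping the experiment as data makes it easy to generate one file per app and to lint it before handing it to the `chaos run` command.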
Chaos Engineering:
Platform/Infrastructure
Simulating Failures in PCF
Turbulence:
Features:
• Kill VM
• Kill Process
• Pause Process
• Stress
• Disk Corrupt
• Control Network Delay
• Limit Bandwidth
• Re-ordering Packets
• Firewall
• Targeted Blocking
• Shutdown
• Block DNS
• Duplication
• Api-server
• Agent
T-Mobile OSS contribution
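Turbulence attacks are submitted to its api-server as an "incident" that names the tasks to run and selects the VMs to hit. The sketch below builds such a request body for the Pause Process feature listed above. It is only an illustration: the field names (`Tasks`, `Type`, `Selector`, `Deployment`, `Group`), the task type string `pause-process`, and the `ProcessName` and `Timeout` options are assumptions modeled on the general shape of Turbulence incidents, not a verified schema, so check them against the release you deploy.

```python
import json


def pause_process_incident(deployment: str, group: str, process: str, timeout: str) -> dict:
    """Build an incident body that pauses one process on the selected VMs.

    All field names and the "pause-process" type are assumed for illustration.
    """
    return {
        "Tasks": [
            {"Type": "pause-process", "ProcessName": process, "Timeout": timeout}
        ],
        "Selector": {
            "Deployment": {"Name": deployment},
            "Group": {"Name": group},
        },
    }


# Example: pause sshd on Diego cells of a deployment named "cf" for ten minutes.
incident = pause_process_incident("cf", "diego-cell", "sshd", "10m")
incident_json = json.dumps(incident, indent=2)
```

Building the body as plain data keeps the sketch free of network calls; posting it to the api-server is left out because the endpoint and credentials depend on the deployment.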
Demo 1:
Add-ons to Turbulence
Chaos Engineering:
Applications
Somewhere in the ops world…
My app isn’t picking up the latest configuration
My app isn’t connecting to Cassandra
My app works locally but not on PCF!
WTF with the Platform?
My app was working well till yesterday, but not today!
Cascading effect
Ref: https://github.com/michaelgruczel/microservice-architecture-by-example
Diagram labels: Client, Web App, Concert, Weather, 3rd Party, Database, Timeout
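The speaker notes for this slide describe a timeout at the 3rd party propagating through Weather and Concert up to the web app, and recommend a circuit breaker on the Weather service. The following is a minimal, generic circuit breaker in Python to make the idea concrete. It is not Hystrix and does not reproduce its API; the class name, thresholds and the `weather_from_third_party` and `cached_weather` helpers are invented for this sketch.

```python
import time


class CircuitBreaker:
    """Open after N consecutive failures, then retry after a cool-down."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        # While open, skip the dependency until the cool-down has elapsed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


attempts = []  # records every real call made to the failing dependency


def weather_from_third_party():
    attempts.append("call")
    raise TimeoutError("3rd party did not answer")


def cached_weather():
    return "weather unavailable"


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
results = [breaker.call(weather_from_third_party, cached_weather) for _ in range(3)]
```

With two failures allowed, the third request is answered from the fallback without touching the 3rd party at all, which is what stops the timeout from cascading to Concert and the web app.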
CTK CF BLOCKER
CTK CF Blocker:
• Target specific CF apps
• Discovers
  • Application hosts
  • Bound services
  • Service instances
• Block all traffic to
  • App instances
  • Bound services
Diagram labels: Diego Cell (Weather, Concert), Config Server, Eureka Service Discovery, Hystrix Circuit Breaker, Cloud Controller, UAA, Git Repo, Message Brokers (RMQ, Kafka, JMX), Database, Go Router, CredHub
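According to the speaker notes, the blocker does not kill the database or the service; it only blocks traffic from the targeted app to it, using iptables rules. A hedged sketch of how such rules could be generated is shown below. The `iptables` flags used (`-I`, `-D`, `-s`, `-d`, `-p`, `--dport`, `-j DROP`) are standard, but the choice of the `FORWARD` chain, the example IP addresses, and the function names are assumptions for illustration, not the tool's actual implementation. Nothing is executed; the functions only return argument lists.

```python
def block_rules(container_ip, targets, chain="FORWARD"):
    """Return iptables argument lists that drop TCP traffic from one app
    container to each (host, port) target. Nothing is executed here."""
    rules = []
    for host, port in targets:
        rules.append([
            "iptables", "-I", chain, "1",
            "-s", container_ip, "-d", host,
            "-p", "tcp", "--dport", str(port),
            "-j", "DROP",
        ])
    return rules


def unblock_rules(rules):
    """Turn each insert rule into the matching delete rule for rollback."""
    return [[rule[0], "-D", rule[2]] + rule[4:] for rule in rules]


# Example: cut one app container off from a Cassandra node (addresses are made up).
rules = block_rules("10.255.0.5", [("10.0.8.20", 9042)])
block_cmd = " ".join(rules[0])
unblock_cmd = " ".join(unblock_rules(rules)[0])
```

Matching on the container's source address is what keeps the attack targeted: other apps on the same Diego cell, and other services using the same database, keep their access.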
Demo 2:
CTK CF Blocker
Upstream Contribution
Demo Videos:
• Platform Chaos Attack: https://www.youtube.com/watch?v=9jt8Qq6RTN8
• CF App Blocker Attack: https://www.youtube.com/watch?v=ewtzyZdb67o
https://opensource.t-mobile.com
Turbulence Release PR: https://github.com/cppforlife/turbulence-release/pull/25
Next Steps…
Ref: http://funnypicture.org/funny-cat-games-27-cool-hd-wallpaper.html#.W6uZ_2hKiUk
Team
PLATFORM ENGINEERING
LET US KNOW HOW YOU FEEL ABOUT THIS SESSION.
TAKE THE SURVEY ON THE MOBILE APP!
Ramesh.Vaithiyamkrishnaram1@T-Mobile.com,
Karun.Chennuri1@T-Mobile.com
#springone @s1p


Editor's Notes

  • #2: “JOKER: Introduce a little anarchy. Upset the established order and everything becomes chaos. I'm an agent of chaos. Oh, and you know the thing about chaos?”
    Ramesh: Hey Karun, I recently saw the movie The Dark Knight, and I really enjoyed the Joker’s interpretation of chaos. So, do you know the thing about chaos, Karun?
    Karun: Apart from what the Joker said, I have been reading about a famous metaphor that explains chaos theory: how “butterfly wings in Brazil could ultimately cause a hurricane in Texas.”
    Ramesh: And?
    Karun: I think we should say “Pre-emptive chaos attack on butterflies!” Not really, I’m just joking. You look like you have something to say. What is it, and how can I help?
    Ramesh: I am trying to draw an analogy here. A tiny butterfly can have a huge impact on the environment, and so can a bug or failure in a system on a company and its revenue. So, let’s get started…
  • #3:
    Ramesh: So let’s talk about us. Who are we? We are a group of engineers who fondly call ourselves agents of chaos.
    Ramesh: That’s right. What I mean by agents of chaos is that we like to radically transform the complexities involved in deploying software to the cloud. We have done this by delivering a platform that is simple, secure and scalable to use. Our goal is for our application workloads to be able to run anywhere, anyhow. IT is now all about as-a-service, which means customers expect agility, and this varies broadly with the level of abstraction you choose. We are a team focused predominantly on delivering services for CaaS, PaaS and (in the future) FaaS.
    Karun: Ramesh, what’s the big deal? Almost every company has this, right?
    Ramesh: Big deal? Here we go… <Take to next slide… talk about metrics>
  • #4:
    Ramesh: So Karun, you asked “what’s the big deal?” Let me use data to answer that. PCF was launched at T-Mobile in early 2016, and you can quickly see how we have grown over the last 24 months. A number of T-Mo business-critical workloads (customer-facing or middleware) run on PCF. Still not convinced? Then let me tell you that, as of this minute, we have roughly 30K+ containers and 900 active users in the PCF community at T-Mo. In FY 2018 alone, we scaled out our PCF foundations from 2 to 10+. And if that does not cut it: since we moved a number of apps to a microservice SOA, we have had shorter and fewer incidents and faster apps! On top of this, we have seen an increase in the number of changes made to these services, the vast majority of them daytime changes.
    Karun: Alright, I get it. Where are we going with this, and what is your problem statement?
  • #5:
    Ramesh: What is this?
    Karun: Don’t know. But it looks like abstract chaos.
    Ramesh: What is this one?
    Karun: Blue chaos?
    Ramesh: What is this one?
    Karun: Green and incomplete chaos?
    Ramesh: You are right to an extent, but let me clarify. We are engineers; we write services. In a simple web app, a client connects to a server, the server talks to a backend dependency, determines what needs to be rendered, and responds to the client. But that is one app. An SOA has thousands of these microservices and, just as we share the world, they share an ecosystem that is complex and vulnerable to attacks.
    Karun: Really? What kind of attacks are these?
    Ramesh: I like to call this death star diagram a microservice explosion, a common theme. In summary, when we design services we make assumptions, and those assumptions go wrong or are never validated. A few common fallacies of distributed systems:
      • The network is reliable
      • Latency is zero
      • Bandwidth is infinite
      • Infinite compute resources
      • The network is secure
      • Topology doesn't change
      • There is one administrator
      • Transport cost is zero
      • The network is homogeneous
    Chaos Engineering focuses on building confidence in your system by validating known recovery paths. When a recovery path fails, you get an opportunity to look at the results and fix why it failed.
    Ramesh: Before we move on, I want to highlight that T-Mobile is not one of the mom-and-pop telecom companies out there. We are the Un-carrier. We care about our customers, so we want to build stuff that is simple, secure and scalable. That is not possible until you acknowledge that the only thing that is constant is failure. Learn to embrace it; failure is inevitable.
  • #6:
    Ramesh: This is certainly not a T-Mobile datacenter, nor is it Ramesh and I. Chaos Engineering is the practice of injecting plausible real-world failures or load that could disrupt the system, with the goal of finding potential issues before they happen naturally, so that the system’s resilience can be improved. Think of Chaos Engineering as a fire drill: you run the drill occasionally to validate your recovery path, and when the drill fails you fix it, so that when an actual catastrophe happens there is no room for failure in your escape route. At T-Mobile we started with the two challenges below:
      • Platform: hardware failures, service failures, network connectivity and connection quality issues, and limited resources (CPU/Memory/Disk).
      • Application: failure of application build dependencies and random failures of application dependencies.
    Karun: Hey, I know what you are saying… We are not a single-application company! We have many independent apps that do not depend on each other. As said in an earlier slide, we have about 4k applications running on our platform, sharing the same resources and underlying infrastructure. Platform-level attacks impact several apps, which is not what we want. We want more targeted attack simulations that hit specific apps running in an org and space without affecting other apps that run on the same hardware, org and space and use the same shared instance.
    Ramesh: My question to you, Karun: is it doable? I want our team not to reinvent the wheel but to evaluate existing tools and make a proposal for how we can deliver a tool-as-a-service with which we can build a better platform and deliver more resilient applications.
    Karun: Well, I hope so. I can get back to you with my research.
  • #8: Only certain features were taken for comparison for now.
    Ramesh: Why Gremlin? Isn’t it a commercial offering?
    Karun: Gremlin is a commercial offering with its control plane offered as SaaS, which means one less piece of software for the Ops team to manage. It’s a good option that comes at a cost. Gremlin can run as a process as well as in a container. We deployed Gremlin as a runtime config on one of our test foundations. Gremlin falls short on app knowledge, but from our recent interaction with Mr. Kolton, founder of Gremlin, it looks like they are building an app knowledge capability.
    Ramesh: Can Turbulence replace Gremlin?
    Karun: It’s unfair to compare commercial Gremlin with open-source Turbulence. The original author, ‘cppforlife’ (I hope he is at this conference), has put together a Go package that deploys the Turbulence api-server and an agent on each of the VMs. Again, Turbulence falls short on the app knowledge aspect. Enhancing Turbulence no doubt gives better control, but Gremlin has one advantage, especially in T-Mobile’s case: since we have both K8s and PCF in our infrastructure, we can have a single control plane to plan our attacks for PCF and K8s. Turbulence, on the other hand, is PCF only. Enter Chaos Toolkit, a nice little framework that orchestrates solutions like Gremlin, Turbulence and AWS, all at the same time. Its driver-based architecture helped us build a new capability that knows how to interact with an app instance running in the cluster. Experiments are written in JSON and must comply with a specific grammar.
  • #10:
    Karun: This is a typical PCF component diagram. Each component, or a combination of them, is a single VM or multiple processes within a VM. Take a high-level look at the different arrows and imagine an interaction going wrong here, which might have a cascading effect. Now, how do we simulate these? The good news is that Turbulence already does some of the basic stuff. We added a bunch of new features that help you perform more serious attack simulations that are closer to real-world attacks. For example: imagine you have autoscaling ON for an app. Via Turbulence, bring down Cloud Controller (CC) for some interval of time. Autoscaling (AS) queries CC every 30 seconds to get app stats; since CC is down and AS doesn’t have the app stat metrics, AS fails and never scales the app. At this point, introduce a heavy spike in traffic and see what happens to your app. Also imagine: what if an existing Diego cell hosting multiple app containers goes down?
    Ramesh: Are we going to demo existing features of Turbulence?
    Karun: Certainly not. We will show how to pause a process, say ssh, in a Diego cell. That will be the first demo tonight.
  • #13:
    Karun: Before we jump into our next demo or talk about App Chaos Engineering, can we talk a bit about the Ops world?
    Ramesh: Sure… Next slide…
  • #14:
    Karun: Hey Ramesh, what do we hear from our customers in day-to-day ops?
    Ramesh: Of course, we are a service team. So when stuff doesn’t work, the first thing you hear is “It’s those platform guys”, and when it’s not us, the next thing you hear is “It’s the network team”. Let’s talk about a few examples.
    Karun: My app isn’t picking up the latest configuration…
    Ramesh: When bad karma hits you back, there is not much anyone can do. Even apps don’t listen to you.
    Karun: My app isn’t connecting to the Cassandra cluster.
    Ramesh: Why would it, when the cluster was decommissioned 2 weeks ago!
    Karun: Oh wow!
    Karun: My app works locally but not on PCF!
    Ramesh: Well, customer misbehavior. Blocked them on PCF forever.
    Karun: Oh, that’s fair!
    Karun: My app was working well till yesterday but not today! How about that?
    Ramesh: Outstanding payments due! But jokes apart, folks, we like calling ourselves enablers. What I mean by that is: we built a platform for the community to use. We onboard customers and get out of the way, trusting that they will do the right things within their app architecture. But that’s not always the case, and our customers encounter problems that boil down to an app architecture issue or a cloud anti-pattern. What we want to do now is be enablers and guardians, meaning we provide a self-service mechanism with which you can find loopholes in your app or deployment. The question is: how can we empower our developers?
    Karun: Awesome. So here comes CF App Blocker, a new CTK add-on!
    Ramesh: Do we really need CTK CF Blocker when you have the Hystrix circuit breaker?
    Karun: Yes, we still need it. Not all deployed apps are Spring apps. The Hystrix circuit breaker implements a design pattern for making apps fault tolerant. However, not all technologies have an implementation of this pattern. We saw it in Java apps and Python apps, but we have apps using stacks other than these two. Also, CF Blocker complements these design patterns: if an app is bound to a Hystrix circuit breaker, running CTK CF Blocker on the app can help with failure test cases…
    Ramesh: Not sure I get that. Please explain more…
  • #15:
    Karun: No matter how well we design, and no matter whether we follow 12-factor design patterns, in the real world, as in this case, the Weather service depends on a 3rd party. If that goes down, the Concert app fails, and eventually the web app fails. A couple of questions to keep in mind:
      • How do we verify the app’s behavior if the 3rd party goes offline?
      • What if the Concert database goes offline?
      • What if the Weather microservice misbehaves?
    Ramesh: Why can’t we use the Hystrix circuit breaker for the Weather service?
    Karun: Yes, we can, and in fact we should. Having something like CF Blocker programmed to run at regular intervals will simulate cascading failures each time and thus generate work for the circuit breaker…
  • #16:
    Karun: Here is a more accurate picture of how microservices and Spring apps interact within PCF. You can see that Config Server depends on the Git repo, and the services depend on Spring Cloud Services (SCS), which include Service Registry and Circuit Breaker. Here both Weather and Concert are bound to SCS (internal services of PCF) and to message brokers and databases that are external. Questions to address:
      • How to target specific services bound to the app
      • How to disable traffic to an app
      • How to block traffic from a service to a backend database, yet allow access from another service
    Note the difference: we are not killing the database here, which could eventually impact other services; we are only blocking traffic from the app to the database. We do that via iptables rules.
  • #19: Ramesh: Do more OSS.
  • #20: Ramesh: So what’s next? Our high-level goal is to build confidence in our services by running gamedays (targeted failure attacks). And finally, we are big on contributions to the community, so we will continue to push our work out into the OSS community.
  • #21: Team photograph