Future with Zero Down-Time:
End-to-end resiliency with chaos engineering and lens of observability
S Vinod Kumar
Event Streaming Platform Team
Fidelity Investments
Fishing for Errors
Increase Linger
The first step in solving a problem is to identify the problem
Throughput
Debugging
Kafka
Horizontal Broker Scaling
Memory
RAM
CPU
This Time It's Happening!!!???
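The slide pairs "Increase Linger" with "Throughput". As a rough illustration of why the two are related, the sketch below models a producer that holds a batch open for `linger.ms` after its first record arrives. `linger.ms` is a real Kafka producer setting, but the function name, the arrival times, and the model itself are assumptions made for illustration: a real producer also sends when `batch.size` fills, and it can still batch under load with `linger.ms=0`.

```python
def batch_by_linger(arrival_ms, linger_ms):
    """Group record arrival times into send batches under a simplified linger model."""
    batches = []
    current = []
    for t in sorted(arrival_ms):
        # A batch stays open for linger_ms after its first record arrives.
        if current and t - current[0] <= linger_ms:
            current.append(t)
        else:
            if current:
                batches.append(current)
            current = [t]
    if current:
        batches.append(current)
    return batches


# Hypothetical arrival times in milliseconds.
arrivals = [0, 2, 4, 6, 8, 30, 31, 33]
for linger in (0, 5, 20):
    sizes = [len(b) for b in batch_by_linger(arrivals, linger)]
    print(f"linger.ms={linger}: {len(sizes)} requests, batch sizes {sizes}")
```

In this toy model a longer linger means fewer, larger requests for the same records, at the cost of added per-record delay.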
Client-Side Observability
Enhanced Monitoring on Kafka
Kafka Health Summary
System Metrics
Latency Metrics
Throughput Metrics
Producers, Consumers
Producers, Consumers
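The slide lists latency and throughput metrics for producers and consumers. A minimal sketch of how a client might derive such numbers from its own send and acknowledgement timestamps follows. The sample format, function names, and values are hypothetical, and the deck does not say which tooling the team actually uses.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples):
    """Summarize (send_ms, ack_ms, size_bytes) tuples recorded by one client.

    Assumes at least one sample and a measurement window longer than zero.
    """
    latencies = [ack - send for send, ack, _ in samples]
    window_ms = max(ack for _, ack, _ in samples) - min(send for send, _, _ in samples)
    total_bytes = sum(size for _, _, size in samples)
    return {
        "p50_latency_ms": percentile(latencies, 50),
        "p99_latency_ms": percentile(latencies, 99),
        "throughput_bytes_per_s": total_bytes / (window_ms / 1000),
    }


# Hypothetical producer samples: (send time, ack time, payload size).
samples = [(0, 12, 1000), (100, 109, 1000), (200, 215, 1000),
           (300, 380, 1000), (400, 500, 1000)]
print(summarize(samples))
```

Tracking a high percentile alongside the median matters here because a few slow acknowledgements can hide behind a healthy-looking average.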
Streaming Platform – Resiliency Test Framework
Producers
Consumers
Breakdown
Normality
Self-healing
Recovery
Prod Release
Optimize
Analyze
Test
App Team / Developers
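The framework slide names the phases Normality, Breakdown, Self-healing, and Recovery. One way the "Analyze" step could quantify them is to measure how long client throughput stays below a baseline during a test. The sketch below assumes a per-second throughput series and a 20% tolerance; both the threshold and the function are hypothetical, not taken from the talk.

```python
def recovery_window(throughput, baseline, tolerance=0.2):
    """Return (breakdown_index, recovery_index) for a per-second throughput series.

    A sample counts as degraded when it falls more than `tolerance` below baseline.
    Either index is None when that event never occurs in the series.
    """
    floor = baseline * (1 - tolerance)
    breakdown = next((i for i, v in enumerate(throughput) if v < floor), None)
    if breakdown is None:
        return None, None  # the series never left normality
    recovery = next(
        (i for i in range(breakdown + 1, len(throughput)) if throughput[i] >= floor),
        None,
    )
    return breakdown, recovery


# Hypothetical messages-per-second readings taken during a fault injection.
series = [100, 98, 101, 40, 10, 35, 70, 95, 99]
start, end = recovery_window(series, baseline=100)
print(f"breakdown at t={start}s, recovered at t={end}s, degraded for {end - start}s")
```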
Chaos Mesh / AWS FIS
CPU/Memory/I/O Stress
Network Latency & Jitter
Single Broker Failure
Single AZ Down
Scheduled Chaos
Grey Failures
Hard Failures
[Diagram: nine Kafka Brokers and three sets of Kafka Replicator, Schema, and Connect across AZ1, AZ2, and AZ3]
Network Outage
Large Network Latency
Multi Broker Down
Region Down
[Diagram: the same layout of Kafka Brokers, Kafka Replicator, Schema, and Connect, shown a second time]
Apps Tolerate
Apps Fail-Over to DR
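For the "Network Latency & Jitter" experiment, a Chaos Mesh `NetworkChaos` resource is one way to inject the fault. The sketch below builds such a manifest as a Python dictionary and prints it as JSON. It assumes the brokers run on Kubernetes with Chaos Mesh installed, which the deck does not state, and the resource name, namespace, label, and timing values are all hypothetical.

```python
import json


def network_delay_experiment(name, namespace, pod_labels, latency, jitter, duration):
    """Build a Chaos Mesh NetworkChaos manifest that delays traffic for one matching pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "one",  # pick a single pod from the selector
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": pod_labels,
            },
            "delay": {"latency": latency, "jitter": jitter},
            "duration": duration,
        },
    }


# All values below are hypothetical placeholders.
manifest = network_delay_experiment(
    name="broker-latency-jitter",
    namespace="kafka",
    pod_labels={"app": "kafka-broker"},
    latency="200ms",
    jitter="50ms",
    duration="5m",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.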
Streaming Platform – Resiliency Test Framework
Kafka Client Applications
Optimize Quotas and re-evaluate
Thank You!
Kafka Summit London 2024
S Vinod Kumar
/s-vinod-kumar
FIND YOUR FIDELITY
