arafshhomeMy DocumentsMonitoring and Resiliency Testing our Apache Kafka Clusters Send 1.pptx
Monitoring and Resiliency Testing our Apache Kafka Clusters
Ameya Panse
Associate
Goldman Sachs
Sheikh Araf
Associate
Goldman Sachs
We’re helping clients build a treasury of the future and powering
software partners to enhance their offerings.
Differentiated Platform
Apache Kafka Backbone
[Diagram: an Apache Kafka Cluster at the center, connected to Payment Service, Ledger, Reporting, Validation, Data Platform, Payment Rails, and CRM]
Two Sides to Monitoring Kafka Infrastructure
[Diagram: the Kafka Cluster on one side; on the other, the clients around it: Producers, Consumers, Stream Processors, and Connectors, shown as apps and databases (DB)]
Kafka Brokers
• CPU Usage
• Disk Usage
• Network Tx Packets
• Network Rx Packets
• Leader Count
• Under-Replicated Partition Count
Kafka Clients
• Producer Error Rate
• Active Consumer Connections
• Producer Retry Rate
• Consumer Connection Close Rate
• Produce & Consume Latency
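Several of the metrics listed above correspond to standard Kafka JMX MBeans, which is what a JMX agent would typically scrape. The sketch below pairs the slide's labels with those MBeans. The slides do not say which MBeans or attributes the team actually collects, so the pairing is an assumption; CPU, disk, and network packet counts are host-level metrics rather than JMX metrics and are left out. The helper name `mbeans_to_collect` is hypothetical.

```python
# Assumed mapping from the slide's metric labels to standard Kafka JMX MBeans.
# Each value is (MBean object-name pattern, attribute to read).
BROKER_METRICS = {
    "Leader Count": (
        "kafka.server:type=ReplicaManager,name=LeaderCount", "Value"),
    "Under-Replicated Partition Count": (
        "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions", "Value"),
}

CLIENT_METRICS = {
    "Producer Error Rate": (
        "kafka.producer:type=producer-metrics,client-id=*", "record-error-rate"),
    "Producer Retry Rate": (
        "kafka.producer:type=producer-metrics,client-id=*", "record-retry-rate"),
    "Produce Latency": (
        "kafka.producer:type=producer-metrics,client-id=*", "request-latency-avg"),
    "Active Consumer Connections": (
        "kafka.consumer:type=consumer-metrics,client-id=*", "connection-count"),
    "Consumer Connection Close Rate": (
        "kafka.consumer:type=consumer-metrics,client-id=*", "connection-close-rate"),
    "Consume Latency": (
        "kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*",
        "fetch-latency-avg"),
}


def mbeans_to_collect(*tables):
    """Group attributes by MBean pattern, one entry per MBean to scrape."""
    grouped = {}
    for table in tables:
        for bean, attribute in table.values():
            grouped.setdefault(bean, []).append(attribute)
    return grouped
```

Grouping by MBean keeps the agent configuration short: one object-name pattern per client type, each with a small list of attributes.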
Monitoring the Clients
Collecting Client Metrics
[Diagram: two deployment styles feeding the same pipeline]
• Containers: Application Container with a JMX Agent Sidecar
• Virtual Machine: Application Process with a JMX Agent Process
• Datadog Backend
• Datadog Dashboard
• PagerDuty Alerts
Frictionless onboarding
main.tf:
module "my-app" {
  source = "..."
  ...
  kafka_app_name = "my-service"
}
Behind the scenes:
• Add a JMX agent sidecar
• Configure agent to collect Kafka client metrics
• Configure agent to authenticate and push metrics to dashboard
Monitoring TLS Certificate Expiry
pom.xml:
<dependency>
  <groupId>com.gs.txb.security</groupId>
  <artifactId>certificate-info-endpoint</artifactId>
  <version>1.0.0</version>
</dependency>
application.properties:
management.server.port=PORT
management.server.keystore=KeystoreLocation
[Diagram: Containers. The JMX Agent Sidecar polls the Application Container's /certs endpoint and emits a metric]
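The metric a sidecar derives from a certificate endpoint is usually the number of days left before expiry. The sketch below shows that calculation only. The response format of /certs is not shown in the slides, so the input shape (a mapping of certificate alias to an ISO 8601 notAfter timestamp) and both function names are assumptions.

```python
from datetime import datetime, timezone


def days_until_expiry(not_after_iso, now=None):
    """Days left before the notAfter timestamp, floored (negative once expired)."""
    now = now or datetime.now(timezone.utc)
    not_after = datetime.fromisoformat(not_after_iso)
    return (not_after - now).days


def expiry_metrics(certs, now=None):
    """Turn {alias: notAfter ISO string} into {alias: days remaining}."""
    return {alias: days_until_expiry(ts, now) for alias, ts in certs.items()}
```

An alert can then fire when any value drops below a chosen threshold, for example 30 days, leaving time to rotate the certificate before clients start failing TLS handshakes.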
Monitoring the Cluster
Synthetic Monitoring with Heartbeat App
• Broker Availability
• Cross-Zone Connectivity
• Monitor Support Services
• Real Time Alerting
• End to End Monitoring
Synthetic Monitoring Algorithm
[Diagram: a Kafka Cluster with Brokers 1 to 3, each the leader of one of Partitions 1 to 3. Producer Containers 1 to 3 and Consumer Containers 1 to 3, each with a JMX Sidecar, run across Data Centers 1 to 3 and report to the Dashboard]
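The slides show the heartbeat topology but not the algorithm itself, so the following is a minimal sketch under stated assumptions: every producer sends a heartbeat to every partition (so every leader broker is exercised from every data center), consumers record when each heartbeat arrives, and a path with no recent heartbeat is flagged. The `Heartbeat` record, the `evaluate` function, and the 30-second threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Heartbeat:
    producer_dc: str    # data center the producer runs in
    partition: int      # partition (and so leader broker) the message went through
    sent_at: float      # producer timestamp, in seconds
    received_at: float  # consumer timestamp, in seconds


def evaluate(heartbeats, data_centers, partitions, now, max_age=30.0):
    """Return (latencies, stale) over every producer-DC/partition path.

    latencies maps each observed path to the end-to-end latency of its newest
    heartbeat; stale lists paths with no heartbeat received in max_age seconds.
    """
    newest = {}
    for hb in heartbeats:
        key = (hb.producer_dc, hb.partition)
        if key not in newest or hb.received_at > newest[key].received_at:
            newest[key] = hb
    latencies = {k: hb.received_at - hb.sent_at for k, hb in newest.items()}
    stale = [
        (dc, p)
        for dc in data_centers
        for p in partitions
        if (dc, p) not in newest or now - newest[(dc, p)].received_at > max_age
    ]
    return latencies, stale
```

Checking every data center against every partition is what turns a simple liveness ping into a test of broker availability and cross-zone connectivity at once: a stale path points at a specific leader broker or network route.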
Consolidating the Metrics
Dashboards and Alerts
Alerts
Putting it all together
Culture of Resiliency Game Days
Failure Scenarios
[Three diagrams, each showing a Kafka Cluster spanning Zones 1 to 4 with two brokers per zone]
• Loss of one broker in one zone
• Loss of two brokers in one zone
• Loss of two brokers in different zones
Topic: Replication factor = 3
Cluster: Min in-sync replicas = 2
Producer: Acks = all
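What these three settings imply for each failure scenario can be worked out per partition. The sketch below applies standard Kafka semantics: with acks=all, a write succeeds only while at least min in-sync replicas are available. It assumes the best case in which every surviving replica is in sync, and the example assumes each partition's three replicas sit in three different zones, which the slides do not state. The function name and broker labels are hypothetical.

```python
def partition_status(replica_brokers, failed_brokers, min_insync_replicas=2):
    """Classify one partition after broker failures, for producers using acks=all."""
    alive = [b for b in replica_brokers if b not in failed_brokers]
    if not alive:
        return "offline"      # no replica left to act as leader
    if len(alive) < min_insync_replicas:
        return "read-only"    # a leader exists, but acks=all writes are rejected
    return "writable"


# A partition with replication factor 3, one replica per zone (assumed placement).
replicas = ["z1-a", "z2-a", "z3-a"]
one_broker_one_zone = partition_status(replicas, {"z1-a"})
two_brokers_one_zone = partition_status(replicas, {"z1-a", "z1-b"})
two_brokers_two_zones = partition_status(replicas, {"z1-a", "z2-a"})
```

Under these assumptions the first two scenarios leave the partition writable, because at most one of its replicas is lost. Losing two brokers in different zones can take out two replicas of the same partition, which leaves it readable but rejects acks=all writes until a replica returns.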
Game Days
[Diagram: layers labeled Business Services, Transaction Banking Stack (four apps), and Apache Kafka Backbone, with a Stream of Payment Messages]
• Track system health via Kafka dashboard
• Assert all payments are successful
• Ensure all applications recover automatically
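The "assert all payments are successful" step amounts to reconciling what was submitted during the game day against what completed. The slides do not describe how the team performs this check, so the sketch below is one assumed approach: compare payment IDs and flag any that are missing or were processed more than once. The function name and ID format are hypothetical.

```python
from collections import Counter


def reconcile(submitted_ids, completed_ids):
    """Compare submitted payment IDs with completed ones after a failure drill."""
    done = Counter(completed_ids)
    missing = sorted(set(submitted_ids) - set(done))
    duplicated = sorted(pid for pid, count in done.items() if count > 1)
    return {
        "missing": missing,
        "duplicated": duplicated,
        "passed": not missing and not duplicated,
    }
```

Checking for duplicates as well as losses matters in a payments context, since a producer retry or consumer restart during a broker failure can cause the same message to be processed twice.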
In Summary…
• Apache Kafka as backbone for processing payments
• Monitor cluster health with synthetic traffic
• Monitor clients using a JMX agent sidecar
• Simplify the onboarding process to improve monitoring coverage
• One-stop view to monitor the health of the infrastructure
• A culture of regular resiliency testing
Thank You
Ameya Panse
ameya.panse@gs.com
Sheikh Araf
sheikh.araf@gs.com

Monitoring and Resiliency Testing our Apache Kafka Clusters at Goldman Sachs | Sheikh Araf and Ameya Panse, Goldman Sachs


Editor's Notes

  • #3: Our mission is simple: provide a global transaction banking platform that is nimble, secure, and easy for clients to use and for partners to connect to.