Hope is not a course of action!
A practical deep dive into Observability of Streaming Applications
ING
Tim van Baarsen & Kosta Chuturkov
About the Speakers
The Netherlands - Amsterdam
Team Dora
Romania - Bucharest
ING
https://www.ing.jobs/
• 60,000+ employees
• Serve 37+ million customers
• Corporate clients and financial institutions in over 40 countries
Kafka @ ING
Frontrunners in Kafka since 2014
Running in production:
• 8 years
• 6000+ topics
• Serving 1000+ Development teams
• Self service topic management
Kafka @ ING
Traffic is growing by 10% per month
[Chart: Messages produced per second (average), 2015 to 2023; y-axis from 0 to 1,200,000]
What are we going to cover today?
• What is Observability and why do we need this?
• Three Pillars of observability
• Exposing Kafka Client Side Metrics
• How to interpret them? Consumer lag
• Kafka client-side metrics demo
• OpenTelemetry
• OpenTelemetry Collector
• Demo (Spring + Kafka + distributed tracing/logging/metrics)
• Wrap Up
• Questions
Introduction to application Observability
Observability is the ability to measure the internal state of a system
solely from its external outputs (logs, metrics, and traces).
Why do we need this?
• Helps to investigate root causes of incidents
• Improve our software
• Prevent outages
• Better user experience for our customers
Three Pillars of Observability
Logging (events):
• Records individual events and data
• Helps: understand what happened
• Hard to tell: the context
Metrics (aggregatable):
• Records time series data that can be aggregated
• Helps: understand the context, identify trends, alert
• Hard to tell: why something is not working as expected
Tracing:
• Records the flow through the application(s) and the interactions between services
• Helps: understand why something happened
Logs
2023-04-06T09:11:47.341Z INFO [spring-kafka-producer,210d9dc16597a0a8f1a9746c4bbd8277,6b350c56f69a6ee6] 7 --- [nio-8080-exec-1] c.e.rest.StockQuoteRestController : Produce stock quote via Rest API
{
"scope": {
"name": "com.example.rest.StockQuoteRestController"
},
"logRecords": [
{
"timeUnixNano": "1680772307341000000",
"severityNumber": 9,
"severityText": "INFO",
"body": {
"stringValue": "Produce stock quote via Rest API"
},
"flags": 1,
"traceId": "210d9dc16597a0a8f1a9746c4bbd8277",
"spanId": "6b350c56f69a6ee6"
}
]
}
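The bracketed part of the log line carries the application name, the trace id and the span id, and the same two ids appear in the OTLP log record. As a minimal sketch of how that correlation can be used, the helper below pulls the ids out of a log line. The class name `LogCorrelation` and the regular expression are illustrative assumptions (not part of the demo codebase), and the pattern only covers the `[application,traceId,spanId]` layout shown above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class LogCorrelation {
    // Assumes "[application,traceId,spanId]" with a 32-hex-char trace id
    // and a 16-hex-char span id, as in the log line above.
    private static final Pattern IDS =
            Pattern.compile("\\[([^,\\]]+),([0-9a-f]{32}),([0-9a-f]{16})\\]");

    static Optional<String> traceId(String logLine) {
        return group(logLine, 2);
    }

    static Optional<String> spanId(String logLine) {
        return group(logLine, 3);
    }

    private static Optional<String> group(String logLine, int index) {
        Matcher m = IDS.matcher(logLine);
        return m.find() ? Optional.of(m.group(index)) : Optional.empty();
    }

    public static void main(String[] args) {
        String line = "2023-04-06T09:11:47.341Z INFO "
                + "[spring-kafka-producer,210d9dc16597a0a8f1a9746c4bbd8277,6b350c56f69a6ee6] "
                + "7 --- [nio-8080-exec-1] c.e.rest.StockQuoteRestController : "
                + "Produce stock quote via Rest API";
        // The extracted trace id is what you would search for in the tracing backend.
        System.out.println(traceId(line).orElse("no trace id"));
    }
}
```

A line without the bracketed ids (for example one logged outside any trace) simply yields an empty `Optional`.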
Tracing
Trace Id: dfb987a57cbb9fb6e13a00d68413efa4
(Root) Span Id: 06eeb7d5c3e8327e
Span Id: 0912845005f1a035
Span Id: 808740d73506eee2
[Diagram: spans of Service A, Service B and Service C laid out over time as a sequence of operations, showing the duration of a single span and the duration of the total trace]
Context is propagated over the network (for Kafka: in a Kafka header)
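To make "context is propagated in a Kafka header" concrete, here is a minimal sketch that builds such a header value. It assumes the W3C Trace Context `traceparent` format (`version-traceId-spanId-flags`), which is OpenTelemetry's default propagation format; the slides only say "Kafka header", so the format and the class name `TraceParent` are assumptions. The check covers length and hex digits only, not the all-zero ids that the specification also rejects.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class TraceParent {
    private static final Pattern TRACE_ID = Pattern.compile("[0-9a-f]{32}");
    private static final Pattern SPAN_ID = Pattern.compile("[0-9a-f]{16}");

    // W3C Trace Context layout: version "00", trace id, parent span id, trace flags.
    static String format(String traceId, String spanId, boolean sampled) {
        if (!TRACE_ID.matcher(traceId).matches() || !SPAN_ID.matcher(spanId).matches()) {
            throw new IllegalArgumentException("ids must be lowercase hex (32 and 16 chars)");
        }
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    // Kafka record headers carry raw bytes, so the value is UTF-8 encoded.
    static byte[] headerValue(String traceId, String spanId, boolean sampled) {
        return format(traceId, spanId, sampled).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Ids taken from the tracing slide above.
        System.out.println(
                format("dfb987a57cbb9fb6e13a00d68413efa4", "06eeb7d5c3e8327e", true));
        // prints 00-dfb987a57cbb9fb6e13a00d68413efa4-06eeb7d5c3e8327e-01
    }
}
```

In practice the instrumentation library writes and reads this header for you; the sketch only shows what travels with the record.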
Metrics: Interceptors
• Kafka-clients API is pluggable
• Confluent Monitoring Interceptor
• Interceptors (interceptor.classes)
• Consumer
• Producer
• Send metrics to Kafka topic (_confluent-monitoring)
• Confluent Control Center
// Consumer: use ConsumerConfig and the consumer interceptor
consumerProperties.put(
ConsumerConfig.INTERCEPTOR_CLASSES_CONFIG,
"io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor");
// Producer: use ProducerConfig and the producer interceptor
producerProperties.put(
ProducerConfig.INTERCEPTOR_CLASSES_CONFIG,
"io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor");
Metrics: Interceptors
Consumer lag metrics in Confluent Control Center
Metrics: Metric Reporters
• Metric Reporters (metric.reporters)
• Default JMX
• View metrics in JConsole
• Prefer metrics in a time series database!
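What JConsole shows can also be read programmatically. The sketch below queries the platform MBean server for the consumer's `records-lag-max` attribute on the `kafka.consumer:type=consumer-fetch-manager-metrics` MBeans that the Kafka client registers through its default JMX reporter. The class name `KafkaJmxLag` is an assumption for illustration; in a JVM without a running Kafka consumer the result is simply empty.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper for illustration only.
final class KafkaJmxLag {
    // Returns client-id -> records-lag-max for every consumer in this JVM.
    static Map<String, Object> recordsLagMax(MBeanServer server) {
        Map<String, Object> result = new TreeMap<>();
        try {
            ObjectName pattern = new ObjectName(
                    "kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*");
            for (ObjectName name : server.queryNames(pattern, null)) {
                result.put(name.getKeyProperty("client-id"),
                        server.getAttribute(name, "records-lag-max"));
            }
        } catch (JMException e) {
            throw new IllegalStateException("Could not read Kafka client metrics", e);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints an empty map unless a Kafka consumer runs in the same JVM.
        System.out.println(recordsLagMax(ManagementFactory.getPlatformMBeanServer()));
    }
}
```

This is a point-in-time read, which is exactly why the slide recommends shipping the values to a time series database instead.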
Metrics: Java Agent
Your Kafka
Application
HTTP JMX Exporter
Prometheus Agent
JVM process
java -javaagent:./jmx_prometheus_javaagent-0.18.0.jar=1234:kafka_clients.yml -jar your-kafka-application.jar
Exposes an HTTP endpoint serving metrics of the local JVM
Metrics: Spring Boot & Micrometer
What is Spring Boot?
• De facto standard for building Java applications
• Very opinionated
• ‘Fat’ JAR (executable + all dependencies)
• Embedded webserver
• Production ready features
• Micrometer is the default instrumentation / observability library
Metrics: Spring Boot & Micrometer
What is Micrometer?
• Vendor-neutral application observability façade
• Think SLF4J, but for observability
• Metrics & Traces
• API to instrument your application
• Used by Spring projects
• Support for 19 popular monitoring systems
• Out of the box metrics, traces for Kafka (using Spring Kafka)
Micrometer.io
Metrics: Spring Boot & Micrometer
[Diagram: Prometheus scrapes metrics from the /actuator/prometheus endpoint of the Spring Boot application (a Kafka consumer application built on Spring Kafka and the Kafka client); Grafana sits next to Prometheus]
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
Helps to manage & monitor your application
• Metrics
• Health check
• Application info (version, git commit, etc)
/actuator/metrics
Query
Metrics: Spring Boot & Micrometer
[Diagram: same Spring Boot application (Kafka consumer application, Spring Kafka, Kafka client), now with Elasticsearch and Kibana added next to Prometheus and Grafana]
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-elastic</artifactId>
</dependency>
Push metrics
Metrics: Spring Boot & Micrometer
[Diagram: same Spring Boot application (Kafka consumer application, Spring Kafka, Kafka client), now with an OpenTelemetry metrics backend added next to Prometheus/Grafana and Elasticsearch/Kibana]
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-otlp</artifactId>
</dependency>
Push metrics
(OTLP protocol)
Metrics: Client side metrics Demo
[Diagram: a Spring Kafka producer, called through a REST API, sends to the Kafka topic ‘stock-quotes’ (3 partitions) on the Kafka broker. A Spring Kafka consumer polls the topic and makes an HTTP REST call to a slow service. Prometheus scrapes metrics over HTTP, including from Kafka Lag Exporter, which calculates consumer lag based on broker data. Grafana sits on top of Prometheus.]
Demo
Consumer lag metrics difference, why?
Problem: as a developer I don’t expect a big difference between the client-side and broker-side lag metrics.
The consumer
• Reports lag only for the partition(s) it is actively consuming
• High lag? It doesn't switch partitions that often
• Only knows the latest offset as of its most recent metadata pull
• Consumes up to that offset and thinks the lag is gone because it has read up to that message
The producer
• Is still producing
• The actual end offset grew in the meantime
• The consumer is not aware of that yet
The more instances you start, the more partition metrics get reported
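A small worked example of why the two numbers differ: lag is the end offset of the partition minus the consumer's offset, and the two sides use a different end offset. The offsets below are made-up values and `LagComparison` is a hypothetical class name; the point is only the arithmetic.

```java
// Hypothetical example for illustration only; all offsets are made up.
final class LagComparison {
    // Lag is the distance between the end of the partition log and the consumer's offset.
    static long lag(long logEndOffset, long consumerOffset) {
        return Math.max(0, logEndOffset - consumerOffset);
    }

    public static void main(String[] args) {
        long endOffsetSeenByConsumer = 1_000; // end offset as of the consumer's last pull
        long consumerOffset = 1_000;          // consumer has read everything it knows about
        long actualEndOffset = 1_450;         // producer kept writing in the meantime

        // Client-side view: stale end offset, so the consumer reports no lag.
        System.out.println("client-side lag: " + lag(endOffsetSeenByConsumer, consumerOffset));
        // Broker-side view: current end offset, so the lag is still visible.
        System.out.println("broker-side lag: " + lag(actualEndOffset, consumerOffset));
    }
}
```

With these values the client-side metric reports 0 while the broker-side calculation reports 450, even though both are "correct" from their own point of view.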
Lessons Learned
Takeaways
• Metric precision is not critical
• Monitor the trend
• Important: know that there is lag
• Lag keeps on increasing?
• The consumer has a problem!
• Alert on an increasing trend of consumer lag
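One way to turn "alert on the trend, not on the exact value" into code is to fit a line through recent lag samples and alert when its slope stays above a threshold. This is a sketch under assumptions: the samples are equally spaced in time, and the class name `LagTrend` and the threshold are illustrative. In a real setup the same idea would usually live in an alerting rule of the metrics backend rather than in application code.

```java
// Hypothetical helper for illustration only.
final class LagTrend {
    // Least-squares slope of equally spaced lag samples,
    // expressed in records per sample interval.
    static double slope(long[] lagSamples) {
        int n = lagSamples.length;
        if (n < 2) {
            return 0.0; // not enough data to speak of a trend
        }
        double meanX = (n - 1) / 2.0;
        double meanY = 0.0;
        for (long y : lagSamples) {
            meanY += y;
        }
        meanY /= n;

        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < n; i++) {
            numerator += (i - meanX) * (lagSamples[i] - meanY);
            denominator += (i - meanX) * (i - meanX);
        }
        return numerator / denominator;
    }

    // Alert when lag grows faster than maxSlope records per sample interval.
    static boolean shouldAlert(long[] lagSamples, double maxSlope) {
        return slope(lagSamples) > maxSlope;
    }

    public static void main(String[] args) {
        long[] growing = {100, 200, 300, 400}; // lag keeps on increasing
        long[] noisy = {500, 480, 510, 490};   // high but stable lag
        System.out.println(shouldAlert(growing, 10.0)); // prints true
        System.out.println(shouldAlert(noisy, 10.0));   // prints false
    }
}
```

Note that the noisy series has a higher absolute lag than the growing one, yet only the growing series triggers the alert.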
What is OpenTelemetry?
OpenCensus + OpenTracing = OpenTelemetry
What is Software Telemetry?
• Collection of data on the use, performance and behaviour of applications and their
components
OpenTelemetry Components
API
SDKs
Collector
Protocol (OTLP)
Per Programming Language
Functionality to collect, process and
export telemetry data
Encoding
Transport
Delivery
Prometheus
Kafka
Jaeger
Application
Telemetry Data
Using Exporter
Library
Telemetry Data
Using OTLP
(gRPC or http/protobuf)
Telemetry Data
Backend specific format
Observability and Storage
Library
Auto(Agent)
Defines the interface for instrumenting
code with traces, metrics and logs
OpenTelemetry Collector Components
• Optional component
• No change in Application needed when switching Logging/Metrics/Tracing
Backends
Collector pipeline: Receivers → Processors → Exporters
Receivers (examples): Prometheus, Kafka
Processors (how to handle received data): Memory Limiter, Batch Processor, Filter Processor
Exporters (examples): Tempo, Jaeger, Kafka
OpenTelemetry Demo: Components
[Diagram: a client makes a REST call to the Spring Kafka producer, which sends to the topic ‘stock-quotes’ on the Kafka broker. A Kafka Streams app polls ‘stock-quotes’ and sends to the topics ‘stock-quotes-exchange-nyse’, ‘stock-quotes-exchange-nasdaq’ and ‘stock-quotes-exchange-ams’. A Spring Kafka consumer polls and makes an HTTP REST call to a shaky downstream service. A plain Kafka consumer is also part of the setup.]
OpenTelemetry Demo
[Diagram, built up over four slides: the demo applications (Kafka producer application, Kafka Streams application, Kafka consumer application, plain Kafka consumer application, shaky downstream service) around Kafka send telemetry data (logs, metrics & traces) to the Collector.]
• Metrics: scraped by Prometheus (time series database for metrics)
• Logs: Grafana Loki (centralized logging), viewed in Grafana
• Traces: Jaeger and Grafana Tempo
Demo
Wrap up
• Many different ways to observe your applications
• Aim for vendor neutral solutions
• Helps you migrate to a different observability backend
• Minimal changes to your applications
• Micrometer
• Overlap with OpenTelemetry
• JVM only
• No instrumentation for Kafka Streams (Traces) yet:
• https://github.com/micrometer-metrics/micrometer/issues/3713
• Can send telemetry data using OpenTelemetry
• ✅ Metrics
• ✅ Traces (Spring Boot 3)
• ❌ Logs
Wrap up
• Consumer (lag) metrics
• Client side vs broker side metrics
• Monitor the trend!
• Select and ship only the metrics you need
Wrap up
• OpenTelemetry
• Language agnostic
• Specification for Logs not stable
• Start small
• Java Agent will give you a kickstart
• Minimum dev effort using auto instrumentation
• Sample traces. You most likely don’t want to store 100% of all traces
• Limitations
• Stateful stream processing: KAFKA-7718
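On "sample traces": a common approach is to decide per trace id, so that every service reaches the same keep-or-drop decision for the same trace without coordination. The sketch below is a simplified, hand-written version of that idea and is not the OpenTelemetry SDK's sampler; the class name `RatioSampler` and the exact threshold arithmetic are assumptions for illustration.

```java
// Hypothetical, simplified ratio sampler for illustration only.
final class RatioSampler {
    private final long upperBound;

    RatioSampler(double ratio) {
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be in [0, 1]");
        }
        // Share of the non-negative long range that gets sampled.
        this.upperBound = (long) (ratio * Long.MAX_VALUE);
    }

    // Deterministic: the same trace id always gives the same decision.
    boolean shouldSample(String traceId) {
        if (traceId.length() != 32) {
            throw new IllegalArgumentException("trace id must be 32 hex chars");
        }
        if (upperBound == Long.MAX_VALUE) {
            return true; // ratio 1.0 keeps everything
        }
        // Use the low 64 bits of the trace id and clear the sign bit.
        long low = Long.parseUnsignedLong(traceId.substring(16), 16);
        return (low & Long.MAX_VALUE) < upperBound;
    }

    public static void main(String[] args) {
        RatioSampler tenPercent = new RatioSampler(0.10);
        System.out.println(tenPercent.shouldSample("dfb987a57cbb9fb6e13a00d68413efa4"));
    }
}
```

Because the decision depends only on the trace id, a trace is either stored completely or not at all, which keeps sampled traces usable end to end.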
Questions?
🤔
❔
Demo codebase: https://github.com/j-tim/kafka-summit-london-2023
A Practical Deep Dive into Observability of Streaming Applications with Kosta Chuturkov & Tim van Baarsen
