BEGINNER'S GUIDE TO
OBSERVABILITY
Michał Niczyporuk
@mihn
ABOUT ME
WHAT IS
OBSERVABILITY?
OBSERVABILITY
Americanized: Observability -> o11y
THE DEFINITION
“In control theory, observability is a
measure of how well internal states of a
system can be inferred from knowledge of
its external outputs”
Wikipedia
LET'S ASK THE ORACLE
WHAT FOR?
Way of determining state of the system
Observe trends
Spot anomalies
Debug errors
Gather data to support decision process
Measure user experience
LOGGING
LOGGING PRIME DIRECTIVE
Don't use System.out.println()
LOGGING LIBRARY ECOSYSTEM
DEFINE LOGGER
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class UserService {
    // One logger per class, named after the class
    private static final Logger LOGGER = LoggerFactory.getLogger(UserService.class);
    (...)
}
USE LOGGER
private int myMethod(int x) {
    // {} is a placeholder filled in with the argument that follows
    LOGGER.info("running my great code with x={}", x);
    int y = methodX(x);
    LOGGER.info("my great code finished with y={}", y);
    return y;
}
2023-03-21 19:24:43,829 [main] INFO UserService - running my gre
2023-03-21 19:24:43,829 [main] INFO UserService - my great code
"TALKING TO A VOID" DEBUGGING
private void thisCallsMethodX() {
    LOGGER.info("PLEASE WORK");
    methodY();
    LOGGER.info("SHOULD HAVE WORKED");
}
DOS
Log meaningful checkpoints
Use correct logging levels
Add ids to logging context:
e.g. correlation id, user id
High-throughput = async appender
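To illustrate "add ids to logging context", here is a minimal sketch of the idea. In SLF4J this role is played by the MDC (mapped diagnostic context). The `LogContext` class below is a hypothetical, dependency-free stand-in built on a plain `ThreadLocal`, and its name, methods and output format are assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a logging context such as SLF4J's MDC.
final class LogContext {
    // One map per thread, so concurrent requests do not see each other's ids.
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(LinkedHashMap::new);

    static void put(String key, String value) {
        CONTEXT.get().put(key, value);
    }

    // Must be called when the request ends: pooled threads are reused.
    static void clear() {
        CONTEXT.remove();
    }

    // Prefixes the message with every id currently stored in the context.
    static String format(String message) {
        StringBuilder sb = new StringBuilder();
        CONTEXT.get().forEach((k, v) -> sb.append(k).append('=').append(v).append(' '));
        return sb.append(message).toString();
    }
}
```

Typical usage would be to call `put` once at the edge of the service (e.g. in a request filter), log as usual, and call `clear` in a `finally` block. With a real logging library the ids are then added to each line by the layout pattern rather than by hand.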
DO NOTS🍩
Logs can be lost
Logs are not audit trail
Logs are not metrics
Logs are not data warehouse/lake
Alerts based on logs are flimsy - metrics-based alerts are better
WHAT NOT TO LOG
Personal information, secrets, session tokens, etc.
OWASP Logging Cheat Sheet (thanks @piotrprz)
CENTRALIZED LOGS
CENTRAL STORAGE
Software that can ingest, process and search the logs
PUBLISHING LOGS
LOGS FORMAT
Unified format (e.g. JSON, logfmt, OpenTelemetry Protocol (OTLP), vendor specific)
regexes = two problems
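As a small illustration of a unified, machine-parseable format, the sketch below renders a log event as a logfmt line (`key=value` pairs separated by spaces). The `Logfmt` class and its quoting rules are simplified assumptions; a real setup would more likely use the structured encoder shipped with the logging library.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: renders fields as a single logfmt line.
final class Logfmt {
    static String format(Map<String, String> fields) {
        StringJoiner line = new StringJoiner(" ");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            line.add(e.getKey() + "=" + quoteIfNeeded(e.getValue()));
        }
        return line.toString();
    }

    // Values that are empty or contain spaces, '=' or quotes are wrapped in double quotes.
    private static String quoteIfNeeded(String value) {
        boolean needsQuotes = value.isEmpty() || value.contains(" ")
                || value.contains("=") || value.contains("\"");
        if (!needsQuotes) {
            return value;
        }
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

Because every field is an explicit key, the central storage can index `user_id` or `level` directly instead of extracting them from free text with regexes.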
GROWTH RULE OF THUMB
TOOLING
Cloud providers:
AWS: CloudWatch Logs
Azure: Monitor Logs
GCP: Cloud Logging
Elastic Stack = Elasticsearch + Logstash/Filebeat + Kibana
Graylog
Grafana Loki
METRICS
METRICS
numbers
indexed over time
and multiple additional dimensions
=== timeseries
STORAGE
timeseries database
PUBLISHING METRICS
FORMAT
Storage specific
Vendor neutral: OpenTelemetry
METRICS TYPES
Prometheus
COUNTER
Used for: requests count, jobs count, errors
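A counter only ever goes up, so what is usually graphed is its rate of change. The sketch below shows one way to derive a per-second rate from two samples. Treating a drop in value as a restart from zero is an assumption of this example (similar in spirit to how Prometheus-style queries handle counter resets), and `CounterRate` is a hypothetical name.

```java
// Hypothetical helper: per-second rate between two samples of a counter.
final class CounterRate {
    static double perSecond(long previous, long current, double secondsBetween) {
        if (secondsBetween <= 0) {
            throw new IllegalArgumentException("secondsBetween must be positive");
        }
        // A lower value than before means the process restarted and the
        // counter began again from zero, so the whole current value is new.
        long increase = current >= previous ? current - previous : current;
        return increase / secondsBetween;
    }
}
```

For example, going from 100 to 160 requests in 60 seconds gives 1 request per second, and a restart does not produce a negative rate.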
GAUGE
Used for memory usage, thread/connection pools
count, etc.
Tip: Store actual current and maximum value, not the
percentages
HISTOGRAM
used for request latencies and sizes
can calculate any percentiles via query
Tip: preconfigured bucket sizes = better performance
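To make "can calculate any percentiles via query" concrete, here is a minimal sketch of a bucketed histogram that estimates a quantile by linear interpolation inside the matching bucket. The class name, the bucket bounds used in the example and the handling of the overflow bucket are assumptions of this sketch, not the behaviour of any particular metrics library.

```java
// Hypothetical bucketed histogram with quantile estimation.
final class LatencyHistogram {
    private final double[] upperBounds; // ascending bucket upper bounds, e.g. in ms
    private final long[] counts;        // one slot per bucket plus one overflow slot
    private long total;

    LatencyHistogram(double... upperBounds) {
        if (upperBounds.length == 0) {
            throw new IllegalArgumentException("at least one bucket is required");
        }
        this.upperBounds = upperBounds.clone();
        this.counts = new long[upperBounds.length + 1];
    }

    void observe(double value) {
        int i = 0;
        while (i < upperBounds.length && value > upperBounds[i]) {
            i++;
        }
        counts[i]++;
        total++;
    }

    // Estimates quantile q (0..1), assuming values are spread evenly within a bucket.
    double quantile(double q) {
        if (total == 0) {
            return Double.NaN;
        }
        double rank = q * total;
        long cumulative = 0;
        for (int i = 0; i < upperBounds.length; i++) {
            long inBucket = counts[i];
            if (inBucket > 0 && cumulative + inBucket >= rank) {
                double lower = i == 0 ? 0.0 : upperBounds[i - 1];
                double fraction = (rank - cumulative) / inBucket;
                return lower + (upperBounds[i] - lower) * fraction;
            }
            cumulative += inBucket;
        }
        // Rank falls into the overflow bucket: report the highest finite bound.
        return upperBounds[upperBounds.length - 1];
    }
}
```

The estimate can never be more precise than the bucket it lands in, which is why bucket bounds should be chosen around the latencies that matter.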
SUMMARY
Histogram - but with percentiles (e.g. p50, p75, p95)
Calculated client side - lighter on the server than histograms
Caveat: calculated per scrape target, so percentiles cannot be aggregated across instances
METRIC ATTRIBUTES
METRIC RESOLUTION
Main dimension
How often metrics are probed
Long-term storage = downsampling/roll-up
1 min up to 7 days -> 10 min up to 30 days, etc.
YMMV - think about business cycles (e.g. Black Friday)
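A roll-up can be sketched as grouping consecutive fixed-interval samples and storing one aggregate per group. The example below averages each group. The `Downsampler` name and the choice of the mean as the aggregate are assumptions, and real time-series databases typically offer several aggregates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical roll-up: averages each group of `factor` consecutive samples.
final class Downsampler {
    static List<Double> rollUp(List<Double> samples, int factor) {
        if (factor <= 0) {
            throw new IllegalArgumentException("factor must be positive");
        }
        List<Double> result = new ArrayList<>();
        for (int start = 0; start < samples.size(); start += factor) {
            // The last group may be shorter than `factor`.
            int end = Math.min(start + factor, samples.size());
            double sum = 0;
            for (int i = start; i < end; i++) {
                sum += samples.get(i);
            }
            result.add(sum / (end - start));
        }
        return result;
    }
}
```

With `factor = 10`, 1-minute samples become 10-minute samples. Averaging hides short spikes, so keeping the maximum alongside the mean may be worth considering for latency-like metrics.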
METRIC TAGS/LABELS
Additional dimensions
Metadata: HTTP attributes, service name, hosts, cloud regions, etc.
Metrics labels don't like high cardinality
route="/users/:id" ✅
route="/users/2137" ❌
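When the framework does not expose the route template, one way to keep the label cardinality low is to normalize the raw path before using it as a label. The sketch below replaces purely numeric path segments with `:id`. The `RouteLabel` helper and the "digits only" rule are assumptions of this example, and UUIDs or slugs would need their own rules.

```java
// Hypothetical helper: turns a raw request path into a low-cardinality route label.
final class RouteLabel {
    static String normalize(String path) {
        String[] segments = path.split("/", -1);
        for (int i = 0; i < segments.length; i++) {
            // Replace segments made only of digits, e.g. numeric ids.
            if (!segments[i].isEmpty() && segments[i].chars().allMatch(Character::isDigit)) {
                segments[i] = ":id";
            }
        }
        return String.join("/", segments);
    }
}
```

This way `/users/2137` and `/users/42` both land in the same `route="/users/:id"` time series.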
WHAT TO MEASURE?
Kirk Pepperdine's The Box model
Your framework/libraries cover some basics already
KIRK PEPPERDINE'S THE BOX
"People": incoming requests, messages, jobs...
Application (Netty, Tomcat, Spring, etc.): thread pools, queue sizes, requests...
Runtime metrics (JVM): GC, memory usage, threads...
Hardware: CPU, memory, I/O, disk usage
HOW TO MEASURE?
The Four Golden Signals:
Latency
Traffic
Errors
Saturation
MEASUREMENT METHODS
RED vs USE
RED
Rate
Errors
Duration
Application focused
USE
Utilization
Saturation
Errors
Infrastructure focused
ONE METRIC TO RULE THEM ALL
APDEX
https://www.apdex.org/
More user experience focused - measures satisfaction
Single number
APDEX - SPLITTING THE POPULUS
Minimal sample size 100 - adjust time window
Measure as close to the user as possible
APDEX - CALCULATIONS
Result is a decimal between 0 and 1
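The standard Apdex formula is (satisfied + tolerating / 2) / total, where a sample is "satisfied" when its response time is at most the target threshold T, "tolerating" when it is between T and 4T, and "frustrated" above that. A minimal sketch, with a hypothetical `Apdex` class and an example threshold chosen purely for illustration:

```java
// Hypothetical helper: computes an Apdex score from raw response times.
final class Apdex {
    static double score(double[] responseTimesMs, double thresholdMs) {
        if (responseTimesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        int satisfied = 0;
        int tolerating = 0;
        for (double t : responseTimesMs) {
            if (t <= thresholdMs) {
                satisfied++;            // at most T
            } else if (t <= 4 * thresholdMs) {
                tolerating++;           // between T and 4T, counts as half
            }                           // above 4T: frustrated, counts as zero
        }
        return (satisfied + tolerating / 2.0) / responseTimesMs.length;
    }
}
```

For T = 500 ms and response times of 100, 200, 600, 1500 and 3000 ms there are 2 satisfied, 2 tolerating and 1 frustrated sample, so the score is (2 + 1) / 5 = 0.6.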
APDEX - INTERPRETING THE RESULT
0.94 ≤ X ≤ 1 : excellent
0.85 ≤ X ≤ 0.93 : good
0.70 ≤ X ≤ 0.84 : fair
0.50 ≤ X ≤ 0.69 : poor
0.00 ≤ X ≤ 0.49 : unacceptable
APDEX - CAVEATS
Very generic metric - hides details
Should be monitored closely after deployments
Should measure one functionality
Shouldn't be the only metric of application success
PERCENTILES
Latency distribution is not normal/Gaussian
Use high percentiles (p99+) - not averages or medians (p50)
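The effect of a skewed latency distribution is easy to see on a tiny data set. The sketch below uses the nearest-rank definition of a percentile; the `Percentiles` class and the sample numbers in the usage note are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical helper: nearest-rank percentile and mean over raw samples.
final class Percentiles {
    // Smallest sample such that at least p percent of all samples are <= it.
    static double percentile(double[] samples, double p) {
        if (samples.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    static double mean(double[] samples) {
        return Arrays.stream(samples).average().orElse(Double.NaN);
    }
}
```

With 98 requests at 10 ms and 2 requests at 1000 ms, the median is 10 ms and the mean is about 29.8 ms, while p99 is 1000 ms. Only the high percentile reveals what the slowest users actually experienced.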
DIFFERENCE OF OPINION?
COORDINATED OMISSION
1. Server measures in the wrong place
2. Requests can be stuck waiting for processing
Measure both clients and servers - if possible
"How not to measure latency" by Gil Tene
DEATH BY METRICS
Storing an unnecessary amount of metrics
Made possible by automation
Bad for infrastructure and cloud bills
Bad for your mental health
WRONG ASSUMPTIONS ABOUT METRICS
1. Metrics can't be lost
2. Metrics are precise
GROWTH RULE OF THUMB
Beginner's  Guide to Observability@Devoxx PL 2024
EXAMPLE - "MISSING" INDEX
TOOLS
Cloud providers:
AWS: CloudWatch Metrics
Azure: Monitor Metrics
GCP: Cloud Monitoring
OSS:
Prometheus + Grafana
Thanos
VictoriaMetrics
Elastic Stack
Libraries:
Micrometer - API abstraction
OpenTelemetry
DISTRIBUTED TRACING
TRACING
[Diagram: a request from the Internet passing through boxes labelled A, B and C; every span carries traceID = X]
span ID = A: HTTP Request processed (root span)
span ID = B, parent span ID = A: HTTP Request processed
span ID = C, parent span ID = B: HTTP Request processed
span ID = D, parent span ID = C: db query executed
span ID = E, parent span ID = A: HTTP Request processed
span ID = F, parent span ID = E: HTTP Request processed
span ID = G, parent span ID = F: redis query executed
ANATOMY OF A TRACE
Single trace = multiple spans
Each span contains
Trace ID
Span ID
timestamp and duration
Parent span ID, if applicable
All the metadata with any cardinality
STORAGE
PUBLISHING TRACES
TRACING IDS FORMAT
Propagated in headers (e.g. HTTP) or in metadata/attributes (e.g. Kafka)
Newer: W3C Trace Context
Older: Zipkin's B3
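For a feel of what is actually propagated, a W3C Trace Context `traceparent` header has four dash-separated fields: version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and 2 hex characters of flags (the lowest bit means "sampled"). The sketch below parses version `00` only. The `TraceParent` class is hypothetical, and in practice the tracing library does this parsing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for version 00 of the W3C traceparent header.
final class TraceParent {
    private static final Pattern FORMAT =
            Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    final String traceId;
    final String parentSpanId;
    final boolean sampled;

    private TraceParent(String traceId, String parentSpanId, boolean sampled) {
        this.traceId = traceId;
        this.parentSpanId = parentSpanId;
        this.sampled = sampled;
    }

    // Returns null when the header is missing or malformed.
    static TraceParent parse(String header) {
        if (header == null) {
            return null;
        }
        Matcher m = FORMAT.matcher(header);
        if (!m.matches()) {
            return null;
        }
        String traceId = m.group(1);
        String spanId = m.group(2);
        // All-zero ids are defined as invalid by the specification.
        if (traceId.equals("0".repeat(32)) || spanId.equals("0".repeat(16))) {
            return null;
        }
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) == 1;
        return new TraceParent(traceId, spanId, sampled);
    }
}
```

A service that receives such a header keeps the trace ID, creates its own span ID, and uses the received span ID as the parent of the new span.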
TRACE PROPAGATION
Trace ID/Span ID propagation
exporting spans to storage
TRACE PROPAGATION IN SERVICES
your stack might be already doing that for you
TRACE PROPAGATION IN INFRASTRUCTURE
STORING TRACES
HEAD SAMPLING VS TAIL SAMPLING
HEAD SAMPLING
Decision made without looking at whole trace
Random sample of traffic - law of large numbers makes it viable
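A common way to implement head sampling is to derive the decision from the trace ID itself, so that every service reaches the same verdict without coordination. The sketch below is an assumption-laden illustration of that idea (the `HeadSampler` class and its threshold arithmetic are made up for this example, and it relies on trace IDs being random hex strings of at least 16 characters).

```java
// Hypothetical trace-ID-based head sampler.
final class HeadSampler {
    private final boolean always;
    private final long threshold;

    HeadSampler(double ratio) {
        if (ratio < 0 || ratio > 1) {
            throw new IllegalArgumentException("ratio must be between 0 and 1");
        }
        this.always = ratio == 1.0;
        this.threshold = (long) (ratio * Long.MAX_VALUE);
    }

    // The decision depends only on the trace ID, never on the span's content.
    boolean shouldSample(String traceIdHex) {
        if (traceIdHex.length() < 16) {
            throw new IllegalArgumentException("trace id too short");
        }
        if (always) {
            return true;
        }
        // Use the lower 64 bits of the id and clear the sign bit.
        String low = traceIdHex.substring(traceIdHex.length() - 16);
        long value = Long.parseUnsignedLong(low, 16) & Long.MAX_VALUE;
        return value < threshold;
    }
}
```

Since the input is effectively random, roughly `ratio` of all traces is kept, and the decision is made before anything is known about errors or latency. That blind spot is exactly the limitation of head sampling.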
TAIL SAMPLING
Decision made after considering all or most spans from a trace
Allows for custom behaviour
TAIL SAMPLING
THROUGHPUT-BASED TAIL SAMPLING
samples up to defined number of spans per second
RULES-BASED TAIL SAMPLING
Storing traces matching custom policies
custom policies: spans with errors, slow traces, ignore healthchecks/metric endpoints/websockets, etc.
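The policies above can be sketched as a function that looks at a whole, completed trace and decides whether to store it. Everything in this example is hypothetical: the minimal `Span` model, the assumption that the first span in the list is the root, the `/health` and `/metrics` route names and the slowness threshold.

```java
import java.util.List;

// Hypothetical rules-based tail sampler.
final class TailSampler {
    // Minimal span model for this sketch only.
    static final class Span {
        final String route;
        final long durationMs;
        final boolean error;

        Span(String route, long durationMs, boolean error) {
            this.route = route;
            this.durationMs = durationMs;
            this.error = error;
        }
    }

    private final long slowThresholdMs;

    TailSampler(long slowThresholdMs) {
        this.slowThresholdMs = slowThresholdMs;
    }

    // Called once all (or most) spans of the trace have arrived.
    boolean keep(List<Span> trace) {
        if (trace.isEmpty()) {
            return false;
        }
        // Assumption: the first span is the root span of the trace.
        Span root = trace.get(0);
        if (root.route.equals("/health") || root.route.equals("/metrics")) {
            return false; // ignore healthchecks and metric endpoints
        }
        for (Span span : trace) {
            if (span.error || span.durationMs >= slowThresholdMs) {
                return true; // keep traces with errors or slow spans
            }
        }
        return false;
    }
}
```

Traces that match no rule are dropped here. In a mixed setup they would instead fall through to a throughput-based or random sampler so that healthy traffic is still represented.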
DYNAMIC TAIL SAMPLING
Aims to have a representation of span attribute values among collected traces in a timeframe
example: store 1 in 100 traces, but with representation of values in attributes http.status_code, http.method, http.route and service.name
MIXING SAMPLING METHODS
Sampling methods can and should be mixed
WHERE DOES SAMPLING HAPPEN?
TYPICAL PROBLEMS IN TRACING
MISSING INSTRUMENTATIONS
investigation needed
add missing instrumentation
wrap with custom span
MISSING SPANS
tools are smart enough to detect that
buffers for spans are too small
...or you hit a rate limiter somewhere
GROWTH RULE OF THUMB
EXAMPLE
LAST TIP
Put trace and span IDs in logs
TOOLS
Cloud providers:
AWS: X-Ray
Azure: Azure Monitor Logs
GCP: Cloud Trace
Storage and visualization:
Zipkin
Jaeger
Abstraction API:
OpenTelemetry
Micrometer Tracing
WHAT IS OPENTELEMETRY
vendor neutral observability framework
EVENTS
aka "true" observability
aka Observability 2.0
WHAT ARE EVENTS?
Events == spans
WHAT'S DIFFERENT?
Events are the fuel
Columnar storage and query engine are the superpower
They give ability to slice and dice data and discover unknown unknowns
UNKNOWN UNKNOWNS
Things we don't know we don't know
Example: Migrating to OpenTelemetry and from logs to traces
INSPIRATION
Scuba: Diving into Data at Facebook
THE O.G
How to Avoid Paying for Honeycomb
HONEYCOMB UI
THE NEW KID
Very comprehensive observability package
Distributed tracing module needs ClickHouse
COROOT DISTRIBUTED TRACING UI
DEBUGGING WITH HONEYCOMB
Based on Sandbox on their website
HEATMAP
SELECTION
BUBBLEUP
BUBBLEUP - BASELINE VS SELECTION
BUBBLEUP - ENHANCE!
BUBBLEUP - POSSIBILITIES
BUBBLEUP - CHOSEN ONE
BUBBLEUP - PARAMETER VALUE
BUBBLEUP - FILTERED HEATMAP
BUBBLEUP - SELECTING TRACE
CULPRIT TRACE
CULPRIT TRACE - ZOOM
CULPRIT TRACE - ZOOM
CULPRIT TRACE - SIMILAR SPANS
CULPRIT TRACE - SIMILAR SPANS
CULPRIT TRACE - DEPLOYMENT
CULPRIT TRACE - DEPLOYMENT
THINGS TO REMEMBER
Observability = iterative & continuous process
WHAT AFTER?
SLI and SLO
Alerting and on-call
THANK YOU!
QUESTIONS?
