Beginner's Guide to Observability@Devoxx PL 2024

BEGINNER'S GUIDE TO
OBSERVABILITY
Michał Niczyporuk
@mihn

WHAT IS
OBSERVABILITY?

OBSERVABILITY
Americanized: Observability -> o11y

THE DEFINITION
“In control theory, observability is a
measure of how well internal states of a
system can be inferred from knowledge of
its external outputs”
Wikipedia

WHAT FOR?
Way of determining state of the system
Observe trends
Spot anomalies
Debug errors
Gather data to support decision process
Measure user experience

LOGGING

LOGGING PRIME DIRECTIVE
Don't use System.out.println()

DEFINE LOGGER
public class UserService {
    private static final org.slf4j.Logger LOGGER =
        org.slf4j.LoggerFactory.getLogger(UserService.class);
    (...)
}

USE LOGGER
private int myMethod(int x) {
    LOGGER.info("running my great code with x={}", x);
    int y = methodX(x);
    LOGGER.info("my great code finished with y={}", y);
    return y;
}
2023-03-21 19:24:43,829 [main] INFO UserService - running my gre
2023-03-21 19:24:43,829 [main] INFO UserService - my great code

"TALKING TO A VOID" DEBUGGING
private void thisCallsMethodX() {
    LOGGER.info("PLEASE WORK");
    methodY();
    LOGGER.info("SHOULD HAVE WORKED");
}

DOS
Log meaningful checkpoints
Use correct logging levels
Add ids to logging context:
e.g. correlation id, user id
High-throughput = async appender
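
The "add ids to logging context" point can be illustrated with a minimal sketch. It is a stdlib-only stand-in, not SLF4J's API: in an SLF4J setup this role is normally played by the MDC. The class `LogContextSketch` and its methods `with` and `decorate` are hypothetical names invented for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stdlib-only stand-in for a logging context (not SLF4J's API).
final class LogContextSketch {
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(LinkedHashMap::new);

    // A closeable handle whose close() throws no checked exception.
    interface Scope extends AutoCloseable {
        @Override
        void close();
    }

    // Adds an id for the current thread; closing the scope removes it again.
    static Scope with(String key, String value) {
        CONTEXT.get().put(key, value);
        return () -> CONTEXT.get().remove(key);
    }

    // Prefixes the message with the current context, if there is any.
    static String decorate(String message) {
        Map<String, String> ctx = CONTEXT.get();
        return ctx.isEmpty() ? message : ctx + " " + message;
    }
}
```

Used in a try-with-resources block, every message decorated inside the block carries the id, and the id disappears once the block ends, so it cannot leak into the next request handled by the same thread.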

DO NOTS🍩
Logs can be lost
Logs are not audit trail
Logs are not metrics
Logs are not data warehouse/lake
Alerts based on logs are flimsy - metrics-based alerts
are better

WHAT NOT TO LOG
personal information, secrets, session tokens etc.
OWASP Logging Cheat Sheet (thanks @piotrprz)

CENTRALIZED LOGS

CENTRAL STORAGE
Software that can ingest, process and search the logs

LOGS FORMAT
Unified format (e.g. JSON, logfmt, OpenTelemetry Protocol (OTLP), vendor-specific)
regexes = two problems
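
A unified format means the central storage can split a line into fields without per-service regexes. A minimal sketch of a logfmt-style formatter follows, assuming simple key=value pairs with quoting for values that contain spaces, quotes or `=`. `LogfmtSketch` is a hypothetical helper, not part of any logging library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: renders fields as a single logfmt-style line.
final class LogfmtSketch {
    static String format(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(e.getKey()).append('=').append(quoteIfNeeded(e.getValue()));
        }
        return sb.toString();
    }

    // Quote values that would otherwise be ambiguous to a key=value parser.
    private static String quoteIfNeeded(String v) {
        boolean needsQuotes = v.isEmpty() || v.contains(" ") || v.contains("\"") || v.contains("=");
        if (!needsQuotes) {
            return v;
        }
        return "\"" + v.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

In practice the logging framework's encoder or layout would produce such output; the sketch only shows what the resulting line looks like.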

TOOLING
Cloud providers:
AWS: CloudWatch Logs
Azure: Monitor Logs
GCP: Cloud Logging
Elastic Stack = Elasticsearch + Logstash/Filebeat + Kibana
Graylog
Grafana Loki

METRICS

METRICS
numbers
indexed over time
and multiple additional dimensions
=== timeseries

FORMAT
Storage specific
Vendor neutral: OpenTelemetry

COUNTER
Used for: requests count, jobs count, errors

GAUGE
Used for memory usage, thread/connection pools
count, etc.
Tip: Store actual current and maximum value, not the
percentages
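
A minimal sketch of the counter/gauge distinction and of the tip about raw values, assuming a hypothetical connection pool. `PoolMetricsSketch` and its method names are invented for this example; a real application would register these through a metrics library such as Micrometer.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical holder for one counter and two gauges of a connection pool.
final class PoolMetricsSketch {
    final LongAdder requestsTotal = new LongAdder();             // counter: only goes up
    final AtomicInteger activeConnections = new AtomicInteger(); // gauge: goes up and down
    final int maxConnections;                                    // gauge: exported as-is

    PoolMetricsSketch(int maxConnections) {
        this.maxConnections = maxConnections;
    }

    void onRequest() { requestsTotal.increment(); }
    void onAcquire() { activeConnections.incrementAndGet(); }
    void onRelease() { activeConnections.decrementAndGet(); }

    // Derived from the two raw gauges when needed, rather than stored as a percentage.
    double utilization() {
        return (double) activeConnections.get() / maxConnections;
    }
}
```

Exporting the current and maximum values separately keeps the ratio recomputable later, even if the pool size changes.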

HISTOGRAM
used for request latencies and sizes
can calculate any percentiles via query
tip: preconfigured bucket sizes = better performance
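
How a bucketed histogram answers percentile queries can be sketched as follows. This is a simplified illustration, assuming fixed upper bounds chosen up front and a query that returns only the upper bound of the matching bucket; real backends usually interpolate within the bucket. `HistogramSketch` is a hypothetical class.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical fixed-bucket histogram; bounds are inclusive upper limits.
final class HistogramSketch {
    private final double[] upperBounds;   // sorted ascending, e.g. latency limits in ms
    private final AtomicLongArray counts; // one extra slot for the overflow bucket

    HistogramSketch(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        this.counts = new AtomicLongArray(upperBounds.length + 1);
    }

    void record(double value) {
        int i = 0;
        while (i < upperBounds.length && value > upperBounds[i]) {
            i++;
        }
        counts.incrementAndGet(i);
    }

    // Upper bound of the bucket that contains the q-th quantile (0 < q <= 1).
    double quantileUpperBound(double q) {
        long total = 0;
        for (int i = 0; i < counts.length(); i++) {
            total += counts.get(i);
        }
        if (total == 0) {
            return Double.NaN;
        }
        long rank = (long) Math.ceil(q * total);
        long seen = 0;
        for (int i = 0; i < counts.length(); i++) {
            seen += counts.get(i);
            if (seen >= rank) {
                return i < upperBounds.length ? upperBounds[i] : Double.POSITIVE_INFINITY;
            }
        }
        return Double.POSITIVE_INFINITY;
    }
}
```

Recording is a cheap counter increment, and any percentile can be estimated afterwards from the same counts, with precision limited by the bucket boundaries.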

SUMMARY
Histogram - but with percentiles (e.g. p50, p75, p95)
Calculated client side - lighter than histograms
Caveat: calculated per scrape target

METRIC RESOLUTION
Main dimension
How often metrics are probed
Long-term storage = Down sampling/roll up
1 min resolution for up to 7 days -> 10 min for up to 30 days, etc.
YMMV - think about business cycles (e.g. Black Friday)
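
Down sampling can be sketched as averaging consecutive groups of samples. This is a minimal illustration, assuming evenly spaced samples and a plain average as the roll-up function; `RollupSketch` is a hypothetical helper, and real storage engines typically offer several aggregation functions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical roll-up: averages each group of `factor` consecutive samples.
final class RollupSketch {
    static List<Double> downsample(List<Double> samples, int factor) {
        if (factor <= 0) {
            throw new IllegalArgumentException("factor must be positive");
        }
        List<Double> out = new ArrayList<>();
        for (int start = 0; start < samples.size(); start += factor) {
            int end = Math.min(start + factor, samples.size());
            double sum = 0;
            for (int i = start; i < end; i++) {
                sum += samples.get(i);
            }
            out.add(sum / (end - start)); // a trailing partial group is averaged too
        }
        return out;
    }
}
```

With ten 1-minute samples of 10 containing a single spike of 100, the 10-minute value becomes 19: the spike is still visible but flattened, which is one reason to think about what resolution important periods need.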

METRIC TAGS/LABELS
Additional dimensions
Metadata: HTTP attributes, service name, hosts, cloud
regions, etc.
Metrics labels don't like high cardinality
route="/users/:id" ✅
route="/users/2137" ❌
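
One way to keep label cardinality low is to record the route template rather than the raw path. The sketch below assumes ids are purely numeric path segments, which will not hold for every API; `RouteLabelSketch` is a hypothetical helper, and many web frameworks already expose the matched route template, which is preferable when available.

```java
// Hypothetical helper: collapses numeric path segments into a placeholder.
final class RouteLabelSketch {
    static String normalize(String path) {
        // A slash followed only by digits, up to the next slash or end of path.
        return path.replaceAll("/\\d+(?=/|$)", "/:id");
    }
}
```

For example, `normalize("/users/2137")` yields `/users/:id`, so all users share one time series instead of creating one series per user.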

WHAT TO MEASURE?
Kirk Pepperdine's The Box model
Your framework/libraries cover some basics already

KIRK PEPPERDINE'S THE BOX
"People": incoming requests, messages, jobs...
Application (Netty, Tomcat, Spring, etc.): thread pools,
queue sizes, requests...
Runtime metrics (JVM): GC, memory usage, threads...
Hardware: CPU, memory, I/O, disk usage

HOW TO MEASURE?
The Four Golden Signals:
Latency
Traffic
Errors
Saturation

MEASUREMENT METHODS
RED vs USE

RED
Rate
Errors
Duration
Application focused

USE
Utilization
Saturation
Errors
Infrastructure focused

ONE METRIC TO RULE THEM ALL
APDEX
https://www.apdex.org/
More user experience focused - measures satisfaction
Single number

APDEX - SPLITTING THE POPULUS
Minimal sample size 100 - adjust time window
Measure as close to the user as possible

APDEX - CALCULATIONS
Result is a decimal between 0 and 1
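
A minimal sketch of the calculation, following the Apdex definition rather than anything shown on the slides: with a target threshold T, samples at or below T count as satisfied, samples up to 4T count as tolerating (weighted by one half), and the rest as frustrated. `ApdexSketch` is a hypothetical class, and it does not enforce the minimal sample size mentioned above.

```java
// Hypothetical calculator: Apdex = (satisfied + tolerating / 2) / total.
final class ApdexSketch {
    static double score(long[] responseTimesMs, long thresholdMs) {
        if (responseTimesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long satisfied = 0;
        long tolerating = 0;
        for (long t : responseTimesMs) {
            if (t <= thresholdMs) {
                satisfied++;
            } else if (t <= 4 * thresholdMs) {
                tolerating++;
            }
            // anything slower is "frustrated" and contributes nothing
        }
        return (satisfied + tolerating / 2.0) / responseTimesMs.length;
    }
}
```

With T = 500 ms and ten samples of which five are satisfied, three tolerating and two frustrated, the score is (5 + 1.5) / 10 = 0.65.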

APDEX - INTERPRETING THE RESULT
0.94 ≤ X ≤ 1 : excellent
0.85 ≤ X ≤ 0.93 : good
0.70 ≤ X ≤ 0.84 : fair
0.50 ≤ X ≤ 0.69 : poor
0.00 ≤ X ≤ 0.49 : unacceptable

APDEX - CAVEATS
Very generic metric - hides details
Should be monitored closely after deployments
Should measure one functionality
Shouldn't be the only metric of application success

PERCENTILES
Latency distribution is not normal/Gaussian
Use high percentiles (p99+) - not averages (means) or medians (p50)
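
Why averages and medians hide the slow tail can be shown with a small sketch. It uses the nearest-rank percentile method on made-up latencies; `PercentileSketch` is a hypothetical helper, not how a metrics backend computes percentiles.

```java
import java.util.Arrays;

// Hypothetical helper comparing the mean with nearest-rank percentiles.
final class PercentileSketch {
    // Smallest value such that at least p percent of samples are at or below it.
    static long percentile(long[] values, double p) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    static double mean(long[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }
}
```

For 98 requests at 10 ms and 2 requests at 2000 ms, the median is 10 ms and the mean is 49.8 ms, while p99 is 2000 ms: only the high percentile shows what the unlucky users experienced.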

COORDINATED OMISSION
1. Server measures in the wrong place
2. Requests can be stuck waiting for processing
Measure both clients and servers - if possible
"How not to measure latency" by Gil Tene

DEATH BY METRICS
Storing an unnecessary amount of metrics
Made possible by automations
Bad for infrastructure and cloud bills
Bad for your mental health

ASSUMPTIONS ABOUT METRICS
WRONG
1. Metrics can't be lost
2. Metrics are precise

TOOLS
Cloud providers:
AWS: CloudWatch Metrics
Azure: Monitor Metrics
GCP: Cloud Monitoring
OSS:
Prometheus + Grafana
Thanos
VictoriaMetrics
Elastic Stack
Libraries:
Micrometer - API abstraction
OpenTelemetry

DISTRIBUTED TRACING

TRACING
[Diagram: Internet; A, B, C; every span below has traceID = X]
span ID = A: HTTP Request processed
span ID = B, parent span ID = A: HTTP Request processed
span ID = C, parent span ID = B: HTTP Request processed
span ID = D, parent span ID = C: db query executed
span ID = E, parent span ID = A: HTTP Request processed
span ID = F, parent span ID = E: HTTP Request processed
span ID = G, parent span ID = F: redis query executed

Single trace = multiple spans
Each span contains
Trace ID
Span ID
timestamp and duration
Parent span ID, if applicable
All the metadata with any cardinality

TRACING IDS FORMAT
Propagated in headers (e.g. HTTP) or in metadata/attributes (e.g. Kafka)
Newer: W3C Trace Context
Older: Zipkin's B3
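
For the W3C Trace Context format, the `traceparent` header carries a version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID and 2 hex digits of flags, separated by dashes. The sketch below parses and formats version `00` only. `TraceparentSketch` and `TraceContext` are hypothetical names; instrumentation libraries normally handle this for you.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser/formatter for a version-00 traceparent header.
final class TraceparentSketch {
    private static final Pattern FORMAT =
            Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    static final class TraceContext {
        final String traceId;
        final String parentSpanId;
        final boolean sampled;

        TraceContext(String traceId, String parentSpanId, boolean sampled) {
            this.traceId = traceId;
            this.parentSpanId = parentSpanId;
            this.sampled = sampled;
        }
    }

    static Optional<TraceContext> parse(String header) {
        Matcher m = FORMAT.matcher(header);
        if (!m.matches()) {
            return Optional.empty();
        }
        String traceId = m.group(1);
        String spanId = m.group(2);
        // All-zero ids are invalid according to the specification.
        if (traceId.equals("0".repeat(32)) || spanId.equals("0".repeat(16))) {
            return Optional.empty();
        }
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) != 0;
        return Optional.of(new TraceContext(traceId, spanId, sampled));
    }

    static String format(TraceContext ctx) {
        return "00-" + ctx.traceId + "-" + ctx.parentSpanId + "-" + (ctx.sampled ? "01" : "00");
    }
}
```

A service receiving the header keeps the trace ID, records the incoming span ID as its parent, and sends a new span ID downstream.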

TRACE PROPAGATION
Trace ID/Span ID propagation
exporting spans to storage

TRACE PROPAGATION IN SERVICES
your stack might be already doing that for you

TRACE PROPAGATION IN INFRASTRUCTURE

HEAD SAMPLING VS TAIL SAMPLING

HEAD SAMPLING
Decision made without looking at whole trace
Random sample of traffic - law of large numbers makes it viable
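
A minimal sketch of a head sampler, assuming trace IDs are random hex strings of at least 16 characters. Deriving the decision from the trace ID, rather than from a fresh random number, is one common way to make every service reach the same decision for the same trace. `HeadSamplerSketch` is a hypothetical class, not a library API.

```java
// Hypothetical ratio-based head sampler keyed on the trace ID.
final class HeadSamplerSketch {
    static boolean shouldSample(String traceId, double ratio) {
        // Treat the low 64 bits of the hex trace ID as a pseudo-random number.
        long low = Long.parseUnsignedLong(traceId.substring(traceId.length() - 16), 16);
        long bucket = Long.remainderUnsigned(low, 10_000); // 0..9999
        return bucket < Math.round(ratio * 10_000);
    }
}
```

The decision needs nothing but the trace ID, so it can be made when the first span starts, before anything is known about errors or latency.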

TAIL SAMPLING
Decision made after considering all or most spans from
a trace
Allows for custom behaviour

THROUGHPUT-BASED TAIL SAMPLING
samples up to defined number of spans per second

RULES-BASED TAIL SAMPLING
Storing traces matching custom policies
custom policies: spans with errors, slow traces, ignore
healthchecks/metric endpoints/websockets, etc.
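
The custom policies above can be sketched as a predicate over a complete trace. This is an illustration under several assumptions: the `Span` fields, the `/health` route name and the use of the longest span as a stand-in for trace duration are all hypothetical simplifications; a real setup would typically configure such policies in the collector.

```java
import java.util.List;

// Hypothetical rules: drop healthchecks, keep traces with errors or slow spans.
final class TailSamplingSketch {
    static final class Span {
        final String route;
        final long durationMs;
        final boolean error;

        Span(String route, long durationMs, boolean error) {
            this.route = route;
            this.durationMs = durationMs;
            this.error = error;
        }
    }

    static boolean keep(List<Span> trace, long slowThresholdMs) {
        // Traces made only of healthcheck spans are noise (empty traces are dropped too).
        if (trace.stream().allMatch(s -> s.route.equals("/health"))) {
            return false;
        }
        boolean hasError = trace.stream().anyMatch(s -> s.error);
        // Simplification: the longest span approximates the whole trace duration.
        long longest = trace.stream().mapToLong(s -> s.durationMs).max().orElse(0);
        return hasError || longest >= slowThresholdMs;
    }
}
```

Unlike head sampling, this decision can only be made once the spans of the trace have been collected, which is why it needs buffering somewhere in the pipeline.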

DYNAMIC TAIL SAMPLING
Aims to have a representation of span attribute values
among collected traces in a timeframe
example: store 1 in 100 traces, but with representation
of values in attributes http.status_code,
http.method, http.route and service.name

MIXING SAMPLING METHODS
Sampling methods can and should be mixed

MISSING INSTRUMENTATIONS
investigation needed
add missing instrumentation
wrap with custom span

MISSING SPANS
tools are smart enough to detect that
buffers for spans are too small
...or you hit a rate limiter somewhere

LAST TIP
Put trace and span IDs in logs

TOOLS
Cloud providers:
AWS: X-Ray
Azure: Azure Monitor Logs
GCP: Cloud Trace
Storage and visualization:
Zipkin
Jaeger
Abstraction API:
OpenTelemetry
Micrometer Tracing

WHAT IS OPENTELEMETRY
vendor neutral observability framework

EVENTS
aka "true" observability
aka Observability 2.0

WHAT ARE EVENTS?
Events == spans

WHAT'S DIFFERENT?
Events are the fuel
Columnar storage and query engine are the
superpower
They give ability to slice and dice data and discover
unknown unknowns

UNKNOWN UNKNOWNS
Things we don't know we don't know
Example: Migrating to OpenTelemetry and from logs to
traces

INSPIRATION
Scuba: Diving into Data at Facebook

How to Avoid Paying for Honeycomb

THE NEW KID
Very comprehensive observability package
Distributed tracing module needs ClickHouse

DEBUGGING WITH HONEYCOMB
Based on Sandbox on their website

BUBBLEUP - BASELINE VS SELECTION

THINGS TO REMEMBER
Observability = iterative & continuous process

WHAT AFTER?
SLI and SLO
Alerting and on-call

THANK YOU!
QUESTIONS?