From Cardinal(ity) Sins to Cost-Efficient Metrics Aggregation

chronosphere.io
Paige Cruz, retired SRE
open source observability advocate

CFO looking at the o11y bill

Cloud Native Observability bills are outrageous

Cloud Native Data Growth
[Figure: business increase in scale vs. observability data increase in scale across three eras: On-Premises (data center), 1998 - 2008; Cloud (IaaS, VM-based), 2008 - 2018; Cloud Native (microservices and containers), 2018 - ?]
*Source: ESG Distributed Cloud Series: Observability, Feb 2022, Scott Sinclair and Rob Strechay

Most recently [vendor] was looked at to help monitor a small Kubernetes test cluster. 3 nodes.
Now the base rate of $18/mo is fine…except now they charge $1 per container per month past 10 containers per host.
Since K8s (depending on how you install it) runs a bunch of little containers handling various back end things, you might not deploy anything to the cluster and still be WAY over that 10 container limit.
In our case it came out to like $200/mo to monitor 3 nodes - that were nowhere fully loaded.
- Hacker News thread

Data volume
Experiment:
- Hello World app on a 4-node Kubernetes cluster with Tracing, End User Metrics (EUM), Logs, Metrics (containers / nodes)
- 30 days == +450 GB

“1 in 10 metrics are actually directly queried”
- ServiceNow

Contributing Factors to the Metrics Bill
- # of containers and infra components: how many things you’re monitoring
- Metric Granularity: how often each metric is scraped
- Retention Window: how long you keep the data
- Cardinality: how many unique combos of dimensions on metrics
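To make the cardinality factor concrete, here is a minimal Python sketch that counts the unique label combinations (time series) a single metric can produce. The label names and values are hypothetical and chosen only for illustration; real numbers depend on your own instrumentation.

```python
from itertools import product

# Hypothetical label values for one metric (names and values are assumptions).
labels = {
    "service": ["checkout", "cart", "search"],
    "status_code": ["200", "404", "500"],
    "pod": [f"pod-{i}" for i in range(50)],
}


def worst_case_series(label_values):
    """Upper bound on series: the product of distinct values per label."""
    total = 1
    for values in label_values.values():
        total *= len(set(values))
    return total


def observed_series(samples):
    """Series actually seen: the number of unique label sets in the samples."""
    return len({tuple(sorted(s.items())) for s in samples})


# Every possible combination of the label values above.
all_combos = [dict(zip(labels, combo)) for combo in product(*labels.values())]

# The same metric without the high-cardinality "pod" label.
without_pod = {k: v for k, v in labels.items() if k != "pod"}

print(worst_case_series(labels))       # 3 * 3 * 50 = 450
print(observed_series(all_combos))     # 450
print(worst_case_series(without_pod))  # 3 * 3 = 9
```

In this made-up example, one label with 50 values accounts for a 50x difference in series count, which is why a single dimension can dominate the bill.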

# of containers and infra components
Cost of monitoring can be a factor in determining how quickly to deprecate or sunset features/services/environments

Metric Granularity
- Emission time = adjust scrape_interval (from 10s samples -> 30s samples)
- Ingest time = aggregate
- Over time post-storage = downsampling
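As a rough illustration of the granularity options above, the following Python sketch compares samples per day at two scrape intervals and downsamples stored points by averaging them into fixed windows. The function names are hypothetical, and averaging is only one possible roll-up (it suits gauges; counters are usually rolled up differently, for example by keeping the last value).

```python
def samples_per_day(scrape_interval_seconds):
    """Samples one series produces per day at a given scrape interval."""
    return 86_400 // scrape_interval_seconds


def downsample(samples, window_seconds):
    """Average (timestamp, value) samples into fixed-size windows.

    Returns a list of (window_start, mean_value) sorted by window_start.
    """
    buckets = {}
    for ts, value in samples:
        window_start = ts - (ts % window_seconds)
        buckets.setdefault(window_start, []).append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Six samples taken every 10 seconds, rolled up into 30-second windows.
raw = [(t, float(t)) for t in range(0, 60, 10)]
coarse = downsample(raw, 30)

print(samples_per_day(10))  # 8640
print(samples_per_day(30))  # 2880
print(coarse)               # [(0, 10.0), (30, 40.0)]
```

Moving from 10s to 30s samples cuts the data points per series to a third, whether that happens at emission time or later through downsampling.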

Retention Window
For operational metrics, most (99.9%) of queries do not look back past 7 days, but average retention at original granularity ranges from 2-4 weeks
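The gap between the 7-day query window and 2-4 weeks of full-resolution retention can be sized with simple arithmetic. This sketch assumes 1,000 series, a 30-second raw resolution, and a 5-minute downsampled tier; all three numbers are illustrative assumptions, not figures from the talk.

```python
SECONDS_PER_DAY = 86_400


def stored_samples(series, tiers):
    """Total samples kept, given tiers of (days, resolution_seconds)."""
    return sum(
        series * days * SECONDS_PER_DAY // resolution
        for days, resolution in tiers
    )


SERIES = 1_000  # assumed number of active series

# Four weeks at the original 30s granularity.
raw_only = stored_samples(SERIES, [(28, 30)])

# One week at 30s, then three weeks downsampled to 5 minutes.
tiered = stored_samples(SERIES, [(7, 30), (21, 300)])

print(raw_only)           # 80640000
print(tiered)             # 26208000
print(tiered / raw_only)  # 0.325
```

Under these assumptions, keeping only the queried week at full granularity stores roughly a third of the samples while still covering the same four-week window.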

Dropping Data
Low value tags or entire metrics should be dropped as early as possible
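A minimal Python sketch of what dropping a low-value tag does to series count: once the label is removed, series that differed only by that label collapse into one. The sample data and the choice to sum values are assumptions for illustration (summing fits counts and counters; other metric types may need a different aggregation), and in practice this step would happen in your collection pipeline rather than in application code.

```python
from collections import defaultdict


def drop_labels(samples, labels_to_drop):
    """Remove the given labels and sum values of series that become identical."""
    aggregated = defaultdict(float)
    for labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in labels_to_drop))
        aggregated[kept] += value
    return dict(aggregated)


# Hypothetical samples: (label set, value).
samples = [
    ({"service": "cart", "instance": "10.0.0.1"}, 5.0),
    ({"service": "cart", "instance": "10.0.0.2"}, 7.0),
    ({"service": "search", "instance": "10.0.0.3"}, 2.0),
]

result = drop_labels(samples, {"instance"})

print(len(samples))  # 3 series before
print(len(result))   # 2 series after
print(result[(("service", "cart"),)])  # 12.0
```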

“What is the value of this metric?”
- You, when auditing metrics

Auditing Your Metrics
● Scope what your team is responsible for
○ Filter queries with team:YOURS
● Identify easy wins: metrics that aren’t
○ In a monitor definition
○ Directly queried by end users
○ Powering charts for visited dashboards
● Identify labels that are unnecessary
○ e.g. the Prometheus instance label or instance_type
● Share your successes!
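The "easy wins" step above amounts to a set difference: everything you emit, minus anything referenced by a monitor, a direct query, or a visited dashboard. Here is a minimal sketch assuming you can export those four lists from your tooling; the function name and metric names are hypothetical.

```python
def unused_metrics(all_metrics, monitored, queried, dashboarded):
    """Metrics not referenced by any monitor, direct query, or visited dashboard."""
    in_use = set(monitored) | set(queried) | set(dashboarded)
    return sorted(set(all_metrics) - in_use)


# Hypothetical inventory for one team.
all_metrics = [
    "http_requests_total",
    "http_request_duration_seconds",
    "cache_evictions_total",
    "legacy_queue_depth",
]
monitored = ["http_requests_total"]
queried = ["http_request_duration_seconds"]
dashboarded = ["http_requests_total"]

candidates = unused_metrics(all_metrics, monitored, queried, dashboarded)
print(candidates)  # ['cache_evictions_total', 'legacy_queue_depth']
```

The output is a list of candidates to review with the owning team, not a list to delete blindly.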

CFO looking at the cost efficiency of metrics

Resources
- How Gloo uses the OTel Collector to drop metrics/labels and provide the Minimum Metrics Set
- How to drop and delete metrics in Prometheus
- How can recording and data roll-up rules help your metrics?
- Observability is Too Damn Expensive - DevOpsDays London

Catch up with me:
- Rescuing On-Call Engineers (send your manager)
- KubeCon OTel 101: Let’s Instrument! (tracing) workshop
- There’s No Place Like Production, Conf42 Incident Management
paigerduty@ chronosphere.io, hachyderm.io, LinkedIn
