#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo

How to improve database observability?
@Charles_JUDITH
Paris Open Source Summit 2019

About me
● Senior Site Reliability Engineer at Criteo
● Working on monitoring topics for a few years
● Currently providing the (open source) database service
at Criteo
● @Charles_JUDITH on Twitter

Agenda
1. Context
2. First iteration
3. Second iteration
4. Next steps
5. Resources

Goal
● Alerting
● No hidden issues
● An observable platform!
● The DBA team shouldn’t be a “blocker” for the users!

« OBSERVABILITY IS A MEASURE OF HOW WELL
INTERNAL STATES OF A SYSTEM CAN BE
INFERRED FROM KNOWLEDGE OF ITS EXTERNAL
OUTPUTS. »
SOURCE: WIKIPEDIA

My opinion about observability
● It’s not only about the tools
● It’s not a fancy name for “monitoring”
● It’s more about “transparency”

Why does a system need to be observable?
● Is it working as expected by the users?
● To answer basic questions about your service/platform
● Increase the visibility for you and your users/customers
● Long-term trend analysis
● “If you can’t measure it, you can’t manage it”

Observability is fundamental for reliability
Analogy to Maslow’s hierarchy of needs

The observability effects
● Giving superpowers
● It’s like a roller coaster
● You need to be patient

USE method
● USE was introduced by @brendangregg
● Utilization: disk, CPU usage, …
● Saturation: disk I/O
● Errors: network interface errors

The four golden signals
● Introduced in the Google SRE book
● Latency: response time, queue/wait time
● Traffic: A measure of how much demand is being placed on the service
● Errors: The rate of requests that fail
● Saturation: How “full” is the service

RED method
● RED was introduced by @tom_wilkie
● (Request) Rate - the number of requests, per second, your services are serving.
● (Request) Errors - the number of failed requests per second.
● (Request) Duration - distributions of the amount of time each request takes.
● Subset of “The Four Golden Signals”
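As an illustration of the RED method above, here is a minimal Python sketch that derives the three signals from a batch of request samples collected over a fixed window. The `Request` record, the `red_summary` function and the choice of p50/p95 percentiles are assumptions made for this example; they are not taken from the talk.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    failed: bool        # True if the request ended in an error


def red_summary(requests, window_seconds):
    """Compute Rate, Errors and Duration for one observation window.

    Needs at least two requests, because statistics.quantiles
    cannot work on a single data point.
    """
    durations = sorted(r.duration_ms for r in requests)
    # 99 cut points: cuts[49] is the median, cuts[94] the 95th percentile.
    cuts = quantiles(durations, n=100, method="inclusive")
    return {
        "rate_per_s": len(requests) / window_seconds,
        "errors_per_s": sum(r.failed for r in requests) / window_seconds,
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
    }


# Hypothetical sample: 10 requests in a 5 second window, 2 of them failed.
sample = [Request(duration_ms=10.0 * i, failed=(i > 8)) for i in range(1, 11)]
summary = red_summary(sample, window_seconds=5)
```

Duration is reported as percentiles rather than a mean because the method asks for the distribution of request times, and a mean hides slow outliers.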

The seven golden signals
● CELT + USE introduced by @xaprb
● Concurrency: number of simultaneous requests
● Error rate
● Latency: response time
● Throughput: queries per second (QPS)
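The four CELT signals listed above can be derived from three counters sampled over a window. The sketch below is only an illustration: the function name `celt` and its parameters are hypothetical, and concurrency is estimated with Little's law (average concurrency = throughput × mean latency) rather than measured directly, which a database usually allows (for example through its count of running threads).

```python
def celt(queries, errors, total_latency_s, window_s):
    """Estimate Concurrency, Error rate, Latency and Throughput.

    queries         -- number of queries completed in the window
    errors          -- number of those queries that failed
    total_latency_s -- sum of the response times of all queries, in seconds
    window_s        -- length of the observation window, in seconds
    """
    throughput = queries / window_s
    mean_latency = total_latency_s / queries if queries else 0.0
    return {
        # Little's law: average requests in flight = arrival rate * time in system
        "concurrency": throughput * mean_latency,
        "error_rate": errors / queries if queries else 0.0,
        "latency_s": mean_latency,
        "throughput_qps": throughput,
    }


# Hypothetical window: 6000 queries in 60 s, 30 errors, 120 s of cumulated latency.
signals = celt(queries=6000, errors=30, total_latency_s=120.0, window_s=60)
```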

CASE method
● CASE was introduced by @gphat
● Context-heavy
● Actionable
● Symptom-based
● Evaluated

Preferred approach
● The seven golden signals
● Good to measure the service quality
● System and application metrics are valuable in our case

How to collect the metrics?
● Collectd
● Node exporter
● MySQLD exporter
● Python MySQL plugin for CollectD
● A few others

What to do with all these metrics?
● Pick some useful “indicators” like:
○ thread usage
○ service status
○ backup status, duration, size
○ replication lag
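Two of the indicators above, thread usage and replication lag, could be computed as in the following sketch. It assumes the raw values were already collected into dictionaries shaped like the output of `SHOW GLOBAL STATUS`, `SHOW GLOBAL VARIABLES` and `SHOW SLAVE STATUS`. The function names and the thresholds (80 % thread usage, 60 s of lag) are hypothetical, not values from the talk.

```python
def thread_usage(status, variables):
    """Fraction of the allowed connections currently in use (0.0 to 1.0)."""
    return int(status["Threads_connected"]) / int(variables["max_connections"])


def replication_lag_seconds(replica_status):
    """Lag in seconds, or None when the server reports NULL (replication not running)."""
    lag = replica_status.get("Seconds_Behind_Master")
    return None if lag is None else int(lag)


def evaluate(status, variables, replica_status,
             max_thread_usage=0.8, max_lag_s=60):
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    usage = thread_usage(status, variables)
    if usage > max_thread_usage:
        problems.append(f"thread usage at {usage:.0%}")
    lag = replication_lag_seconds(replica_status)
    if lag is None:
        problems.append("replication is not running")
    elif lag > max_lag_s:
        problems.append(f"replication lag is {lag}s")
    return problems


# Hypothetical values, as strings because that is how the server returns them.
problems = evaluate(
    status={"Threads_connected": "90"},
    variables={"max_connections": "100"},
    replica_status={"Seconds_Behind_Master": "120"},
)
```

Turning raw counters into a short list of named problems keeps the alert readable for users who are not database specialists.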

How to show/use those metrics?

Disk partition full with
tmp_table

Database cleaning and
optimize table

DATABASES EXPOSE LOTS OF METRICS ABOUT
THEIR STATUS, BUT MUCH LESS ABOUT THE
DETAILS OF THEIR WORKLOAD.

“WE THINK OUR DATABASE IS SLOW?”
“Last week we noticed that
the database was slow.”

Logs
● Log all the SQL queries (general log)
● Install an agent to ship those logs with “custom fields”
● Configure MySQL/MariaDB to log the slow queries
● Use Rsyslog with a custom template!
● Make the logs available for our users
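To give an idea of what shipping slow queries with “custom fields” involves, here is a minimal Python sketch that turns slow query log text into structured events. It assumes the usual MySQL slow log layout, where each entry has a `# Query_time: … Lock_time: … Rows_sent: … Rows_examined: …` header line; MariaDB and other versions may add or reorder fields. The function `parse_slow_log` and the custom field names are hypothetical, and this is not the rsyslog template used at Criteo.

```python
import re

# Header line written before each statement in the slow query log.
HEADER = re.compile(
    r"# Query_time: (?P<query_time>[\d.]+)\s+"
    r"Lock_time: (?P<lock_time>[\d.]+)\s+"
    r"Rows_sent: (?P<rows_sent>\d+)\s+"
    r"Rows_examined: (?P<rows_examined>\d+)"
)


def parse_slow_log(text, custom_fields):
    """Return one dict per slow query, enriched with the given custom fields."""
    events, current = [], None
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            current = {
                **custom_fields,
                "query_time_s": float(match["query_time"]),
                "lock_time_s": float(match["lock_time"]),
                "rows_sent": int(match["rows_sent"]),
                "rows_examined": int(match["rows_examined"]),
                "sql": "",
            }
            events.append(current)
        elif current is None or line.startswith(("#", "SET timestamp=")):
            continue  # other header lines and the timestamp marker are skipped
        else:
            # A statement may span several lines: join them with spaces.
            current["sql"] = (current["sql"] + " " + line.strip()).strip()
    return events


SAMPLE = """\
# Time: 2019-12-10T10:15:30.123456Z
# User@Host: app[app] @ web01 [10.0.0.5]  Id:    42
# Query_time: 3.200000  Lock_time: 0.000100 Rows_sent: 10  Rows_examined: 500000
SET timestamp=1575972930;
SELECT * FROM orders
WHERE status = 'open';
# Time: 2019-12-10T10:16:02.000001Z
# User@Host: app[app] @ web02 [10.0.0.6]  Id:    43
# Query_time: 1.500000  Lock_time: 0.000000 Rows_sent: 1  Rows_examined: 20000
SET timestamp=1575972962;
SELECT COUNT(*) FROM users;
"""

events = parse_slow_log(SAMPLE, {"datacenter": "par", "cluster": "orders-db"})
```

Multi-line statements are the part that makes this harder than shipping ordinary one-line logs: the agent has to know where one entry ends and the next begins.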

Conclusions
● The DBA is not a blocker for the developers
● Visibility and transparency on the database service
● Happy customers/developers/users
● Effective monitoring
● Shipping slow queries is not easy
● In that case, metrics and logs are a good combo, but we want more!

Next steps
● Continue to improve the SQL logging
● Leverage the usage of sys_schema
● Metrics per database
● Publish the SLA
● Open source our probe for MySQL/MariaDB

Resources
https://github.com/CharlesJUDITH/database-observability-toolkit
