How to improve database
observability?
@Charles_JUDITH
Paris Open Source Summit 2019
About me
● Senior Site Reliability Engineer at Criteo
● Working on monitoring topics for a few years
● Currently providing the (open source) database service
at Criteo
● @Charles_JUDITH on Twitter
Agenda
1. Context
2. First iteration
3. Second iteration
4. Next steps
5. Resources
Context
Goal
● Alerting
● No hidden issues
● An observable platform!
● The DBA team shouldn’t be a “blocker” for the users!
What is observability?
« OBSERVABILITY IS A MEASURE OF HOW WELL
INTERNAL STATES OF A SYSTEM CAN BE
INFERRED FROM KNOWLEDGE OF ITS EXTERNAL
OUTPUTS. »
SOURCE: WIKIPEDIA
My opinion about observability
● It’s not only about the tools
● It’s not a fancy name to say “monitoring”
● It’s more about “transparency”
Why does a system need to be observable?
● Is it working as expected by the users?
● To answer basic questions about your service/platform
● Increase the visibility for you and your users/customers
● Long-term trend analysis
● “If you can’t measure it, you can’t manage it”
Observability is fundamental for reliability
Analogy to Maslow’s hierarchy of needs
The observability effects
● Giving superpowers
● It’s like a roller coaster
● You need to be patient
Let’s go!
Metrics
How to start?
USE method
● USE was introduced by @brendangregg
● Utilization: disk, CPU usage …
● Saturation: disk I/O
● Errors: network interface errors
The four golden signals
● Introduced in the Google SRE book
● Latency: response time, queue/wait time
● Traffic: A measure of how much demand is being placed on the service
● Errors: The rate of requests that fail
● Saturation: How “full” is the service
RED method
● RED was introduced by @tom_wilkie
● (Request) Rate - the number of requests, per second, your services are serving.
● (Request) Errors - the number of failed requests per second.
● (Request) Duration - distributions of the amount of time each request takes.
● Subset of “The Four Golden Signals”
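The three RED values can be derived from a plain list of observed requests. The following is a minimal sketch, not the tooling used in the talk: the `Request` record, the `red_summary` function and the field names are all hypothetical, and duration is reduced to a median and a maximum rather than a full distribution.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    duration_s: float  # time spent serving the request, in seconds
    failed: bool       # True when the request ended in an error


def red_summary(requests, window_s):
    """Summarize requests observed over a window of `window_s` seconds."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    durations = [r.duration_s for r in requests]
    failures = sum(1 for r in requests if r.failed)
    return {
        # Rate: requests served per second
        "rate_per_s": len(requests) / window_s,
        # Errors: failed requests per second
        "errors_per_s": failures / window_s,
        # Duration: two simple points of the distribution
        "duration_median_s": median(durations) if durations else None,
        "duration_max_s": max(durations) if durations else None,
    }
```

In practice the duration would more likely be exported as a histogram so that percentiles can be computed over any time range.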
The seven golden signals
● CELT + USE introduced by @xaprb
● Concurrency: number of simultaneous requests
● Error rate
● Latency: response time
● Throughput: queries per second (QPS)
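As an illustration of the CELT part, the four values can be computed from a list of query executions. This is only a sketch under assumptions: the `(start_s, end_s, failed)` tuple layout and both function names are hypothetical, concurrency is reported as the peak number of overlapping queries, and latency as a simple mean.

```python
def peak_concurrency(intervals):
    """Return the largest number of (start, end) intervals open at once."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a query starts
        events.append((end, -1))    # a query ends
    # Tuples sort by time, then by delta, so at equal times an end (-1)
    # is processed before a start (+1): intervals are treated as half-open.
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def celt_summary(queries, window_s):
    """Summarize (start_s, end_s, failed) tuples seen over `window_s` seconds."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    latencies = [end - start for start, end, _ in queries]
    failures = sum(1 for _, _, failed in queries if failed)
    return {
        "concurrency_peak": peak_concurrency([(s, e) for s, e, _ in queries]),
        "error_rate": failures / len(queries) if queries else 0.0,
        "latency_mean_s": sum(latencies) / len(latencies) if latencies else None,
        "throughput_qps": len(queries) / window_s,
    }
```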
CASE method
● CASE was introduced by @gphat
● Context-heavy
● Actionable
● Symptom-based
● Evaluated
Preferred approach
● The seven golden signals
● Good to measure the service quality
● System and application metrics are valuable in our case
How to collect the metrics?
● collectd
● Node exporter
● MySQLD exporter
● Python MySQL plugin for collectd
● A few others
What to do with all these metrics?
● Pick some useful “indicators” like:
○ thread usage
○ service status
○ backup status, duration, size
○ replication lag
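Two of these indicators, thread usage and replication lag, can be turned into simple checks. The sketch below assumes a sample dictionary that has already been collected: `Threads_connected`, `max_connections` and `Seconds_Behind_Master` are existing MySQL names, while the `check_indicators` function and its default thresholds are hypothetical values chosen for illustration.

```python
def check_indicators(sample, max_thread_usage=0.8, max_lag_s=30):
    """Return a list of alert messages for one collected sample.

    `sample` is a plain dict; the thresholds are illustrative assumptions.
    """
    alerts = []

    # Thread usage: share of the connection limit currently in use.
    usage = sample["Threads_connected"] / sample["max_connections"]
    if usage >= max_thread_usage:
        alerts.append(f"thread usage at {usage:.0%} of max_connections")

    # Replication lag: only checked when the sample comes from a replica.
    if "Seconds_Behind_Master" in sample:
        lag = sample["Seconds_Behind_Master"]
        if lag is None:
            # MySQL reports NULL here when replication is not running.
            alerts.append("replication lag unknown, replication may be stopped")
        elif lag > max_lag_s:
            alerts.append(f"replication lag is {lag}s")

    return alerts
```

A check on thread usage of this kind is what would give early warning of the "Max connection reached" situation.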
How to show/use those
metrics?
Global overview
InnoDB metrics
Simple user view
USE dashboard
Disk partition full with
tmp_table
Max connection reached
Database cleaning and
optimize table
DATABASES EXPOSE LOTS OF METRICS ABOUT
THEIR STATUS, BUT MUCH LESS ABOUT THE
DETAILS OF THEIR WORKLOAD.
“WE THINK OUR DATABASE IS SLOW?”
“Last week we noticed that
the database was slow.”
#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo
Logs
Logs
● Log all the SQL queries (general log)
● Install an agent to ship those logs with “custom fields”
● Make the logs available for our users
Logs
● Log all the SQL queries (general log)
● Install an agent to ship those logs with “custom fields”
● Configure MySQL/MariaDB to log the slow queries
● Use Rsyslog with a custom template!
● Make the logs available for our users
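To show what shipping slow queries with custom fields can involve, here is a small Python stand-in that parses the statistics line of a slow query log entry and emits JSON. It is not the Rsyslog template mentioned above: the `slow_log_line_to_json` function, the output field names and the custom fields are hypothetical, and the regular expression assumes the usual `# Query_time: ... Lock_time: ... Rows_sent: ... Rows_examined: ...` layout.

```python
import json
import re

# Matches the statistics line of a slow query log entry (assumed layout).
QUERY_TIME_RE = re.compile(
    r"^# Query_time:\s*(?P<query_time>[\d.]+)\s+"
    r"Lock_time:\s*(?P<lock_time>[\d.]+)\s+"
    r"Rows_sent:\s*(?P<rows_sent>\d+)\s+"
    r"Rows_examined:\s*(?P<rows_examined>\d+)"
)


def slow_log_line_to_json(line, custom_fields):
    """Return a JSON string for a statistics line, or None for other lines."""
    match = QUERY_TIME_RE.match(line)
    if match is None:
        return None
    record = {
        "query_time_s": float(match["query_time"]),
        "lock_time_s": float(match["lock_time"]),
        "rows_sent": int(match["rows_sent"]),
        "rows_examined": int(match["rows_examined"]),
    }
    # Custom fields (for example a cluster name) are added to every record.
    record.update(custom_fields)
    return json.dumps(record, sort_keys=True)
```

A real shipper also has to group the multi-line entry (timestamp, user, statistics, SQL text) into one event, which is part of why shipping slow queries is harder than shipping single-line logs.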
Conclusions
● The DBA is not a blocker for the developers
● Visibility and transparency into the database service
● Happy customers/developers/users
● Effective monitoring
● Shipping slow queries is not easy
● In our case, metrics and logs are a good combo, but we want more!
Next steps
● Continue to improve the SQL logging
● Make better use of sys_schema
● Metrics per database
● Publish the SLA
● Open source our probe for MySQL/MariaDB
Resources
https://github.com/CharlesJUDITH/database-observability-toolkit
Thank you!
