https://tech.showmax.com @ShowmaxDevs
PostgreSQL Monitoring
Using modern software stacks
Roman Fišer
About me
● Roman Fišer
● Head of Infrastructure
● Showmax Engineering
● roman.fiser@showmax.com
What is Showmax?
● VOD (Video On Demand) service
● Focusing on clients in Africa
● Engineering in Prague
● Based on open source technologies
Open Source
PostgreSQL
● Cluster management handled by Patroni
● High availability
■ Automatic master/slave election
■ Auto-failover
● Streaming replication
● Written in Python & open source
■ Multiple patches to upstream
● Secrets management with Vault
● Barman for backups
PostgreSQL @ Showmax
● Backend - data store for Showmax business entities
■ Microservice architecture, each DB has a RESTful microservice
● CMS
■ CMS data stored in PostgreSQL (then denormalized to Elastic)
■ Cache invalidations
● Analytics DWH
■ Copy over from all databases
■ Events digestion
What type of metrics are important?
● Four SRE Golden Signals
● Latency
■ The time it takes to service a request
● Traffic
■ A measure of how much demand is being placed on your system
● Errors
■ The rate of requests that fail
● Saturation
■ How "full" your service is
Latency
● pg_stat_statements
● Average call time
● Maximum call time
● Replication delay
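pg_stat_statements keeps cumulative counters, so an average call time for a recent window has to be derived from two snapshots. The following is a minimal Python sketch of that arithmetic, assuming each snapshot carries the cumulative number of calls and the cumulative execution time in milliseconds. `StatementSnapshot` and `interval_mean_ms` are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional


class StatementSnapshot(NamedTuple):
    """One reading of the cumulative counters for a single query (assumed shape)."""
    calls: int            # cumulative number of executions
    total_time_ms: float  # cumulative execution time in milliseconds


def interval_mean_ms(prev: StatementSnapshot, curr: StatementSnapshot) -> Optional[float]:
    """Average call time between two snapshots, or None if it cannot be computed."""
    delta_calls = curr.calls - prev.calls
    if delta_calls <= 0:
        # No new calls, or the statistics were reset between the snapshots.
        return None
    return (curr.total_time_ms - prev.total_time_ms) / delta_calls
```

Returning None for a non-positive call delta keeps a statistics reset from producing a negative or misleading average.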
Traffic
● pg_stat_statements
● Calls
● Returned rows
● Network traffic
● System IO Statistics (IOPS, traffic)
Errors
● Rollback / Commit ratio
● Deadlocks
Saturation
● num_backends / max_connections
● High IO utilization (iowait, await)
● High CPU utilization
● Checkpoints
● Tempfile usage
● Disk usage (free space, inodes)
Other important metrics
● Long-running idle-in-transaction sessions
● Blocked autovacuum
● Error events in PostgreSQL log
● Server crashes
● I/O Errors
● Data corruption
● Index corruption
Tools - Prometheus Stack
● Prometheus
■ A time-series database, suitable for white-box monitoring
● Alertmanager
■ Part of the Prometheus project, used for alerting
● Exporters
■ patroni_exporter, postgres_exporter - export PostgreSQL metrics
■ node_exporter - exposes OS metrics
● Grafana
■ Web frontend for Prometheus data
Prometheus vs Nagios/Icinga
Nagios/Icinga
● Focus on black-box monitoring
● Checks are usually complicated Bash scripts
● Can’t base alerts on relations between different metric types
Prometheus
● Promotes white-box monitoring
● High-performance TSDB
● Cloud-native ready with multiple service discovery providers
● Standardized interface for exporters
Prometheus
● Monitoring based on metrics with metadata (CPU, RAM, disk IO, disk utilization, etc.)
● Custom labels for metrics
● Functions to filter, change, remove, … metadata while fetching them
● Multiple exporters - expose data via HTTP API
● Effective data fetching:
■ Based on intervals measured in seconds
■ Millions of data points
● Notifications can be reported via email, Slack, etc.
https://prometheus.io
Prometheus Pipeline
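How does it work?
● Prometheus sends GET /metrics (HTTP query) to the PSQL exporter
● The PSQL exporter runs an SQL query (as a dedicated user) against PostgreSQL
● The PSQL exporter returns a plain-text response to Prometheus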
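The scrape flow can be sketched with the Python standard library alone. This is an illustrative toy, not the exporter used in the talk: `collect_samples` is a hypothetical stub that returns fixed values where a real exporter would run SQL as its dedicated user, and `serve` with its port argument is likewise an assumption. Only the plain-text line format (`name{label="value"} value`) follows the Prometheus exposition format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(samples) -> str:
    """Render (name, labels, value) tuples as Prometheus plain-text lines."""
    lines = []
    for name, labels, value in samples:
        label_text = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        series = f"{name}{{{label_text}}}" if label_text else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


def collect_samples():
    """Hypothetical stand-in for the SQL queries a real exporter would run."""
    return [
        ("pg_up", {}, 1),
        ("pg_stat_database_numbackends", {"datname": "cms"}, 12),
    ]


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes a single well-known path.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect_samples()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int) -> None:
    """Serve /metrics until interrupted (the port is chosen by the caller)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve` with a free port and pointing a Prometheus scrape job at it would complete the loop described on the slide.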
PromQL
Alert Manager Rules - Latency
- alert: PgSQLSlowLoginQuery
  expr: avg_over_time(pg_stat_statements_mean_time_seconds{datname="cms", queryid="3985044216"}[1m]) > 100
  for: 5m
  labels:
    severity: critical
    team: ops
  annotations:
    summary: "Slow login query (instance {{ $labels.instance }})"
    description: |
      Login queries are too slow (> 30s). Average is {{ $value }}.
      Check the PostgreSQL instance {{ $labels.instance }}
    runbook: pgsql@pgsqlslowloginquery
    title: PgSQLSlowLoginQuery
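The `for: 5m` clause means the expression has to stay true across consecutive evaluations for five minutes before the alert fires; a single evaluation where it is false resets the timer. A simplified Python sketch of that behaviour follows, assuming a list of (timestamp in seconds, condition result) pairs. `first_firing_time` is a hypothetical helper, not a Prometheus API.

```python
from typing import Iterable, Optional, Tuple


def first_firing_time(
    evaluations: Iterable[Tuple[float, bool]], hold_seconds: float
) -> Optional[float]:
    """Return the timestamp at which the alert would start firing, if any."""
    active_since = None  # when the condition most recently became true
    for timestamp, condition_true in evaluations:
        if not condition_true:
            # The condition stopped holding, so the pending alert is dropped.
            active_since = None
            continue
        if active_since is None:
            active_since = timestamp
        if timestamp - active_since >= hold_seconds:
            return timestamp
    return None
```

With one evaluation per minute, a condition that is continuously true from t=0 fires at t=300, while a single false evaluation in between pushes the firing time out by restarting the five-minute window.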
Alert Manager Rules - Latency
- alert: PgSQLReplicationLagIsTooBig
  expr: pg_replication_lag{instance!~"^(ba-patroni|analytics-patroni).*"} > 300 and pg_is_in_recovery == 1
  for: 15m
  labels:
    severity: critical
    team: ops
  annotations:
    description:
      Replication lag on PostgreSQL Slave {{ $labels.instance }} is
      {{ humanizeDuration $value }}. If lag is too big, it might be
      impossible for Slave to recover, it might not be considered as a new
      Patroni leader, or data loss could occur should such Slave be chosen as next
      leader.
    summary: PostgreSQL Slave Replication lag is too big.
    runbook: pgsql#pgsqlreplicationlagistoobig
    title: PgSQLReplicationLagIsTooBig
Alert Manager Rules - Traffic
- alert: PgSQLCommitRateTooLow
  expr: |
    rate(pg_stat_database_xact_commit{datname="oauth", sm_env="prod"}[5m]) < 200
  for: 5m
  labels:
    severity: warn
    team: ops
  annotations:
    description: |
      Commit Rate {{$labels.instance}} for database {{$labels.datname}}
      is {{$value}} which is suspiciously low.
    runbook: pgsql#pgsqlcommitrateislow
    title: PgSQLCommitRateTooLow
Alert Manager Rules - Saturation
- alert: PgSQLNumberOfConnectionsHigh
  expr: (100 * (sum(pg_stat_database_numbackends) by (instance, job) / pg_settings_max_connections)) > 90
  for: 10m
  labels:
    severity: critical
    team: ops
  annotations:
    description:
      Number of active/open connections to PostgreSQL on {{ $labels.instance }}
      is {{ $value }}. It's possible PostgreSQL won't be able to accept any
      new connections.
    summary: Number of active connections to PostgreSQL too high.
    runbook: pgsql#pgsqlnumberofconnectionshigh
    title: PgSQLNumberOfConnectionsHigh
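The expression sums the backends of all databases on an instance and expresses the total as a percentage of `max_connections`. The same arithmetic as a small Python sketch, assuming the per-database backend counts are already at hand; `connection_saturation_pct` and its parameters are hypothetical names for illustration only.

```python
from typing import Iterable


def connection_saturation_pct(backends_per_database: Iterable[int], max_connections: int) -> float:
    """Percentage of max_connections currently used across all databases."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return 100.0 * sum(backends_per_database) / max_connections


def connections_too_high(backends_per_database: Iterable[int], max_connections: int,
                         threshold_pct: float = 90.0) -> bool:
    """True when usage exceeds the threshold used by the rule above (90 %)."""
    return connection_saturation_pct(backends_per_database, max_connections) > threshold_pct
```

For example, 40, 30 and 25 backends against a limit of 100 give 95 %, which crosses the 90 % threshold; the rule would still wait for the `for: 10m` period before firing.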
Alert Manager Rules - Errors
- alert: PgSQLRollbackRateTooHigh
  expr: |
    rate(pg_stat_database_xact_rollback{datname="oauth"}[5m])
    / ON(instance, datname)
    rate(pg_stat_database_xact_commit{datname="oauth"}[5m])
    > 0.05
  for: 5m
  labels:
    severity: warn
    team: ops
  annotations:
    description: |
      Ratio of transactions being aborted compared to committed is
      {{$value | printf "%.2f" }} on {{$labels.instance}}
    runbook: pgsql@pgsqlrollbackrateishigh
    title: PgSQLRollbackRateTooHigh
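Both counters only ever increase, so the rule compares their per-second growth over the same window rather than their raw values. A rough Python sketch of that calculation follows, assuming two (timestamp, counter value) samples per counter. It deliberately ignores the extrapolation and counter-reset handling that the real PromQL `rate()` performs, and `simple_rate` and `rollback_ratio` are hypothetical helper names.

```python
from typing import Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, cumulative counter value)


def simple_rate(first: Sample, last: Sample) -> float:
    """Per-second increase of a counter between two samples (no reset handling)."""
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)


def rollback_ratio(rollbacks: Tuple[Sample, Sample], commits: Tuple[Sample, Sample]) -> float:
    """Rollback rate divided by commit rate over the same window."""
    commit_rate = simple_rate(*commits)
    if commit_rate == 0:
        raise ZeroDivisionError("no commits in the window")
    return simple_rate(*rollbacks) / commit_rate
```

With 30 rollbacks and 300 commits in a five-minute window the ratio is 0.1, above the 0.05 threshold in the rule.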
Alert Manager Rules - Errors
- alert: PgSQLDeadLocks
  expr: rate(pg_stat_database_deadlocks{datname!~"template.*|postgres"}[1m]) > 0
  for: 5m
  labels:
    severity: warn
    team: ops
  annotations:
    description: |
      Deadlocks have been detected on PostgreSQL {{ $labels.instance }}.
      Number of deadlocks: {{ $value }}
    summary: "Deadlocks (instance {{ $labels.instance }})"
    runbook: pgsql#pgsqldeadlocksdetected
    title: PgSQLDeadLocks
Alert Manager Rules - Errors
- alert: PgSQLTableNotVaccumed
  expr: time() - pg_stat_user_tables_last_autovacuum{datname="oauth", sm_env="prod"} > 60 * 60 * 24
  for: 5m
  labels:
    severity: warn
    team: ops
  annotations:
    summary: "Table not vacuumed (instance {{ $labels.instance }})"
    description: |
      Table {{ $labels.relname }} has not been vacuumed for 24 hours
      (last vacuumed {{ humanizeDuration $value }} ago). There may not be enough autovacuum workers.
    runbook: pgsql#pgsqltablenotvacuumed
    title: PgSQLTableNotVaccumed
Grafana
Prometheus caveats
● Beware of labels with high cardinality
○ Significant performance penalty
○ It is not possible to remove labels from DB
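Prometheus stores one time series per unique combination of metric name and label values, so the worst case grows multiplicatively with every label added. A small Python sketch of that upper bound follows; the label names and counts in the usage note are made-up illustrative numbers, and `max_series` is a hypothetical helper.

```python
from math import prod
from typing import Mapping


def max_series(label_cardinalities: Mapping[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())
```

For instance, a metric labelled by 10 instances and 5 databases has at most 50 series, but adding a per-query label with 2,000 distinct values raises the bound to 100,000 series for that single metric.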
Tools - ELK Stack
● Elastic
■ Search and analytics engine
● RabbitMQ
■ OSS message broker, used for log message delivery
● Logstash
■ Server-side data processing pipeline
● Kibana
■ Lets users visualize Elasticsearch data with charts and graphs
Logging Pipeline
Kibana
Elastalert
● Simple framework for alerting on anomalies, spikes, or other patterns of interest in data from Elasticsearch
● Two types of components:
■ Rule types (frequency, spike, flatline, etc.)
■ Alert types (email, Slack, OpsGenie, etc.)
● Alerts can include:
■ Link to Kibana dashboards
■ Aggregate counts for arbitrary fields
■ Combine alerts into periodic reports
■ Intercept and enhance match data
Tracing the API requests
● Trace request across services
● Down to DB statements
Going down to the DB statements
● Comments to the rescue
■ SELECT "user_profiles".* FROM "user_profiles" WHERE "user_profiles"."user_id" = '6881a8eb-5e54-4073-a2ec-a62eb4e8e746' /* 74CA2798:B364_904C6C7E:0050_5E1D733D_12189B7:0428 */
● Instrument the ORM layers to include the tracing information
■ Active Record for Ruby
■ adapter.prepend(::ActiveRecord::Tags::ExecuteWithTags)
● ELK stack to trace the SQL requests
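The comment technique is independent of Ruby: anything that builds the SQL text can append the request ID, and anything that reads the PostgreSQL log can pull it back out. A minimal Python sketch follows, assuming the ID contains only letters, digits, `:`, `_` and `-` as in the example above. `tag_sql` and `extract_trace_id` are hypothetical helpers, not part of any ORM.

```python
import re
from typing import Optional

# Restricting the ID to a safe alphabet prevents it from closing the comment early.
_TRACE_ID = re.compile(r"[0-9A-Za-z:_-]+")
_TRAILING_COMMENT = re.compile(r"/\*\s*([0-9A-Za-z:_-]+)\s*\*/\s*$")


def tag_sql(sql: str, trace_id: str) -> str:
    """Append the trace ID to a statement as a trailing SQL comment."""
    if not _TRACE_ID.fullmatch(trace_id):
        raise ValueError("trace ID contains characters that are unsafe in a comment")
    return f"{sql} /* {trace_id} */"


def extract_trace_id(logged_sql: str) -> Optional[str]:
    """Recover the trace ID from a logged statement, or None if it has none."""
    match = _TRAILING_COMMENT.search(logged_sql)
    return match.group(1) if match else None
```

Validating the ID before embedding it matters because the value often originates from a request header; an unchecked `*/` would let it escape the comment.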
Why: #1 - Watchdog for excellence
● Regularly analyze the queries, identify the weak points
Database queries taking too long based on logs-app-postgresql-2020.01.02-*
See https://kibana.showmax.cc/app/kibana#/discover/c5869b80-fe1d-11e8-a107-856e4e008c55
backend@showmax.com
download
002 1,605: DELETE FROM "download_events" WHERE "download_events"."download_id" = '<param>'
001 1,289: SELECT MAX("downloads"."updated_at") FROM "downloads" WHERE "downloads"."user_id" = '<param>'
001 1,209: SELECT "downloads".* FROM "downloads" WHERE ("downloads"."state" != '<param>') AND "downloads"."master_user_id" = '<param>'
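A report like this needs queries grouped by shape rather than by exact text, which is why the literals appear as '<param>'. The following Python sketch shows one way to do that grouping, assuming the input is a list of (SQL text, duration in milliseconds) pairs already extracted from the logs. The regular expressions cover only simple quoted strings and plain numbers, and `normalize_query` and `summarize` are hypothetical helpers rather than the tooling used for the report above.

```python
import re
from collections import defaultdict
from typing import Iterable, List, Tuple

_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")        # 'text', with '' as an escaped quote
_NUMBER_LITERAL = re.compile(r"\b\d+(?:\.\d+)?\b")     # 42 or 3.14, not digits inside names


def normalize_query(sql: str) -> str:
    """Replace literals with '<param>' so equal query shapes compare equal."""
    sql = _STRING_LITERAL.sub("'<param>'", sql)
    return _NUMBER_LITERAL.sub("'<param>'", sql)


def summarize(entries: Iterable[Tuple[str, float]]) -> List[Tuple[str, int, float]]:
    """Return (query shape, occurrences, slowest duration) sorted slowest first."""
    stats = defaultdict(lambda: [0, 0.0])
    for sql, duration_ms in entries:
        shape = normalize_query(sql)
        stats[shape][0] += 1
        stats[shape][1] = max(stats[shape][1], duration_ms)
    rows = [(shape, count, worst) for shape, (count, worst) in stats.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Sorting by the slowest observed duration puts the weakest query shapes at the top, which matches the purpose of a periodic slow-query report.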
Why: #2 - Easy investigation of failures
● Requests are fully traceable with Kibana queries based on request ID
● Easy to analyze abnormal patterns and trace them back to user actions
Why: #3 - Auditing
● Find out who changed given data
● In addition to the classical audit log, we are able to track the API operations down to SQL statements
Next steps
● Monitoring based on trends
■ Create usual traffic envelope using recording rules
■ Alert on anomalies in the traffic
● Loki
■ “Prometheus for logs” from Grafana
■ Tightly integrated with performance data
● Thanos
■ HA, long-term storage, downsampling
■ Single interface to all Prometheus instances
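The traffic-envelope idea from the first bullet can be illustrated outside Prometheus. The slides do not say how the envelope would be computed, so the sketch below assumes a simple band of mean plus or minus a multiple of the standard deviation over historical samples. `traffic_envelope`, `is_anomalous` and the default width of 3 are illustrative assumptions, and in practice the band would come from recording rules rather than Python.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def traffic_envelope(history: Sequence[float], width: float = 3.0) -> Tuple[float, float]:
    """Lower and upper bound of 'usual' traffic: mean +/- width * std deviation."""
    center = mean(history)
    spread = pstdev(history)
    return center - width * spread, center + width * spread


def is_anomalous(value: float, history: Sequence[float], width: float = 3.0) -> bool:
    """True when the current value falls outside the envelope."""
    low, high = traffic_envelope(history, width)
    return value < low or value > high
```

A band derived from history adapts to each database's normal load, unlike a fixed threshold such as the commit-rate limit of 200 used earlier.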
Links
● Grafana dashboards
■ https://github.com/Showmax/p2d2-2020/tree/master/dashboards
● Alert manager rules
■ https://github.com/Showmax/p2d2-2020/tree/master/alerts
● PostgreSQL exporter queries.yaml
■ https://github.com/Showmax/p2d2-2020/tree/master/postgres_exporter
Come and join us!
We’re looking for new colleagues
tech.showmax.com
Thanks!
Questions?
roman.fiser@showmax.com
