Monitoring with Prometheus
Shiao-An Yuan
2017/04/22
Requirements
1. Able to “see” abnormal behavior on the dashboard
2. Notify me when something abnormal happens
http://play.grafana.org/
Time Series Databases
A time series database (TSDB) is a software system that is optimized for handling
time series data, arrays of numbers indexed by time -- from Wikipedia
● InfluxDB
● OpenTSDB
● Graphite
● Prometheus
https://prometheus.io/docs/introduction/overview/
Node Exporter
$ curl http://localhost:9100/metrics
…
# HELP node_filesystem_avail Filesystem space available to non-root users in bytes.
# TYPE node_filesystem_avail gauge
node_filesystem_avail{mountpoint="/"} 6.301462528e+09
…
Sample format: metric name, {labels}, value
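To make the three parts of a sample line concrete, here is a minimal sketch that splits one line of the exporter output into metric name, labels, and value. `parse_sample` and both regular expressions are hypothetical helpers written for this illustration, not part of Prometheus, and the parsing is simplified: it assumes label values contain no escaped quotes.

```python
import re

# Simplified patterns for one sample line; real parsers also handle escapes,
# timestamps, and comment lines.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Split a sample line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


name, labels, value = parse_sample(
    'node_filesystem_avail{mountpoint="/"} 6.301462528e+09'
)
print(name, labels, value)
```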
Scrape Configs
scrape_configs:
  - job_name: "node"
    scrape_interval: "1m"
    static_configs:
      - targets: ["localhost:9100"]
Before scrape:
node_filesystem_avail{mountpoint="/"}
After scrape:
node_filesystem_avail{instance="localhost:9100",job="node",mountpoint="/"}
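The before/after label sets can be sketched as a small function. This is only an illustration of the idea: `attach_target_labels` is a hypothetical name, and it assumes the scraped series does not already carry its own `job` or `instance` label (conflict handling is left out).

```python
def attach_target_labels(scraped_labels, job, target):
    """Return the label set stored for a series scraped from `target`.

    Assumption: the exporter did not expose `job` or `instance` itself.
    """
    stored = dict(scraped_labels)
    stored["job"] = job          # from job_name in the scrape config
    stored["instance"] = target  # from the target address
    return stored


stored = attach_target_labels({"mountpoint": "/"}, job="node", target="localhost:9100")
print(stored)
```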
https://prometheus.io/blog/2016/09/21/interview-with-compose/
Exporters
https://prometheus.io/docs/instrumenting/exporters/
● Lots of official & 3rd-party exporters
○ Node Exporter
○ JMX Exporter
○ ...
● Directly instrumented software
○ Kubernetes
○ cAdvisor
○ ScyllaDB (C++ implementation of Cassandra)
○ ...
● Client Libraries
○ Go, Java/Scala, Python, Ruby (official)
○ 12+ other languages (3rd-party)
Cassandra - by JMX Exporter
Writing an Exporter
● Metric Types
○ Counter
○ Gauge
○ Histogram
○ Summary
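A rough sketch of how the two simplest metric types differ, using only the standard library. The `Counter`, `Gauge`, and `render` names below are hypothetical stand-ins written for this example, not the official client library API; Histogram and Summary are left out to keep it short.

```python
class Counter:
    """Value that only goes up (it restarts from zero with the process)."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Value that can go up and down, such as free disk space."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)


def render(name, metric_type, help_text, value):
    """Render one unlabeled metric in the text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} {metric_type}\n"
        f"{name} {value}\n"
    )


requests = Counter()
requests.inc()
requests.inc(2)
print(render("demo_requests_total", "counter", "Requests handled.", requests.value))
```

An exporter serves text like this over HTTP on a `/metrics` path, which Prometheus then scrapes.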
Kafka Offset Exporter
https://github.com/sayuan/kafka-offset-exporter
PromQL
> node_filesystem_free
node_filesystem_free{instance="host1:9100",mountpoint="/"} 11111
node_filesystem_free{instance="host1:9100",mountpoint="/root"} 22222
node_filesystem_free{instance="host2:9100",mountpoint="/"} 33333
node_filesystem_free{instance="host2:9100",mountpoint="/root"} 44444
Instant Vector
Instant Vector Selector
> node_filesystem_free{instance="host1:9100"}
node_filesystem_free{instance="host1:9100",mountpoint="/"} 11111
node_filesystem_free{instance="host1:9100",mountpoint="/root"} 22222
Instant Vector
Range Vector Selector
> node_filesystem_free{instance="host1:9100"}[3m]
node_filesystem_free{instance="host1:9100",mountpoint="/"}
11111
11112
11113
node_filesystem_free{instance="host1:9100",mountpoint="/root"}
22222
22223
22224
Range Vector
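The difference between the two selectors can be sketched over a plain list of samples. The timestamps below are made up for illustration, and the details are assumptions: the 5-minute lookback for instant values and the exclusion of the left window boundary are simplifications, not a statement of how every Prometheus version behaves.

```python
# Hypothetical samples for one series: (timestamp in seconds, value),
# scraped once per minute.
series = [(0, 11110), (60, 11111), (120, 11112), (180, 11113)]


def instant_value(samples, at, lookback=300):
    """Most recent sample at or before `at`, ignoring samples older than `lookback`."""
    candidates = [(t, v) for t, v in samples if at - lookback <= t <= at]
    return candidates[-1][1] if candidates else None


def range_values(samples, at, window):
    """All sample values in the window ending at `at` (left boundary excluded here)."""
    return [v for t, v in samples if at - window < t <= at]


print(instant_value(series, at=180))             # one value per series
print(range_values(series, at=180, window=180))  # a list of values per series
```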
Aggregation Operator
> sum(node_filesystem_free{instance="host1:9100"})
{} 33333
Instant Vector (one element, no labels)
Aggregation Operator
> sum(node_filesystem_free) by (instance)
{instance="host1:9100"} 33333
{instance="host2:9100"} 77777
Instant Vector
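The grouping behind `sum(...) by (instance)` can be sketched in a few lines, using the four sample values from the PromQL slide. `sum_by` is a hypothetical helper; it models an instant vector as a list of (labels, value) pairs.

```python
from collections import defaultdict

# Instant vector from the PromQL slide, as (labels, value) pairs.
samples = [
    ({"instance": "host1:9100", "mountpoint": "/"}, 11111),
    ({"instance": "host1:9100", "mountpoint": "/root"}, 22222),
    ({"instance": "host2:9100", "mountpoint": "/"}, 33333),
    ({"instance": "host2:9100", "mountpoint": "/root"}, 44444),
]


def sum_by(vector, by):
    """Sum values per group, keeping only the labels listed in `by`."""
    totals = defaultdict(float)
    for labels, value in vector:
        key = tuple((name, labels.get(name, "")) for name in by)
        totals[key] += value
    return [(dict(key), total) for key, total in totals.items()]


print(sum_by(samples, ["instance"]))
print(sum_by(samples, []))  # no grouping labels: a single element without labels
```

With an empty `by` list every series falls into the same group, so the result is still a vector, just with one element.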
Data Types & Selectors
● Data Types
○ Instant Vector
○ Range Vector
○ Scalar
○ String (unused)
● Selectors
○ Instant Vector Selectors
○ Range Vector Selectors
○ Offset Modifier
■ node_filesystem_free offset 5m
Operations
● Arithmetic operators
○ +, -, *, /, %, ^
● Comparison operators
○ ==, !=, >, <, >=, <=
● Logical/set operators
○ and, or, unless
● Aggregation operators
○ sum, min, max, avg, stddev, stdvar
○ count, count_values, bottomk, topk, quantile
Functions
● day_of_month(), day_of_week(), days_in_month(), hour(),
minute(), month(), time(), year()
● abs(), ceil(), exp(), floor(), ln(), log10(), log2(), round(),
sqrt()
● absent(), changes(), clamp_max(), clamp_min(), count_scalar(),
delta(), deriv(), drop_common_labels(), histogram_quantile(),
holt_winters(), idelta(), increase(), irate(),
label_replace(), predict_linear(), rate(), resets(), scalar(),
sort(), sort_desc(), vector(), <aggregation>_over_time()
Query Example: Disk Usage
1 - ( node_filesystem_avail{instance="localhost:9100"}
/ node_filesystem_size{instance="localhost:9100"} )
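The division above pairs each `node_filesystem_avail` series with the `node_filesystem_size` series that has the same labels. A minimal sketch of that one-to-one matching follows; `disk_usage` is a hypothetical helper and the byte counts are invented for the example.

```python
def disk_usage(avail, size):
    """Compute 1 - avail/size for every pair of series with identical labels.

    Both arguments are lists of (labels, value); series without a partner
    on the other side are dropped.
    """
    size_by_labels = {frozenset(labels.items()): value for labels, value in size}
    result = []
    for labels, value in avail:
        key = frozenset(labels.items())
        if key in size_by_labels:
            result.append((labels, 1 - value / size_by_labels[key]))
    return result


# Invented numbers: 25 of 100 bytes available on "/", no size series for "/data".
avail = [({"mountpoint": "/"}, 25.0), ({"mountpoint": "/data"}, 10.0)]
size = [({"mountpoint": "/"}, 100.0)]
print(disk_usage(avail, size))
```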
Query Example: Network Traffic
irate(node_network_receive_bytes{instance="localhost:9100"}[5m])
-irate(node_network_transmit_bytes{instance="localhost:9100"}[5m])
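`irate()` looks only at the last two samples inside the range. A minimal sketch of that calculation, with counter resets handled in a simplified way; the function mirrors the PromQL name for readability but is a stand-alone illustration with invented sample data.

```python
def irate(samples):
    """Per-second rate from the last two samples of a counter.

    `samples` is a list of (timestamp_seconds, counter_value), oldest first.
    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last == t_prev:
        return None
    # If the counter went down it was reset, so the last value is the increase.
    delta = v_last if v_last < v_prev else v_last - v_prev
    return delta / (t_last - t_prev)


print(irate([(0, 100), (60, 160), (120, 280)]))  # only the last two samples count
```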
Query Example: CPU Usage
avg by (mode)(
irate(node_cpu{instance="localhost:9100", mode!="idle"}[5m]))
Alert Rules
ALERT DiskUsageOver80Percent
IF node_filesystem_avail / node_filesystem_size < 0.2
FOR 5m
Alert Rules
ALERT DiskUsageOver80Percent
IF node_filesystem_avail / node_filesystem_size < 0.2
FOR 5m
LABELS { severity = "critical" }
ANNOTATIONS {
description = "{{ $labels.instance }} disk usage is over 80%.",
link = "<Grafana URL>"
}
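The `FOR 5m` clause means the condition has to stay true for five minutes before the alert fires; until then it is pending. A small sketch of that state machine, assuming a hypothetical fixed evaluation interval (the `alert_state` helper and its parameters are made up for this example):

```python
def alert_state(condition_history, for_seconds, eval_interval):
    """State after the last evaluation: "inactive", "pending" or "firing".

    `condition_history` holds one boolean per rule evaluation, oldest first.
    """
    active_for = None  # seconds since the condition first became true
    state = "inactive"
    for is_true in condition_history:
        if not is_true:
            active_for = None
            state = "inactive"
            continue
        active_for = 0 if active_for is None else active_for + eval_interval
        state = "firing" if active_for >= for_seconds else "pending"
    return state


# FOR 5m with one evaluation per minute
print(alert_state([True] * 5, for_seconds=300, eval_interval=60))
print(alert_state([True] * 6, for_seconds=300, eval_interval=60))
```

A single false evaluation resets the timer, which is what filters out short spikes.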
Alert Rules
ALERT DiskWillFillIn4Hours
IF predict_linear(node_filesystem_free[1h], 4*3600) < 0
FOR 5m
https://www.robustperception.io/reduce-noise-from-disk-space-alerts/
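`predict_linear()` fits a straight line through the samples in the range and extrapolates it. A minimal least-squares sketch of the idea follows. It is a stand-alone illustration with invented data, and it assumes the prediction starts from the timestamp of the last sample rather than from the rule evaluation time.

```python
def predict_linear(samples, seconds_ahead):
    """Extrapolate a least-squares line `seconds_ahead` past the last sample.

    `samples` is a list of (timestamp_seconds, value), oldest first.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        return None
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


# Invented data: free space drops by 1000 bytes every 30 minutes.
free_bytes = [(0, 4000.0), (1800, 3000.0), (3600, 2000.0)]
predicted = predict_linear(free_bytes, 4 * 3600)
print(predicted, predicted < 0)  # a negative prediction would trigger the alert
```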
Alert Rules for Kafka
● Are brokers ingesting messages?
● Has traffic increased or decreased sharply compared with the last hour or yesterday?
● Is the consumer lag over 30 minutes?
https://prometheus.io/docs/introduction/overview/
Alert Manager
route:
  routes:
    - match:
        severity: 'warning'
      receiver: 'email'
  receiver: 'slack'
receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#alert'
        api_url: 'https://hooks.slack.com/services/...'
        text: '{{ .CommonAnnotations.description }} {{ .CommonAnnotations.link }}'
  ...
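How the routing tree above picks a receiver can be sketched as follows: the first child route whose `match` labels all equal the alert's labels wins, otherwise the alert falls back to the top-level receiver. `pick_receiver` is a hypothetical helper, and the sketch ignores grouping, regex matchers, and other Alertmanager routing options.

```python
def pick_receiver(route, alert_labels, inherited=None):
    """Return the receiver name for an alert, walking the routing tree."""
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        matchers = child.get("match", {})
        if all(alert_labels.get(name) == value for name, value in matchers.items()):
            return pick_receiver(child, alert_labels, receiver)
    return receiver


# Same structure as the config above, written as a Python dict.
route = {
    "receiver": "slack",
    "routes": [
        {"match": {"severity": "warning"}, "receiver": "email"},
    ],
}

print(pick_receiver(route, {"severity": "warning"}))
print(pick_receiver(route, {"severity": "critical"}))
```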
Alert Manager
● Routing
● Grouping
● Inhibition
● Silences
● Receivers
○ Slack, Email, Webhook, ...
“My Philosophy on Alerting”
https://docs.google.com/a/boxever.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
Rob Ewaschuk (former Site Reliability Engineer at Google)
● Pages should be urgent, important, actionable, and real.
● Over-monitoring is a harder problem to solve than under-monitoring.
● Cause vs. symptom
● Every page should require intelligence to deal with.
Pull doesn’t scale?
● Prometheus is not an event-based system
● No spawning subprocess
● A single big Prometheus server can easily store 800,000
incoming samples per second
● Federation
https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/
https://www.robustperception.io/scaling-and-federating-prometheus/
Alert Alternatives
● Nagios
● Grafana
● TICK (Telegraf, InfluxDB, Chronograf, Kapacitor)
● ELK+Beats (Elasticsearch, Logstash, Kibana)
Conclusion
● Prometheus is easy, but monitoring is difficult
● Read all the documentation on the official site and blog
● Keep improving the monitoring & alert rules
