Thank you
Hosted by Sponsored by
Today
● Agentless monitoring with Icinga and Prometheus by Diogo Machado
● Coffee break and networking
Agentless monitoring with Icinga and
Prometheus
Diogo Machado
dgm@eurotux.com
04/11/2019
DevOps Braga #15
Agenda
● From Icinga to Prometheus
● Prometheus Basic Concepts
● Prometheus Server Configuration
● Getting data into Prometheus
● Implement custom metrics
● How to integrate Icinga with Prometheus?
From Icinga to Prometheus - Introduction
Icinga:
● Open-source computer system and network monitoring application;
● Monitors the availability of hosts and services;
● Distributed monitoring;
● Agent-based monitoring;
● Notifications and downtimes;
● Written in PHP and C++;
Prometheus:
● Open-source systems monitoring and alerting toolkit;
● Monitoring of highly service-oriented architectures;
● Collects metrics with Exporters;
● Stores aggregated metrics;
● Written in Go;
From Icinga to Prometheus - Comparison
Icinga:
● Alerting based on the exit codes of checks;
● Host-based;
● No notion of labels or a query language;
● No storage per se, beyond the check state;
● All configuration is made via files;
● Monitoring of small and/or static systems where blackbox probing is sufficient.
Prometheus:
● Multi-dimensional data model with time series data;
● Rule-based alerts;
● Prometheus Query Language - PromQL;
● Centralized data store;
● Suitable for dynamic or cloud-based environments - whitebox monitoring;
From Icinga to Prometheus - Comparison
If both are for system monitoring…
Why not choose only one?
What does Prometheus have that Icinga doesn't have?
Why should we combine both?
From Icinga to Prometheus - Conclusion
So … Why should we combine both?
● Prometheus has:
○ Exporters for third-party systems and applications;
○ Centralized control and an HTTP API;
● Icinga has:
○ Easy configuration of hosts and services (Icinga Director);
○ A good alerting system with notifications and scheduled downtimes;
Combine to get the best of each one:
● Scrape metrics with Prometheus
● Configure and Alert with Icinga
Prometheus's main features are:
● multi-dimensional data model with time series data identified by metric name and key/value pairs;
● PromQL, a flexible query language to leverage this dimensionality;
● no reliance on distributed storage - single server nodes are autonomous;
● time series collection happens via a pull model over HTTP;
● pushing time series is supported via an intermediary gateway;
● targets are discovered via service discovery or static configuration;
● support for graph and dashboard mode.
Prometheus Basic Concepts
Prometheus Basic Concepts - Architecture
The Prometheus ecosystem consists of
multiple components:
● the main Prometheus server which
scrapes and stores time series data;
● a Pushgateway for supporting
short-lived jobs;
● specific exporters for services like
HAProxy, StatsD, Graphite, etc;
● an Alertmanager to handle alerts;
● data visualization tools like Grafana;
Prometheus Basic Concepts - Data Model
● Data is stored as time series: streams of
timestamped values belonging to the same
metric and the same set of labeled dimensions;
● Key Value Data Model:
○ Key - Metric name and a set of labels;
○ Value - Metric measurement;
Prometheus Basic Concepts - Metric Types
Counter
● Cumulative metric;
● Monotonically increasing
counter;
● Examples:
○ Nº requests served;
○ Nº tasks completed;
○ Number of errors;
Gauge
● Numerical value that can
arbitrarily go up or down;
● Examples:
○ Temperature;
○ Memory usage;
○ Nº concurrent requests;
Histogram
● Values are aggregated in
buckets;
● Expose total sum of all
observed values;
● Count of events that have
been observed;
● Example:
○ Request latency;
Summary
● Similar to a histogram;
● Calculates configurable
quantiles over a sliding
time window;
● Examples:
○ Request durations;
○ Response sizes;
requests_time_seconds_bucket{app="projectx",le="0.005"} 2343340162
… (buckets)
requests_time_seconds_sum{app="projectx"} 5.366133242442994e+07
requests_time_seconds_count{app="projectx"} 3973861256
go_gc_duration_seconds{quantile="0"} 4.274e-05
… (quantiles)
go_gc_duration_seconds_sum 0.467543895
go_gc_duration_seconds_count 92
Histogram Summary
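A histogram's quantiles can be estimated from its cumulative bucket counts, which is the idea behind PromQL's histogram_quantile() function. A minimal sketch of that interpolation (the bucket bounds and counts below are made-up values, not from the outputs above):

```python
def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by bound,
    mirroring the `le` label of a Prometheus histogram.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical request-latency buckets: seconds -> cumulative observations.
buckets = [(0.005, 10), (0.01, 50), (0.05, 90), (0.1, 100)]
p50 = estimate_quantile(0.5, buckets)  # falls inside the (0.005, 0.01] bucket
```

This is why histograms are cheap to aggregate across instances, while summaries (which pre-compute quantiles client-side) are not.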
Prometheus Basic Concepts - Jobs and Instances
● An endpoint you can scrape is called an Instance;
● A collection of instances with the same purpose is called a Job;
● Prometheus scrapes a Target and attaches labels automatically: job name, instance host and port;
● Example of job with 3 instances:
job 1:
  instance 1: 128.0.0.1:9030
  instance 2: 128.0.0.2:9030
  instance 3: 128.0.0.3:9030
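The automatic label attachment can be pictured as merging target metadata into every scraped sample. An illustrative sketch (names are made up; the real behavior around conflicting labels is governed by honor_labels, which is simplified away here):

```python
def attach_target_labels(sample_labels, job, instance):
    """Merge the job/instance labels Prometheus attaches at scrape time.

    Simplification: labels already present on the scraped sample win;
    Prometheus's actual conflict handling depends on honor_labels.
    """
    merged = {"job": job, "instance": instance}
    merged.update(sample_labels)
    return merged

labels = attach_target_labels({"mode": "idle"}, job="node", instance="128.0.0.1:9030")
```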
Prometheus Basic Concepts - PromQL
Expression language data types:
● Instant vector - a set of time series containing a single sample for each time series;
○ Example: http_requests_total{environment=~"staging|development",method!="GET"}
● Range vector - a set of time series containing a range of data points for each time series;
○ Example: http_requests_total{job="prometheus"}[5m]
● Scalar - a simple numeric floating point value;
○ Example: -2.43
● String - a simple string value;
○ Example: 'these are unescaped: \n \\ \t'
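The label matchers inside an instant vector selector (=, !=, =~, !~) can be sketched as simple filters over series labels. A toy version (not the real PromQL engine; note that PromQL regex matchers are fully anchored, hence fullmatch):

```python
import re

def matches(labels, matcher):
    """Apply one PromQL-style label matcher to a series' label set."""
    name, op, value = matcher
    actual = labels.get(name, "")
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        return re.fullmatch(value, actual) is not None  # anchored, like PromQL
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(op)

series = [
    {"__name__": "http_requests_total", "environment": "staging", "method": "POST"},
    {"__name__": "http_requests_total", "environment": "production", "method": "GET"},
]
# Mirrors: http_requests_total{environment=~"staging|development",method!="GET"}
selected = [s for s in series
            if matches(s, ("environment", "=~", "staging|development"))
            and matches(s, ("method", "!=", "GET"))]
```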
Prometheus Basic Concepts - REST API
● Response format is JSON;
● Methods: GET, POST;
● Endpoints:
○ /api/v1/query
○ /api/v1/query_range
○ /api/v1/series
○ /api/v1/labels
○ /api/v1/targets
○ /api/v1/rules
○ /api/v1/targets/metadata
○ /api/v1/status/config
○ /api/v1/status/flags
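The /api/v1/query endpoint returns a JSON document with a status field and a typed result. A sketch of parsing an instant-vector response; the payload below is hand-written in the documented response shape, not captured from a real server:

```python
import json

# Hand-written sample in the shape /api/v1/query returns for an instant vector.
payload = '''{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "up", "job": "node", "instance": "128.0.0.1:9030"},
       "value": [1572864000, "1"]}
    ]
  }
}'''

def parse_instant_vector(text):
    """Extract (labels, timestamp, float value) triples from a query response."""
    doc = json.loads(text)
    assert doc["status"] == "success"
    assert doc["data"]["resultType"] == "vector"
    out = []
    for item in doc["data"]["result"]:
        ts, value = item["value"]
        out.append((item["metric"], ts, float(value)))  # sample values arrive as strings
    return out

samples = parse_instant_vector(payload)
```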
Prometheus Basic Concepts - Scrape Metrics
● Prometheus works essentially by pulling metrics from targets;
● Pulling over HTTP offers a number of advantages:
○ Easily tell if a target is down;
○ Manually inspect target health with a web browser.
● However, a push model can be implemented with the Pushgateway:
○ An intermediary service that allows pushing metrics from jobs that cannot be scraped;
○ Disadvantages:
■ The Pushgateway becomes both a single point of failure (SPOF) and a potential
bottleneck;
■ You lose Prometheus's automatic instance health monitoring via the up metric.
Prometheus Server - Installation
● Using pre-compiled binaries;
● From source (Makefile);
● Using Docker (Quay.io or Docker Hub)
● Using configuration management systems:
○ Ansible
○ Chef
○ Puppet
Docker command:
docker run -p 9090:9090 -v /prometheus-data \
prom/prometheus --config.file=prometheus.yml
Dockerfile:
FROM prom/prometheus
ADD prometheus.yml /etc/prometheus/
Build:
docker build -t my-prometheus .
docker run -p 9090:9090 my-prometheus
Prometheus Server - Configuration
● Prometheus is configured via command-line flags and
a configuration file: prometheus.yml;
● Prometheus default port is 9090;
● Prometheus can reload its configuration at runtime:
○ SIGHUP to the Prometheus process;
○ HTTP POST request to the “/-/reload” endpoint (when
the --web.enable-lifecycle flag is enabled).
● Recording rules and Alerting rules should be written in
a YAML file (rule_files);
● It’s possible to use service discovery mechanism to
automatically update scrape target list;
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    environment: tst

rule_files:
  - /opt/prometheus/rules/*.rules

scrape_configs:
  - job_name: node
    scrape_interval: 30s
    metrics_path: /metrics
    scheme: http
    ec2_sd_configs:
      - endpoint: ""
        region: eu-west-3
        refresh_interval: 1m
        port: 9100

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
      scheme: http
      timeout: 10s
      api_version: v1
Prometheus Server - Configuration
Consult the Prometheus configuration via Web browser
Prometheus Server - Targets and Rules
Targets and Rules defined in the Prometheus configuration
Getting data into Prometheus
● Data can be exported to Prometheus using the Pushgateway or via an Exporter;
● Prometheus has many Exporters, among which stand out:
○ Node/system metrics exporter;
○ MySQL server exporter;
○ Memcached exporter;
○ Kafka exporter;
○ HAProxy exporter;
○ Tomcat exporter;
○ AWS CloudWatch exporter;
● Each Exporter has a default allocated port and a route ('/metrics');
● Besides official and community Exporters, it’s possible to write custom Exporters;
● Most exporters can be run as a service or using Docker;
Getting data into Prometheus - Pushgateway
● Pushgateway isn't an event store but an intermediary service to push metrics from jobs that can't be
scraped;
● Deploy using a binary or the Docker image:
○ Example: docker run -d -p 9091:9091 prom/pushgateway
● To change the listen address, use the "--web.listen-address" flag;
● By default, Pushgateway doesn't persist metrics. To cache them, use the "--persistence.file" option;
● All pushes are done via HTTP. The interface is REST-like. Metrics are available on the '/metrics' route;
● Examples:
○ echo "some_metric 3.14" | curl --data-binary @- http://localhost:9091/metrics/job/job1
○ curl -X DELETE http://localhost:9091/metrics/job/job1
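A push is just an HTTP request whose URL path encodes the grouping labels and whose body is the text exposition format. A sketch that builds both, without contacting a real Pushgateway (function and parameter names are illustrative):

```python
def pushgateway_request(host, job, metrics, labels=None):
    """Build the URL and body for a Pushgateway push (no network I/O here).

    metrics: mapping of metric name -> value.
    labels: extra grouping labels, which become /<name>/<value> path segments.
    """
    path = f"/metrics/job/{job}"
    for name, value in (labels or {}).items():
        path += f"/{name}/{value}"
    # Text exposition format: one "name value" line per metric, newline-terminated.
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    return f"http://{host}{path}", body

url, body = pushgateway_request("localhost:9091", "job1", {"some_metric": 3.14})
```

This mirrors the curl example above: the same URL and the same one-line body.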
Getting data into Prometheus - Node Exporter
● Exposes hardware and operating-system metrics based on collectors, usually running on port 9100;
● By default, there is a specific set of collectors for each operating system: cpu, diskstats, filesystem,
netstat, nfs, textfile, …
● Others can be enabled with the corresponding collector flag: ntp, processes, systemd, …
○ Example: --collector.processes
● The exporter can be run using a third-party repository for RHEL/CentOS/Fedora (COPR), from the source
code or using Docker;
○ Example:
docker run -d --net="host" --pid="host" --cap-add=SYS_TIME -v "/:/host:ro" prom/node-exporter --path.rootfs=/host
--collector.ntp --collector.processes --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
Implement custom metrics on Node Exporter
● Custom metrics can be implemented with textfile collector;
● Textfile collector is similar to Pushgateway, in that it allows exporting of statistics from batch jobs;
● Pushgateway should be used for service-level metrics, while textfile should be used for machine metrics;
● The collector will parse all files in textfile directory matching the glob *.prom using the text format;
● Example to automatically push logged in users on machine:
echo node_users_logged_in $(who /host/var/run/utmp | wc -l) > /var/lib/node_exporter/textfile_collector/users.prom.$$
mv /var/lib/node_exporter/textfile_collector/users.prom.$$ /var/lib/node_exporter/textfile_collector/users.prom
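The write-then-rename in the shell example keeps the collector from ever scraping a half-written file. The same atomic update as a short sketch (the function name is illustrative; a throwaway directory stands in for the real textfile path):

```python
import os
import tempfile

def write_prom_file(directory, filename, metrics):
    """Atomically write metrics in text format for the textfile collector.

    Writing to a temp file and renaming mirrors the users.prom.$$ trick above:
    the collector only ever sees complete files.
    """
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        for name, value in metrics.items():
            f.write(f"{name} {value}\n")
    os.replace(tmp_path, os.path.join(directory, filename))  # atomic on POSIX

# Example, using a throwaway directory instead of the real collector path:
tmpdir = tempfile.mkdtemp()
write_prom_file(tmpdir, "users.prom", {"node_users_logged_in": 3})
```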
Getting data into Prometheus - Node Exporter
Example of metrics available at the "/metrics" endpoint
Prometheus query - Node Exporter Data
Example of a PromQL query to check CPU usage
How to integrate Icinga with Prometheus?
● Icinga can be integrated with Prometheus via Nagitheus (Claranet);
● Nagitheus is a Nagios plugin for querying Prometheus, written in Go;
● Nagitheus processes vector or scalar results and returns an exit code according to the warning/critical
values and the comparison method (ge, gt, le, lt);
● Supports basic authentication to Prometheus with username and password (-u and -pw options);
● Example:
/usr/lib64/nagios/plugins/nagitheus -H http://localhost:9090 -i 10.0.0.10 -p 9100 -l 'Check CPU' -d yes -q
'(avg by (mode) (irate(node_cpu_seconds_total{instance="", mode!="idle"}[5m])) * 100)' -m ge -w 70 -c 80
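The core of what such a plugin does is map a query result to a Nagios/Icinga exit code using the comparison method. A simplified sketch of that threshold logic (not Nagitheus itself, which is written in Go):

```python
import operator

# Nagios/Icinga exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.
METHODS = {"ge": operator.ge, "gt": operator.gt,
           "le": operator.le, "lt": operator.lt}

def nagios_exit_code(value, method, warning, critical):
    """Compare a scalar query result against thresholds, Nagitheus-style.

    Critical is checked first, so a value crossing both thresholds is CRITICAL.
    """
    cmp = METHODS[method]
    if cmp(value, critical):
        return 2
    if cmp(value, warning):
        return 1
    return 0

# CPU usage of 75% with -m ge -w 70 -c 80 -> WARNING (exit code 1)
code = nagios_exit_code(75.0, "ge", warning=70, critical=80)
```

The le method inverts the comparison, which is what the disk check below relies on: low percentages of free space are the bad ones.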
How to integrate Icinga with Prometheus?
● Basic Linux Checks:
○ CPU
○ Disk
○ Load
○ Memory
○ Total procs
○ Ntp
● Steps to implement a check:
○ Identify metrics to use on query;
○ Create query considering PromQL operators and metric types;
○ Test the query on Prometheus and define the label, method and critical/warning values;
○ Implement the Icinga command using the query and the values specified for each option.
How to integrate Icinga with Prometheus?
● Check Disk (Percentage of disk free):
1. Metrics to use on check:
○ node_filesystem_free_bytes
○ node_filesystem_size_bytes
2. Data types: Instant vectors;
3. PromQL operators:
○ / (division)
○ * (multiplication)
4. Query:
(node_filesystem_free_bytes / node_filesystem_size_bytes{fstype!~"none|tmpfs|sysfs", mountpoint=~"/var/.*", instance="172.27.68.163:9030"} )* 100
5. Analyze the query result and define the method, critical and warning values
# HELP node_filesystem_free_bytes Filesystem free space in bytes.
# TYPE node_filesystem_free_bytes gauge
node_filesystem_free_bytes{device="shm",fstype="ext4",mountpoint="/var"} 1.01808578e+10
…
# HELP node_filesystem_size_bytes Filesystem size in bytes.
# TYPE node_filesystem_size_bytes gauge
node_filesystem_size_bytes{device="shm",fstype="ext4",mountpoint="/var"} 1.05017712e+10
…
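With the sample values above, the query result works out as follows (worked arithmetic, using the same numbers as the metric output):

```python
free_bytes = 1.01808578e+10   # node_filesystem_free_bytes for /var
size_bytes = 1.05017712e+10   # node_filesystem_size_bytes for /var

percent_free = free_bytes / size_bytes * 100
# Roughly 96.9% free: with method le, warning 20 and critical 10, the check is OK.
```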
How to integrate Icinga with Prometheus?
● Query Result
Method: le Warning: 20% Critical: 10%
● Icinga Command:
/usr/lib64/nagios/plugins/nagitheus -H http://localhost:9090 -i 172.27.68.163 -p 9030 -l 'Check disk' -d yes -m le -w 20 -c 10 -q
'(node_filesystem_free_bytes / node_filesystem_size_bytes{fstype!~"none|tmpfs|sysfs", mountpoint=~"/var/.*"} )* 100'
Questions?
  • 31. How to integrate Icinga with Prometheus? ● Query Result Method: le Warning: 20% Critical:10% ● Icinga Command: /usr/lib64/nagios/plugins/nagitheus -H http://localhost:9090 -i 172.27.68.163 -p 9030 -l 'Check disk' -d yes -m le -w 20 -c 10 -q '(node_filesystem_free_bytes / node_filesystem_size_bytes{fstype!~"none|tmpfs|sysfs", mountpoint=~"/var/.*"} )* 100' Warning (20) Critical (10)