OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad Hoffmann

Hardware-level data-center
monitoring with Prometheus
Conrad Hoffmann

Outline
I. Our data-center
II. Brief intro to Prometheus
III. All my exporters
IV. TL;DR & Soon™

I. Our data-center
AMS5

2118 servers
56 racks
200 network devices
2 * 2 generic uplinks
3 AWS Direct Connect
3 Google X-Connect

Where we started...
[Slide of tool logos; surviving text: "& NRPE", "CloudWatch", "Cacti"]

What’s paging you at night?
Collection Visualization Alerting
Cacti ✔ ✔ ✔
CloudWatch ✔ ✔ ✔
Ganglia ✔
Graphite ✔ ✔
Icinga/Nagios ✔ ✔ ✔
Smokeping ✔ ✔ ✔
Statsd ✔

https://xkcd.com/927/

II. Brief intro to Prometheus
prometheus.io

The Promise of Prometheus
Prometheus is a reliable, scalable, flexible monitoring and alerting
system that is easy to integrate and focused on real-time metrics.

Prometheus: reliability
● Pull-based (“scrape”)
● List of known targets
○ Can be dynamic, e.g. DNS or service discovery
● Built-in meta-monitoring
● Redundancy is easy

Prometheus: scalability
● Performant, efficient storage
● Scales well to available resources
● Easy to scale horizontally
● Federation

Prometheus: flexibility
● Multi-dimensional, label-based data model
● Each data point is defined by
○ A metric name
○ An arbitrary number of key-value pairs (labels)
○ A value
○ A timestamp (added by Prometheus)
● Data points with identical metric names and labels form a time series
● Powerful query language allows for easy aggregation based on labels

Prometheus: flexibility
Target exposes:
http_responses_total{backend="foo",code="2xx"} 804
http_responses_total{backend="foo",code="4xx"} 3170
http_responses_total{backend="bar",code="2xx"} 6637
http_responses_total{backend="bar",code="4xx"} 26
Possible query:
sum(http_responses_total{backend="foo"})
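To make the label model concrete, here is a minimal Python sketch of what the query above computes. It is not how Prometheus evaluates PromQL; the parser is deliberately simplified (it ignores escaping, comments with metadata, and samples without labels), and the helper names `parse_samples` and `sum_matching` are made up for this illustration.

```python
import re

# Sample lines copied from the slide (Prometheus text exposition format).
EXPOSITION = '''\
http_responses_total{backend="foo",code="2xx"} 804
http_responses_total{backend="foo",code="4xx"} 3170
http_responses_total{backend="bar",code="2xx"} 6637
http_responses_total{backend="bar",code="4xx"} 26
'''

# Simplified patterns: metric name, label block, value. No escape handling.
SAMPLE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # the sketch skips anything it does not understand
        name, raw_labels, value = match.groups()
        samples.append((name, dict(LABEL_RE.findall(raw_labels)), float(value)))
    return samples


def sum_matching(samples, name, **matchers):
    """Rough equivalent of sum(name{label="value", ...}) at one instant."""
    return sum(
        value
        for sample_name, labels, value in samples
        if sample_name == name
        and all(labels.get(key) == want for key, want in matchers.items())
    )


SAMPLES = parse_samples(EXPOSITION)
# Mirrors: sum(http_responses_total{backend="foo"})
FOO_TOTAL = sum_matching(SAMPLES, "http_responses_total", backend="foo")
```

Because every sample carries its labels, the same data can be sliced another way without changing the target, e.g. `sum_matching(SAMPLES, "http_responses_total", code="2xx")`.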

Prometheus: ease of integration
● Data format is text based
● Scrapes are HTTP requests
● Many integrations exist already
● Excellent tooling/libraries to write new ones
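Since a scrape is an HTTP GET that returns plain text, a toy target fits in a few lines. The following is a sketch using only the Python standard library; the metric name `demo_widgets_total`, the port, and the helper names are assumptions made for illustration. Real integrations would normally use one of the official client libraries instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical demo data: (metric name, labels, value).
DEMO_SAMPLES = [("demo_widgets_total", {"colour": "blue"}, 3)]


def render_metrics(samples):
    """Render samples as text exposition lines (no escaping, no HELP/TYPE)."""
    lines = []
    for name, labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers GET /metrics with the rendered samples."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(DEMO_SAMPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Blocks forever; the port is an arbitrary choice for this sketch."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a scrape job at `127.0.0.1:8000` would be enough for Prometheus to collect the demo metric.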

[Diagram: scrape targets. Application; Host with node exporter; Host with SNMP exporter in front of Router A and Router B (Network)]

Alertmanager
Nomen est omen (the name says it all)...
● Alerting
● Silencing
● Alert grouping & routing
● High availability

Grafana
grafana.com
Displays data from many sources:
● Prometheus
● Graphite
● Influx
● OpenTSDB
● Elasticsearch
● MySQL/Postgres
● CloudWatch
● ...

III. All my exporters
Now with Protips!

Node exporter
● Exports: OS- and hardware-level metrics for running systems
● Replaces: Ganglia, some Icinga/NRPE checks
● Noteworthy:
○ Comes with many collectors built-in
○ Use WMI exporter on Windows

Protip I
Use the node exporter’s text file collector as an easy integration point for
custom metrics!
Examples: Chef data, RAID controller data, SMART data, cron jobs, ...
[Diagram: on the host, a script writes a text file that the node exporter reads]
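The script side of Protip I could look roughly like this. The node exporter's textfile collector reads files ending in `.prom` from the directory given by its `--collector.textfile.directory` flag, so the script only has to drop a file there. This is a sketch: the function name, the file name, and the metric in the usage note are hypothetical, not taken from the talk.

```python
import os
import tempfile


def write_textfile_metrics(directory, filename, lines):
    """Atomically write metric lines to <directory>/<filename>.

    Writing to a temporary file and renaming it means the node exporter
    never reads a half-written file.
    """
    if not filename.endswith(".prom"):
        raise ValueError("the textfile collector only reads files ending in .prom")
    final_path = os.path.join(directory, filename)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write("\n".join(lines) + "\n")
        # mkstemp creates the file with mode 0600; make it readable for the
        # (possibly different) user the node exporter runs as.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

A cron job could end with something like `write_textfile_metrics(textfile_dir, "backup.prom", ["cron_backup_last_success_timestamp_seconds 1528790400"])`, where `textfile_dir` is whatever directory the collector is configured to read.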

Blackbox exporter
● Exports: data about probes against endpoints that don’t support
Prometheus natively (DNS, HTTP(S), ICMP, TCP)
● Replaces: Smokeping, some Icinga checks
● Noteworthy:
○ Monitor TLS certificate expiry :)

Blackbox exporter - Smokeping replacement
1. Send ICMP probe every five seconds

2. Alert on target down and packet loss
ALERT SmokepingTargetDown
  IF probe_success{job="smokeping"} == 0
  FOR 2m
ALERT SmokepingTargetPacketLoss
  IF 100 * (1 - avg_over_time(probe_success{job="smokeping"}[2m])) > 20

3. Use Prometheus aggregation functions in Grafana
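The packet-loss expression in step 2 is plain arithmetic over the 0/1 `probe_success` samples inside the two-minute window. A small Python sketch of the same calculation follows (function names are made up here; with a probe every five seconds, the window holds roughly 24 samples):

```python
def packet_loss_percent(probe_results):
    """Mirror of 100 * (1 - avg_over_time(probe_success[window])).

    probe_results is the list of 0/1 probe_success samples in the window.
    """
    if not probe_results:
        raise ValueError("no samples in window")
    return 100 * (1 - sum(probe_results) / len(probe_results))


def should_alert(probe_results, threshold=20):
    """True when packet loss in the window exceeds the threshold (percent)."""
    return packet_loss_percent(probe_results) > threshold


# 24 probes in two minutes at a 5 s interval.
FOUR_LOST = [1] * 20 + [0] * 4   # about 16.7 % loss: below the threshold
FIVE_LOST = [1] * 19 + [0] * 5   # about 20.8 % loss: above the threshold
```

Under these assumptions the alert needs at least five failed probes out of 24 before it fires, which gives a feel for how sensitive the 20 % threshold is.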

Protip II
Scrape more, scrape faster!
● ~ 1M metrics
● > 5000 targets
● Mostly 10s scrape interval, some 5s, some longer
● 50 days retention time
● 250 GB storage ¯_(ツ)_/¯
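As a rough back-of-envelope check of these numbers, the sketch below estimates the storage cost per sample. It rests on loud assumptions that are not in the talk: "~1M metrics" is read as one million active time series, every series is treated as scraped every 10 seconds, and 1 GB is taken as 10^9 bytes. The result is an order-of-magnitude figure only.

```python
# Assumptions (not from the talk): see the lead-in above.
ACTIVE_SERIES = 1_000_000          # "~1M metrics", read as time series
SCRAPE_INTERVAL_SECONDS = 10       # "mostly 10s"
RETENTION_DAYS = 50
STORAGE_BYTES = 250 * 10**9        # "250 GB", decimal gigabytes

SAMPLES_PER_SECOND = ACTIVE_SERIES // SCRAPE_INTERVAL_SECONDS
RETENTION_SECONDS = RETENTION_DAYS * 24 * 60 * 60
TOTAL_SAMPLES = SAMPLES_PER_SECOND * RETENTION_SECONDS

# Average on-disk cost of one sample under these assumptions.
BYTES_PER_SAMPLE = STORAGE_BYTES / TOTAL_SAMPLES
```

Under these assumptions the average comes out well below one byte per sample, which is the point of the shrug on the slide: keeping high-resolution data for weeks is cheap.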

SNMP exporter
● Exports: SNMP data from network devices
● Replaces: Cacti
● Noteworthy:
○ a pain to configure

SNMP exporter - Cacti replacement
Once you have got the right SNMP config, alerts and nice graphs are easy!

Cacti’s killer feature: the weathermap plugin!
https://network-weathermap.com/

There is a diagram panel type in Grafana, but…
… we’re not quite there yet ¯_(ツ)_/¯

Protip III
Build a dedicated long-term Prometheus server:
● Scrape only a few selected metrics
● Yank retention time way up
● Make backups (hot backups possible in Prometheus ≥ 2.1)
Very useful data for estimating e.g. future bandwidth needs!

Collins exporter - Collins?
● https://tumblr.github.io/collins
● Infrastructure management / IPAM
● Server inventory, classification and lifecycle management

Collins exporter
● Exports: asset inventory data from Collins
● Replaces: a bunch of scripts
● Noteworthy:
○ https://github.com/soundcloud/collins_exporter

Collins exporter
● Another candidate for long-term storage
● Valuable data for capacity planning

Protip IV
Build your own integrations!
Collins exporter:
● Written in Go
● 1 source file
● 264 lines total ¯_(ツ)_/¯

IPMI exporter
● Exports: IPMI data retrieved from BMCs
● Replaces: many Nagios/NRPE checks
● Noteworthy:
○ https://github.com/soundcloud/ipmi_exporter
○ Works regardless of the host’s power state

IPMI exporter
● Mostly sensor data: temperature, fans, power consumption
● Mostly used for alerting:
○ Fans
○ Power supplies
○ Batteries

Protip V
Make use of techniques to ingest non-numeric data!*
● Use labels to expose (semi-)static data of interest
*...but do it with some caution!
ipmi_bmc_info{firmware_revision="2.52",manufacturer_id="Dell_Inc"} 1

Protip V
Make use of techniques to ingest non-numeric data!*
● Use labels and binary values to represent state
*...but do it with some caution!
collins_asset_state{tag="ABCD1234",state="Allocated"} 1
collins_asset_state{tag="ABCD1234",state="Maintenance"} 0
collins_asset_state{tag="ABCD1234",state="Unallocated"} 0
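The state encoding above can be generated mechanically: one series per possible state, with value 1 for the current state and 0 for all others. A minimal sketch follows; the function name is made up, and the list of states is limited to the three shown on the slide (a real inventory may have more).

```python
# The three states shown on the slide; a real system may define others.
STATES = ("Allocated", "Maintenance", "Unallocated")


def state_samples(tag, current_state, states=STATES):
    """Return one exposition line per state, 1 for the current one, else 0."""
    if current_state not in states:
        raise ValueError(f"unknown state: {current_state}")
    lines = []
    for state in states:
        value = 1 if state == current_state else 0
        lines.append(f'collins_asset_state{{tag="{tag}",state="{state}"}} {value}')
    return lines
```

Exposing all states, not only the active one, keeps each series continuous over time, so a query such as `collins_asset_state == 1` selects the current state. The caution from the slide applies: every extra label value creates another time series.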

And now: merging data sources
Example: BMC Firmware revisions of certain server types

Query: ipmi_bmc_info{firmware_revision!="2.52"}
Result: ipmi_bmc_info{firmware_revision="2.41",instance="10.1.2.3",...}

Query: collins_asset_details{nodeclass="app-2"}
Result: collins_asset_details{ipmi_address="10.1.2.3",...}

Query: label_replace(ipmi_bmc_info, "ipmi_address", "$1", "instance", "(.*)")
Result: ipmi_bmc_info{firmware_revision="2.41",ipmi_address="10.1.2.3",...}

Query: collins_asset_details{nodeclass="app-2"} *
on (ipmi_address)
group_left(firmware_revision)
label_replace(ipmi_bmc_info{firmware_revision!="2.52"}, "ipmi_address", "$1", "instance", "(.*)")
Result: {firmware_revision="2.41",ipmi_address="10.1.2.3",
nodeclass="app-2",primary_address="10.10.20.30",tag="ABCD1234"}

Query: collins_asset_details{nodeclass="app-2"} *
on (ipmi_address)
group_left(firmware_revision)
label_replace(ipmi_bmc_info{firmware_revision!="2.52"}, "ipmi_address", "$1", "instance", "(.*)")
* on (tag) group_left(status) (collins_asset_status == 1)
Result: {firmware_revision="2.41",ipmi_address="10.1.2.3",
nodeclass="app-2",primary_address="10.10.20.30",tag="ABCD1234",status="Allocated"}
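The join in these queries can be hard to read at first. The following Python sketch emulates the two building blocks on plain label sets: copying `instance` into `ipmi_address` (what the `label_replace` call does with the `(.*)` pattern) and a many-to-one match on `ipmi_address` that pulls `firmware_revision` over (the `on (...) group_left(...)` part). It ignores sample values, which are all 1 here, and is not how Prometheus implements vector matching. The second BMC entry is invented so that the filter has something to drop.

```python
def copy_label(series, dst, src):
    """Emulates label_replace(v, dst, "$1", src, "(.*)") on label sets."""
    return [dict(labels, **{dst: labels.get(src, "")}) for labels in series]


def join_group_left(left, right, on, extra):
    """Emulates `left * on(<on>) group_left(<extra>) right` for label sets.

    Each left-hand series keeps its labels and gains label `extra` from the
    single right-hand series with the same value for label `on`.
    """
    index = {}
    for labels in right:
        key = labels[on]
        if key in index:
            raise ValueError("right-hand side must be unique per join key")
        index[key] = labels
    joined = []
    for labels in left:
        match = index.get(labels[on])
        if match is None:
            continue  # no partner: the series drops out, as in PromQL
        joined.append(dict(labels, **{extra: match[extra]}))
    return joined


# collins_asset_details{nodeclass="app-2"}, labels taken from the slide.
COLLINS = [{"ipmi_address": "10.1.2.3", "nodeclass": "app-2",
            "primary_address": "10.10.20.30", "tag": "ABCD1234"}]
# ipmi_bmc_info; the second entry is hypothetical.
IPMI = [{"firmware_revision": "2.41", "instance": "10.1.2.3"},
        {"firmware_revision": "2.52", "instance": "10.1.2.4"}]

OUTDATED = [s for s in IPMI if s["firmware_revision"] != "2.52"]
RESULT = join_group_left(COLLINS, copy_label(OUTDATED, "ipmi_address", "instance"),
                         on="ipmi_address", extra="firmware_revision")
```

`RESULT` then holds the single merged label set shown as the first query result above; the extra join on `tag` for the asset status works the same way with a different key and label.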

Where we are now...
[Tool logos from "Where we started...", most now crossed out; surviving text: "& NRPE", "CloudWatch", "Cacti"]

What’s paging you at night?
Collection Visualization Alerting
CloudWatch ✔ ✔ ✔
Graphite (✔)
Prometheus ✔
Grafana ✔
Alertmanager ✔

What’s up with this CloudWatch thing?
● There is a CloudWatch exporter
● However, CloudWatch’s internal architecture is fundamentally
incompatible with Prometheus
● Using CloudWatch as a Grafana data source can incur costs

Why was it worth it?
● Many integrations readily available
● New ones are easy to write
● Quality and quantity of monitoring have increased
● Monitoring and alerting have become much more consistent
● Easy to merge data sources for alerting or graphing
This is true across the entire organization, not just infrastructure!

Soon: long term storage
● Not a primary concern for Prometheus
● Simple solution: a dedicated long-term Prometheus server (see Protip III)
● Remote (read/)write interface
● Some features in Prometheus 2.0 to allow external solutions
○ Check out e.g. Thanos: https://github.com/improbable-eng/thanos

Soon: forging a standard?
OpenMetrics working group
● https://github.com/RichiH/OpenMetrics
