Hardware-level data-center
monitoring with Prometheus
Conrad Hoffmann
Outline
I. Our data-center
II. Brief intro to Prometheus
III. All my exporters
IV. TL;DR & Soon™
I. Our data-center
AMS5
OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad Hoffmann
2118 servers
56 racks
200 network devices
2 * 2 generic uplinks
3 AWS Direct Connect
3 Google X-Connect
Where we started...
[Slide of tool logos; the only text labels that survive are "& NRPE", "Cloud Watch" and "Cacti"]
What’s paging you at night?
Collection Visualization Alerting
Cacti ✔ ✔ ✔
CloudWatch ✔ ✔ ✔
Ganglia ✔
Graphite ✔ ✔
Icinga/Nagios ✔ ✔ ✔
Smokeping ✔ ✔ ✔
Statsd ✔
https://xkcd.com/927/
II. Brief intro to Prometheus
prometheus.io
The Promise of Prometheus
Prometheus is a reliable, scalable, flexible monitoring and
alerting system that is easy to integrate and focused on real
time metrics.
Prometheus: reliability
● Pull-based (“scrape”)
● List of known targets
○ Can be dynamic, e.g. DNS or service discovery
● Built-in meta-monitoring
● Redundancy is easy
Prometheus: scalability
● Performant, efficient storage
● Scales well to available resources
● Easy to scale horizontally
● Federation
Prometheus: flexibility
● Multi-dimensional, label-based data model
● Each data point is defined by
○ A metric name
○ An arbitrary number of key-value pairs (labels)
○ A value
○ A timestamp (added by Prometheus)
● Data points with identical metric names and labels form a time series
● Powerful query language allows for easy aggregation based on labels
Target exposes:
http_responses_total{backend="foo",code="2xx"} 804
http_responses_total{backend="foo",code="4xx"} 3170
http_responses_total{backend="bar",code="2xx"} 6637
http_responses_total{backend="bar",code="4xx"} 26
Possible query:
sum(http_responses_total{backend="foo"})
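To make the query concrete, here is a minimal Python sketch of what the aggregation does with the four samples above. The parser is a simplification written for this illustration (it assumes well-formed lines and no escaped characters in label values); it is not how Prometheus itself evaluates queries.

```python
import re

EXPOSITION = '''\
http_responses_total{backend="foo",code="2xx"} 804
http_responses_total{backend="foo",code="4xx"} 3170
http_responses_total{backend="bar",code="2xx"} 6637
http_responses_total{backend="bar",code="4xx"} 26
'''

# Simplified patterns: metric name, label block, value (no timestamps, no escapes).
LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Parse simple exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            name, raw_labels, value = match.groups()
            samples.append((name, dict(LABEL_RE.findall(raw_labels)), float(value)))
    return samples


def sum_matching(samples, name, **matchers):
    """Sum all samples of `name` whose labels equal every given matcher."""
    return sum(
        value
        for sample_name, labels, value in samples
        if sample_name == name
        and all(labels.get(key) == wanted for key, wanted in matchers.items())
    )


samples = parse(EXPOSITION)
foo_total = sum_matching(samples, "http_responses_total", backend="foo")
print(foo_total)  # 804 + 3170 = 3974.0
```

The same helper with `code="4xx"` instead of `backend="foo"` would add up the 4xx responses of both backends, which is the kind of label-based slicing the data model is designed for.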
Prometheus: ease of integration
● Data format is text based
● Scrapes are HTTP requests
● Many integrations exist already
● Excellent tooling/libraries to write new ones
[Diagrams: an application; a host running the node exporter; a host running the SNMP exporter, with Router A and Router B reached over the network]
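Because the format is plain text and a scrape is just an HTTP GET, a scrape target can be sketched with nothing but the Python standard library. The metric name `demo_queue_length`, its label and the port are hypothetical values chosen for this illustration; a real integration would more likely use one of the official client libraries.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics():
    """Return the current metrics in the Prometheus text format."""
    # Hypothetical value; a real application would report its own state.
    queue_length = 3
    lines = [
        "# HELP demo_queue_length Number of items waiting in the queue.",
        "# TYPE demo_queue_length gauge",
        f'demo_queue_length{{queue="default"}} {queue_length}',
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answer GET /metrics with the rendered text, everything else with 404."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve metrics until interrupted (the port is an arbitrary choice)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


print(render_metrics(), end="")
```

Calling `serve()` and pointing a scrape job at `http://<host>:8000/metrics` would be enough for Prometheus to start collecting the gauge.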
Alertmanager
Nomen est omen...
● Alerting
● Silencing
● Alert grouping & routing
● High availability
Grafana
grafana.com
Displays data from many sources:
● Prometheus
● Graphite
● Influx
● OpenTSDB
● Elasticsearch
● MySQL/Postgres
● CloudWatch
● ...
III. All my exporters
Now with Protips!
Node exporter
● Exports: OS- and hardware-level metrics for running systems
● Replaces: Ganglia, some Icinga/NRPE checks
● Noteworthy:
○ Comes with many collectors built-in
○ Use WMI exporter on Windows
Protip I
Use the node exporter’s text file collector as an easy integration point for
custom metrics!
Examples: Chef data, RAID controller data, SMART data, cron jobs, ...
[Diagram: on the host, a script writes a text file that the node exporter reads]
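A script feeding the text file collector could look roughly like the sketch below. It assumes the node exporter was started with `--collector.textfile.directory` pointing at the target directory and that it picks up files ending in `.prom`; the metric name and the temporary demo directory are made up for this example.

```python
import os
import tempfile
import time


def write_textfile_metrics(directory, filename, lines):
    """Atomically write metric lines to <directory>/<filename>."""
    body = "\n".join(lines) + "\n"
    # Write to a temporary file in the same directory and rename it into
    # place, so the node exporter never reads a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(body)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
        os.replace(tmp_path, os.path.join(directory, filename))
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


# Demo: a temporary directory stands in for the collector directory.
demo_dir = tempfile.mkdtemp()
write_textfile_metrics(demo_dir, "backup.prom", [
    "# HELP backup_last_success_timestamp_seconds Unix time of the last successful backup.",
    "# TYPE backup_last_success_timestamp_seconds gauge",
    f"backup_last_success_timestamp_seconds {int(time.time())}",
])
with open(os.path.join(demo_dir, "backup.prom")) as result:
    print(result.read(), end="")
```

Run from cron at the end of a job, a script like this turns "did the job succeed, and when?" into an ordinary gauge that can be graphed and alerted on.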
Blackbox exporter
● Exports: data about probes against endpoints that don’t support
Prometheus natively (DNS, HTTP(S), ICMP, TCP)
● Replaces: Smokeping, some Icinga checks
● Noteworthy:
○ Monitor TLS certificate expiry :)
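For TLS probes the blackbox exporter exposes `probe_ssl_earliest_cert_expiry`, a Unix timestamp, so an expiry check boils down to simple arithmetic (in PromQL, something like `probe_ssl_earliest_cert_expiry - time() < 86400 * 30`). The sketch below shows that arithmetic in Python; the 30-day threshold and the sample timestamps are assumptions for illustration only.

```python
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(expiry_timestamp, now=None):
    """Days between `now` (default: current time) and the certificate expiry."""
    now = time.time() if now is None else now
    return (expiry_timestamp - now) / SECONDS_PER_DAY


def expires_soon(expiry_timestamp, threshold_days=30, now=None):
    """True if the certificate expires in fewer than `threshold_days` days."""
    return days_until_expiry(expiry_timestamp, now) < threshold_days


# Fixed example timestamps so the result is reproducible.
now = 1_528_000_000
expiry = now + 10 * SECONDS_PER_DAY
print(days_until_expiry(expiry, now=now))  # 10.0
print(expires_soon(expiry, now=now))       # True
```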
Blackbox exporter - Smokeping replacement
1. Send ICMP probe every five seconds
2. Alert on target down and packet loss
ALERT SmokepingTargetDown
  IF probe_success{job="smokeping"} == 0
  FOR 2m
ALERT SmokepingTargetPacketLoss
  IF 100 * (1 - avg_over_time(probe_success{job="smokeping"}[2m])) > 20
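The packet-loss expression works because `probe_success` is 1 or 0 per probe, so its average over a window is the success ratio. A small Python sketch of the same calculation, using a made-up window of 24 results (two minutes of probes at the five-second interval mentioned above):

```python
def packet_loss_percent(probe_results):
    """100 * (1 - average of 0/1 probe results), as in the alert expression."""
    if not probe_results:
        raise ValueError("need at least one probe result")
    return 100 * (1 - sum(probe_results) / len(probe_results))


# Hypothetical window: 18 successful probes and 6 failed ones.
window = [1] * 18 + [0] * 6
loss = packet_loss_percent(window)
print(loss)       # 25.0
print(loss > 20)  # True, so SmokepingTargetPacketLoss would fire
```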
3. Use Prometheus aggregation functions in Grafana
Protip II
Scrape more, scrape faster!
● ~ 1M metrics
● > 5000 targets
● Mostly 10s scrape interval, some 5s, some longer
● 50 days retention time
● 250 GB storage ¯_(ツ)_/¯
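As a rough sanity check, the numbers above can be turned into a bytes-per-sample estimate. This is back-of-the-envelope arithmetic under stated assumptions: "~ 1M metrics" is read as one million active time series, every series is scraped every 10 seconds, and "250 GB" means decimal gigabytes. The real mix of 5s, 10s and longer intervals will shift the result.

```python
SERIES = 1_000_000             # "~ 1M metrics", read here as active time series
SCRAPE_INTERVAL_S = 10         # "mostly 10s scrape interval"
RETENTION_DAYS = 50            # "50 days retention time"
STORAGE_BYTES = 250 * 10**9    # "250 GB", decimal gigabytes assumed

samples_per_series = RETENTION_DAYS * 86400 // SCRAPE_INTERVAL_S
total_samples = SERIES * samples_per_series
bytes_per_sample = STORAGE_BYTES / total_samples

print(samples_per_series)          # 432000
print(total_samples)               # 432000000000
print(round(bytes_per_sample, 2))  # 0.58
```

Under these assumptions each stored sample costs well under one byte on disk, which is why scraping more, and more often, is affordable.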
SNMP exporter
● Exports: SNMP data from network devices
● Replaces: Cacti
● Noteworthy:
○ a pain to configure
SNMP exporter - Cacti replacement
Once you have got the right SNMP config, alerts and nice graphs are easy!
Cacti’s killer feature: the weathermap plugin!
https://network-weathermap.com/
There is a diagram panel type in Grafana, but…
… we’re not quite there yet ¯_(ツ)_/¯
Protip III
Build a dedicated long-term Prometheus server:
● Scrape only a few selected metrics
● Yank retention time way up
● Make backups (hot backups possible in Prometheus 2.1 and later)
Very useful data for estimating e.g. future bandwidth needs!
Collins exporter - Collins?
● https://tumblr.github.io/collins
● Infrastructure management / IPAM
● Server inventory, classification and lifecycle management
Collins exporter
● Exports: asset inventory data from Collins
● Replaces: a bunch of scripts
● Noteworthy:
○ https://github.com/soundcloud/collins_exporter
● Another candidate for long-term storage
● Valuable data for capacity planning
Protip IV
Build your own integrations!
Collins exporter:
● Written in Go
● 1 source file
● 264 lines total ¯_(ツ)_/¯
IPMI exporter
● Exports: IPMI data retrieved from BMCs
● Replaces: many Nagios/NRPE checks
● Noteworthy:
○ https://github.com/soundcloud/ipmi_exporter
○ Works regardless of the host’s power state
● Mostly sensor data: temperature, fans, power consumption
● Mostly used for alerting:
○ Fans
○ Power supplies
○ Batteries
Protip V
Make use of techniques to ingest non-numeric data!*
● Use labels to expose (semi-)static data of interest
ipmi_bmc_info{firmware_revision="2.52",manufacturer_id="Dell_Inc"} 1
● Use labels and binary values to represent state
collins_asset_state{tag="ABCD1234",state="Allocated"} 1
collins_asset_state{tag="ABCD1234",state="Maintenance"} 0
collins_asset_state{tag="ABCD1234",state="Unallocated"} 0
*...but do it with some caution!
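The state encoding can be sketched in a few lines of Python: one series per possible state, with value 1 for the current state and 0 for all others. The helper names are hypothetical and the list of states is just the three shown on the slide; this is an illustration of the pattern, not code taken from the Collins exporter.

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline for use in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_state_metric(name, tag, current_state, all_states):
    """Return one exposition line per state: 1 for the current state, else 0."""
    if current_state not in all_states:
        raise ValueError(f"unknown state: {current_state!r}")
    return [
        f'{name}{{tag="{escape_label_value(tag)}",'
        f'state="{escape_label_value(state)}"}} {int(state == current_state)}'
        for state in all_states
    ]


STATES = ["Allocated", "Maintenance", "Unallocated"]
for line in render_state_metric("collins_asset_state", "ABCD1234", "Allocated", STATES):
    print(line)
```

Exposing every state, rather than only the active one, keeps the set of series stable over time; the caution in the footnote applies because each extra label value is an extra time series.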
And now: merging data sources
Example: BMC Firmware revisions of certain server types
Query: ipmi_bmc_info{firmware_revision!="2.52"}
Result: ipmi_bmc_info{firmware_revision="2.41",instance="10.1.2.3",...}
Query: collins_asset_details{nodeclass="app-2"}
Result: collins_asset_details{ipmi_address="10.1.2.3",...}
Query: label_replace(ipmi_bmc_info, "ipmi_address", "$1", "instance", "(.*)")
Result: ipmi_bmc_info{firmware_revision="2.41",ipmi_address="10.1.2.3",...}
Query: collins_asset_details{nodeclass="app-2"} *
on (ipmi_address)
group_left(firmware_revision)
label_replace(ipmi_bmc_info{firmware_revision!="2.52"}, "ipmi_address", "$1", "instance", "(.*)")
Result: {firmware_revision="2.41",ipmi_address="10.1.2.3",
nodeclass="app-2",primary_address="10.10.20.30",tag="ABCD1234"}
Query: collins_asset_details{nodeclass="app-2"} *
on (ipmi_address)
group_left(firmware_revision)
label_replace(ipmi_bmc_info{firmware_revision!="2.52"}, "ipmi_address", "$1", "instance", "(.*)")
* on (tag) group_left(status) (collins_asset_status == 1)
Result: {firmware_revision="2.41",ipmi_address="10.1.2.3",
nodeclass="app-2",primary_address="10.10.20.30",tag="ABCD1234",status="Allocated"}
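For readers less familiar with vector matching, the join can be mimicked on plain Python dictionaries. This sketch only models the label handling for info-style series whose value is 1 (so the multiplication itself is a no-op); the second asset and its addresses are invented sample data, and the helper functions are illustrations rather than anything Prometheus provides.

```python
def label_replace_copy(series, dst_label, src_label):
    """Mimic label_replace(v, dst, "$1", src, "(.*)"): copy src into dst."""
    return [dict(labels, **{dst_label: labels[src_label]}) for labels in series]


def join_group_left(left, right, on, copy_labels):
    """Mimic `left * on(<on>) group_left(<copy_labels>) right`.

    The right side must be unique per `on` value; left-hand series without
    a match are dropped, as in a PromQL binary operation.
    """
    right_by_key = {}
    for labels in right:
        key = labels[on]
        if key in right_by_key:
            raise ValueError(f"duplicate right-hand series for {on}={key!r}")
        right_by_key[key] = labels
    joined = []
    for labels in left:
        match = right_by_key.get(labels.get(on))
        if match is not None:
            extra = {name: match[name] for name in copy_labels if name in match}
            joined.append({**labels, **extra})
    return joined


collins_asset_details = [
    {"ipmi_address": "10.1.2.3", "nodeclass": "app-2",
     "primary_address": "10.10.20.30", "tag": "ABCD1234"},
    {"ipmi_address": "10.1.2.4", "nodeclass": "app-2",  # invented second asset
     "primary_address": "10.10.20.31", "tag": "EFGH5678"},
]
ipmi_bmc_info = [
    {"firmware_revision": "2.41", "instance": "10.1.2.3"},
    {"firmware_revision": "2.52", "instance": "10.1.2.4"},
]
collins_asset_status = [  # only the series whose value is 1
    {"tag": "ABCD1234", "status": "Allocated"},
    {"tag": "EFGH5678", "status": "Allocated"},
]

outdated = [s for s in ipmi_bmc_info if s["firmware_revision"] != "2.52"]
outdated = label_replace_copy(outdated, "ipmi_address", "instance")
step1 = join_group_left(collins_asset_details, outdated,
                        on="ipmi_address", copy_labels=["firmware_revision"])
result = join_group_left(step1, collins_asset_status,
                         on="tag", copy_labels=["status"])
print(result)
```

Only the asset with the outdated firmware survives both joins, and it carries the labels of all three sources, matching the result shown above.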
Where we are now...
[The same slide of tool logos, now overlaid with seven ✘ marks; surviving text labels: "& NRPE", "Cloud Watch", "Cacti"]
What’s paging you at night?
Collection Visualization Alerting
CloudWatch ✔ ✔ ✔
Graphite (✔)
Prometheus ✔
Grafana ✔
Alertmanager ✔
What’s up with this CloudWatch thing?
● There is a CloudWatch exporter
● However, CloudWatch’s internal architecture is fundamentally incompatible with Prometheus
● Using CloudWatch as Grafana data source can incur costs
IV. TL;DR & Soon™
So, is it working?
● Yes
Was it worth it?
● Yes
Why was it worth it?
● Many integrations readily available
● New ones are easy to write
● Quality and quantity of monitoring have increased
● Monitoring and alerting have become much more consistent
● Easy to merge data sources for alerting or graphing
This is true across the entire organization, not just infrastructure!
Soon: long term storage
● Not a primary concern for Prometheus
● Simple solution as explained
● Remote (read/)write interface
● Some features in Prometheus 2.0 to allow external solutions
○ Check out e.g. Thanos: https://github.com/improbable-eng/thanos
Soon: forging a standard?
OpenMetrics working group
● https://github.com/RichiH/OpenMetrics
This is the end...
Thank you!
