Fact-based Monitoring 
PuppetConf 2014
Alexis Lê-Quôc @alq
Alexis Lê-Quôc, @alq 
CTO at Datadog
Poll: Monitoring makes me… 
happy 
proud 
cry 
want to hide
Puppet brings Automation to 
Systems Management
Improve 
Monitoring 
the way Puppet has 
improved 
Systems Management
“The good old days” 
• Your “CMDB” was Excel 
• SSH in and hack away 
• Little time for anything else
Then Puppet came… 
• Expressive rules that capture the expected result
• Using facts and classifiers (a.k.a. metadata) to figure out where to
apply changes
• That freed up a lot of our time* 
* on a per-machine basis
“Puppet brings immunity of configuration to change in 
infrastructure” 
–Me (just now)
I have seen this before…
“[SQL brings] immunity of application to change in storage 
structure and access strategy” 
–C.J. Date (1977) 
http://www.cs.berkeley.edu/~brewer/cs262/SystemR.pdf
SQL 
• 1974 IBM introduces System R and its Structured Query Language 
• Expressive rules that capture the expected result
• Using facts and predicates (a.k.a. metadata) to figure out what data
to get
• That freed up a lot of development time
SQL 
• From a time-consuming, imperative mess (“how”) 
• … to expressive data queries (“what”) 
SQL query 
SELECT (desired facts) 
FROM (existing facts) 
WHERE (matching criteria)
Puppet 
• From a time-consuming, imperative mess (“how”) 
• … to expressive configuration queries (“what”) 
puppet apply 
CHANGE (desired facts) 
FROM (existing puppet facts) 
WHERE (matching puppet classes)
Is there a pattern?
“Break free from ever more complex naming conventions for 
hostnames as a means of identity. Use a very rich set of meta 
data provided by each machine to address them.” 
–MCollective overview
MCollective 
• From a time-consuming, imperative mess (“how”) 
• … to expressive orchestration queries (“what”) 
mco rpc service restart service=nginx -F webpool=A
EXEC (desired actions) 
FROM (existing puppet facts) 
WHERE (matching puppet classes)
Back to monitoring 
• Monitoring is to behavior what Puppet is to configuration 
• Monitoring is to behavior what MCollective is to orchestration
Monitoring 
• From a time-consuming, imperative mess (“how”) 
• … to expressive monitoring queries (“what”) 
Monitoring query 
MONITOR (desired behavior) 
FROM (existing heartbeats/metrics) 
WHERE (matching puppet facts)
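The MONITOR / FROM / WHERE shape can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory `NODES` inventory, the fact names, and the helpers `where` and `monitor` are hypothetical and do not come from Puppet or any monitoring tool.

```python
# Hypothetical inventory: each node carries its facts plus one latest metric value.
NODES = [
    {"facts": {"role": "web", "environment": "production", "datacenter": "ABC"}, "latency_ms": 120},
    {"facts": {"role": "web", "environment": "production", "datacenter": "XYZ"}, "latency_ms": 340},
    {"facts": {"role": "db", "environment": "production", "datacenter": "ABC"}, "latency_ms": 15},
]


def where(nodes, **criteria):
    """WHERE: keep the nodes whose facts match every given criterion."""
    return [n for n in nodes
            if all(n["facts"].get(k) == v for k, v in criteria.items())]


def monitor(nodes, metric, predicate):
    """MONITOR: True when every selected node's metric satisfies the predicate."""
    return all(predicate(n[metric]) for n in nodes)


# FROM the existing metrics, WHERE the facts match, MONITOR the desired behavior.
selected = where(NODES, role="web", environment="production", datacenter="ABC")
ok = monitor(selected, "latency_ms", lambda v: v < 200)
print(len(selected), ok)
```

No hostname appears anywhere in the query: adding or removing a web server in datacenter ABC changes the selection without touching the rule.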
Examples 
• “All provisioned web servers in the production environment, 
datacenter ABC must respond to queries within 200ms” 
• “All PostgreSQL servers must have a postgres: bgwriter process 
running” 
• “At least one ActiveMQ server is up to support MCollective”
• Never mention a hostname
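The examples above differ in their quantifier: two say "all", one says "at least one". A rough Python sketch of rules that carry a fact selector, a check, and a quantifier follows. The `Node` and `Rule` types, the `class` fact, and the sample data are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Node:
    facts: Dict[str, str]   # facts reported by the node (hypothetical names)
    processes: List[str]    # process titles seen on the node
    up: bool                # latest heartbeat status


@dataclass
class Rule:
    description: str
    selector: Dict[str, str]        # facts a node must have to be in scope
    check: Callable[[Node], bool]   # behavior expected of a node in scope
    quantifier: Callable = all      # all = every node, any = at least one

    def evaluate(self, nodes: List[Node]) -> bool:
        in_scope = [n for n in nodes
                    if all(n.facts.get(k) == v for k, v in self.selector.items())]
        return self.quantifier(self.check(n) for n in in_scope)


NODES = [
    Node({"class": "postgres"}, ["postgres: bgwriter process"], True),
    Node({"class": "activemq"}, [], False),
    Node({"class": "activemq"}, [], True),
]

RULES = [
    Rule("every PostgreSQL server runs a bgwriter process",
         {"class": "postgres"},
         lambda n: any(p.startswith("postgres: bgwriter") for p in n.processes)),
    Rule("at least one ActiveMQ server is up",
         {"class": "activemq"},
         lambda n: n.up,
         quantifier=any),
]

results = {r.description: r.evaluate(NODES) for r in RULES}
for description, passed in results.items():
    print(description, "->", "OK" if passed else "FAIL")
```

One ActiveMQ node being down does not fail the second rule, because the rule is about the pool selected by facts rather than about any single host.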
Hosts are not the center of the 
monitoring universe. 
Facts are! 
Hosts are just places where facts occur.
The proof is in the pudding…
Hosts at the center of the universe 
a.k.a. the Wrong Way
“It's fairly straightforward, so hopefully you find things easy to
understand…” 
–Nagios Core 4 manual on monitoring clusters
Host-centric: Monitor a DNS cluster 
check_command check_service_cluster!"DNS Cluster"!0!1!$SERVICESTATEID:host1:DNS Service$,$SERVICESTATEID:host2:DNS Service$,$SERVICESTATEID:host3:DNS Service$
Where do host1, host2, host3 come from?
Host-centric: can’t use facts directly 
• “Host groups solve this problem”. No, they don’t. 
• Combinatorial explosion, e.g. trivially 
• 4 data centers (us-1, us-2, eu, apac) 
• 5 classes (web, db, cache, appserver, hadoop) 
• 3 environments (test, staging, prod) 
• => up to 119 materialized host groups
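One way to arrive at 119, assuming the count includes every group that fixes at least one of the three dimensions and leaves the others open: (4+1) × (5+1) × (3+1) − 1 = 119. The short Python sketch below enumerates them; the `ANY` marker and the `host_groups` helper are illustrative names only.

```python
from itertools import product

DATACENTERS = ["us-1", "us-2", "eu", "apac"]
CLASSES = ["web", "db", "cache", "appserver", "hadoop"]
ENVIRONMENTS = ["test", "staging", "prod"]
ANY = None  # marks a dimension left unconstrained


def host_groups():
    """Every group that pins at least one dimension to a concrete value."""
    groups = []
    for dc, cls, env in product(DATACENTERS + [ANY],
                                CLASSES + [ANY],
                                ENVIRONMENTS + [ANY]):
        if (dc, cls, env) == (ANY, ANY, ANY):
            continue  # "all hosts" is not a meaningful group
        groups.append((dc, cls, env))
    return groups


groups = host_groups()
fully_specified = [g for g in groups if ANY not in g]
print(len(groups), len(fully_specified))  # 119 60
```

Adding a fourth dimension with only two values (say, a hypothetical tier fact) would triple the first factor set again, to 5 × 6 × 4 × 3 − 1 = 359 groups, which is the explosion the slide refers to.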
Nagios-bashing? 
• No! 
• Same fatal flaw with all host-centric monitoring tools 
• Host-centric monitoring forces an extra, expensive step: 
• replicate fact-based conditionals in host-centric templates
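The extra step can be made concrete with a small Python sketch that renders the DNS cluster check shown earlier from a fact-based inventory. The `INVENTORY` dict, the `role` fact, and the `cluster_check_command` helper are hypothetical; only the output format is taken from the Nagios example above.

```python
def cluster_check_command(label, service, hosts, warn, crit):
    """Render a check_service_cluster argument string for the given hosts."""
    states = ",".join(f"$SERVICESTATEID:{h}:{service}$" for h in hosts)
    return f'check_service_cluster!"{label}"!{warn}!{crit}!{states}'


# Hypothetical fact inventory; a host-centric tool never sees these facts directly.
INVENTORY = {
    "host1": {"role": "dns"},
    "host2": {"role": "dns"},
    "host3": {"role": "dns"},
    "host4": {"role": "web"},
}

# The fact-based conditional has to be evaluated here, outside the monitoring tool,
# and its result frozen into a list of hostnames.
dns_hosts = sorted(h for h, facts in INVENTORY.items() if facts.get("role") == "dns")
command = cluster_check_command("DNS Cluster", "DNS Service", dns_hosts, 0, 1)
print(command)
```

Every time a DNS server is added or retired, this rendering step has to run again and the configuration has to be reloaded, which is the cost a fact-based query avoids.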
“Please note that this module is not for the faint of heart. Even I 
(the author) have my head hurt each time I have to make 
modifications to it…” 
–puppet-nagios author
Facts at the center of the universe 
a.k.a. the Right Way
"De Revolutionibus manuscript p9b" by Nicolas Copernicus - www.bj.uj.edu.pl. Licensed under Public domain via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:De_Revolutionibus_manuscript_p9b.jpg
Earlier Examples 
• “All provisioned web servers in the production environment, 
datacenter ABC must respond to queries within 200ms” 
• “All PostgreSQL servers must have a postgres: bgwriter process 
running” 
• “At least one ActiveMQ server is up to support MCollective”
In Sensu (heartbeats) 
• “All PostgreSQL servers must have a postgres: bgwriter process 
running” 
class postgres::monitoring::sensu { 
sensu::subscription { 'postgres': } 
} 
• Monitoring using a fact-based query 
• Is node of class “postgres” and subscribed to “postgres” or not? 
• If so, it will execute the postgres check
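The matching described above can be modeled in a few lines of Python. This is a simplified sketch of subscription matching, not Sensu code: the `CHECKS` table, the check names, and `checks_for` are assumptions chosen for illustration.

```python
# Hypothetical check definitions: each check lists the subscriptions it targets.
CHECKS = {
    "postgres_bgwriter": {"subscribers": ["postgres"]},
    "nginx_latency": {"subscribers": ["web"]},
}


def checks_for(client_subscriptions, checks=CHECKS):
    """Return the names of the checks a client with these subscriptions would run."""
    subs = set(client_subscriptions)
    return sorted(name for name, check in checks.items()
                  if subs & set(check["subscribers"]))


# A node classified as "postgres" by Puppet ends up subscribed to "postgres".
print(checks_for(["postgres"]))         # ['postgres_bgwriter']
print(checks_for(["postgres", "web"]))  # ['nginx_latency', 'postgres_bgwriter']
```

The check definition never names a host. Which nodes run it is decided entirely by the Puppet class that declares the subscription.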
In Datadog (metrics) 
• “All provisioned web servers in the production environment, 
datacenter ABC must respond to queries within 200ms” 
$ puppet module install datadog-datadog_agent 
class { 'datadog_agent':
  api_key       => …,
  tags          => [$environment],
  facts_to_tags => ["datacenter"],
}
include datadog_agent::integrations::nginx
In Datadog (metrics) 
• Monitoring using a fact-based query 
• Puppet facts directly reused 
max(nginx.request.latency{production,datacenter:ABC}) < 200
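To show what such a query computes, here is a rough Python model that filters tagged series by scope and takes the maximum. It is not the Datadog API: the `SERIES` sample data and the `max_over` helper are hypothetical, and values are assumed to be in milliseconds.

```python
# Hypothetical latest values, one per reporting node, tagged from Puppet facts.
SERIES = [
    {"metric": "nginx.request.latency", "tags": {"production", "datacenter:ABC"}, "value": 120.0},
    {"metric": "nginx.request.latency", "tags": {"production", "datacenter:ABC"}, "value": 180.0},
    {"metric": "nginx.request.latency", "tags": {"staging", "datacenter:ABC"}, "value": 950.0},
    {"metric": "nginx.request.latency", "tags": {"production", "datacenter:XYZ"}, "value": 400.0},
]


def max_over(series, metric, scope):
    """Max of a metric across all series carrying every tag in scope, or None."""
    values = [s["value"] for s in series
              if s["metric"] == metric and set(scope) <= s["tags"]]
    return max(values) if values else None


worst = max_over(SERIES, "nginx.request.latency", {"production", "datacenter:ABC"})
# Alert when the condition "max(...) < 200" does not hold.
alert = worst is not None and not (worst < 200)
print(worst, alert)  # 180.0 False
```

The staging node and the node in the other datacenter are slow, but they fall outside the scope, so they do not affect the result.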
What to take away
Fact-based monitoring 
1. Hosts are not at the center of the monitoring universe 
2. Expressive monitoring uses queries 
3. Monitoring queries should use Puppet facts
Thank you!
