Beyond Nagios


      NYC DevOps 2011/07/21
Alexis Lê-Quôc - alq@datadoghq.com
What I’m Going To Talk About

    • Super-quick Nagios summary
    • Monitoring/Alerting Pathologies
    • How to fix it
What Is Nagios

• “Industry Standard in IT Infrastructure Monitoring”
  • For once it’s true...
• Scheduler & Notification server
(+) Robust, Mature code-base

(-) Configuration can be daunting

(-) Not human-friendly
“OVERWHELMING”
A “NORMAL” HOUR
THE “OTHER” NAGIOS UI
[Diagram: a cycle of "Receive alerts", "Process alerts & fix things", "Add more checks"]
THE HAPPY START
[Diagram: a cycle of "Ignore alerts", "Missed alerts", "Add more checks"]
THE SPIRAL OF DEATH
[Plot: quality of life (y-axis) vs. # of alerts (x-axis). Labeled points: "Few checks, few alerts"; "More checks, too many alerts"]
FIGHT OR FLIGHT
[Plot: effective coverage (y-axis) vs. scale and complexity (x-axis). Labeled points: "Few checks, few alerts, every host counts"; "More checks, too many alerts, every host still counts"; "Checks n^2, fault-tolerant, less urgency"]
THE TROUGH OF DESPAIR
[Plot: effective coverage (y-axis) vs. scale (x-axis)]
IF ONLY I ADDED MORE CHECKS...
Reset!
Way Out
‣Breathe!
‣Measure
‣Look for Patterns
‣Put Alerts in Context
‣Focus on the Business
Turn Nagios logs into structured data

Analyze

day                 | success_pct | warning_pct | error_pct | events
--------------------+-------------+-------------+-----------+--------
2011-07-12 00:00:00 |          89 |           0 |         2 |   9628
2011-07-13 00:00:00 |          90 |           0 |         2 |   9210
2011-07-14 00:00:00 |          90 |           0 |         2 |   9735
2011-07-15 00:00:00 |          89 |           0 |         2 |   9531




                    MEASURE
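
The "turn logs into structured data, then measure" step could look roughly like the sketch below. It assumes the standard Nagios log lines of the form `[epoch] SERVICE ALERT: host;service;state;state_type;attempt;output` and `[epoch] HOST ALERT: host;state;state_type;attempt;output`. The function names `parse_alert` and `daily_summary` are made up for illustration, and so is the grouping of states into success, warning and error: the slides do not say how those columns were computed.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timezone

# Matches alert lines such as:
# [1310428800] SERVICE ALERT: web01;HTTP;CRITICAL;HARD;3;Connection refused
ALERT_RE = re.compile(
    r"^\[(?P<ts>\d+)\] (?P<kind>SERVICE|HOST) ALERT: (?P<fields>.*)$"
)


def parse_alert(line):
    """Return a dict for one alert line, or None if the line is not an alert."""
    m = ALERT_RE.match(line.strip())
    if m is None:
        return None
    parts = m.group("fields").split(";")
    if m.group("kind") == "SERVICE":
        if len(parts) < 6:
            return None
        host, service, state, state_type, attempt = parts[:5]
        output = ";".join(parts[5:])  # plugin output may itself contain ';'
    else:
        if len(parts) < 5:
            return None
        host, state, state_type, attempt = parts[:4]
        service = None  # host alerts carry no service name
        output = ";".join(parts[4:])
    return {
        "time": datetime.fromtimestamp(int(m.group("ts")), tz=timezone.utc),
        "host": host,
        "service": service,
        "state": state,
        "state_type": state_type,
        "attempt": int(attempt) if attempt.isdigit() else None,
        "output": output,
    }


def daily_summary(records):
    """Aggregate parsed alerts per UTC day into integer percentages.

    Assumed grouping (not from the talk): OK/UP count as success,
    WARNING as warning, CRITICAL/DOWN as error. Other states
    (e.g. UNKNOWN) only count toward the total.
    """
    per_day = defaultdict(Counter)
    for r in records:
        per_day[r["time"].date()][r["state"]] += 1
    rows = []
    for day in sorted(per_day):
        counts = per_day[day]
        total = sum(counts.values())
        rows.append({
            "day": day.isoformat(),
            "success_pct": 100 * (counts["OK"] + counts["UP"]) // total,
            "warning_pct": 100 * counts["WARNING"] // total,
            "error_pct": 100 * (counts["CRITICAL"] + counts["DOWN"]) // total,
            "events": total,
        })
    return rows
```

Feeding every line of a log file through `parse_alert`, dropping the `None` results, and passing the rest to `daily_summary` yields rows shaped like the table above. Because states outside the three groups still count toward `events`, the percentages need not add up to 100.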
day                 | success_pct | warning_pct | error_pct | events
--------------------+-------------+-------------+-----------+--------
2011-07-12 00:00:00 |          89 |           0 |         2 |   9628
2011-07-13 00:00:00 |          90 |           0 |         2 |   9210
2011-07-14 00:00:00 |          90 |           0 |         2 |   9735
2011-07-15 00:00:00 |          89 |           0 |         2 |   9531




VISUALIZATION MATTERS
In Time
Flapping
LOOK FOR PATTERNS
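
One way to surface the flapping pattern from structured alert data is to count state changes per check inside a sliding time window. The sketch below is a simple heuristic written for illustration, not Nagios's built-in flap detection; the function name `find_flapping`, the one-hour window and the threshold of four changes are all assumptions.

```python
from collections import defaultdict, deque


def find_flapping(events, window_s=3600, min_changes=4):
    """Return the set of check keys that changed state often in a short time.

    events: iterable of (epoch_seconds, key, state) tuples, where key
    identifies a check, for example a (host, service) pair.
    A key is flagged once it has at least min_changes state changes
    whose timestamps all fall within window_s seconds of each other.
    """
    last_state = {}
    changes = defaultdict(deque)  # key -> timestamps of recent state changes
    flapping = set()
    for ts, key, state in sorted(events, key=lambda e: e[0]):
        previous = last_state.get(key)
        last_state[key] = state
        if previous is None or previous == state:
            continue  # first sighting or no change: nothing to record
        recent = changes[key]
        recent.append(ts)
        # Drop changes that have fallen out of the sliding window.
        while recent and ts - recent[0] > window_s:
            recent.popleft()
        if len(recent) >= min_changes:
            flapping.add(key)
    return flapping
```

Checks flagged this way are candidates for a longer retry interval, a different threshold, or removal, rather than for yet another notification.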
PUT ALERTS IN CONTEXT
    https://app.datad0g.com/dash/dash/1000#/date_range/1310682467000.0-1310684267000.0
Ultimate (hard) question
‣Does this alert impact the business?
 ‣If so by how much?
 ‣Assumes that you track business metrics...
 ‣And they can be accessed programmatically



FOCUS ON THE BUSINESS
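
If a business metric is available as timestamped samples, "by how much?" can be approximated by comparing the metric during the alert with a baseline taken just before it. The sketch below assumes that setup; `business_impact`, the one-hour baseline and the idea of orders per minute as the metric are illustrative choices, not something the talk prescribes.

```python
from statistics import mean


def business_impact(samples, alert_start, alert_end, baseline_s=3600):
    """Relative change of a business metric during an alert.

    samples: iterable of (epoch_seconds, value) pairs for one metric,
    for example orders per minute (a hypothetical metric).
    Returns (mean during alert - baseline mean) / baseline mean, where
    the baseline is the baseline_s seconds right before alert_start.
    Returns None when there is not enough data or the baseline is zero.
    """
    samples = list(samples)
    baseline = [v for t, v in samples
                if alert_start - baseline_s <= t < alert_start]
    during = [v for t, v in samples if alert_start <= t <= alert_end]
    if not baseline or not during:
        return None
    base = mean(baseline)
    if base == 0:
        return None
    return (mean(during) - base) / base
```

A result near 0 suggests the alert had little visible effect on the metric, while a value such as -0.5 would mean the metric ran at half its baseline while the alert was active. This comparison shows correlation only; seasonality or unrelated incidents can produce the same drop.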
What applies to Nagios...
Applies to other sources too
etc...
Thanks


http://datadoghq.com

