Monitoring
Deeper dive
Who am I
Robert Kubiś
DevOps Engineer
https://www.linkedin.com/in/robertkubis89
Mikey Dickerson: Hierarchy of Needs
Monitoring
Collecting, processing, aggregating, and displaying real-time quantitative data
about a system, such as query counts and types, error counts and types,
processing times, and server lifetimes.
● White-box monitoring
● Black-box monitoring
● Dashboard
● Alert
● Root cause
● Push
● Node and machine
Why Monitor?
● Analyzing long-term trends
● Comparing over time or experiment groups
● Alerting
● Building dashboards
● Conducting ad hoc retrospective analysis (i.e., debugging)
Please stop using Nagios (Andy Sykes)
So we can die peacefully…
Who uses it?
Why did you choose it?
Advantages:
● Incredibly simple plugin model
● Simple to use
● Many people know it
● At the top of Google results, and everybody uses it :)
Disadvantages:
● Doesn’t scale - cannot be clustered (Thruk hack)
● Millions of lines of configuration (check_mk hack)
● Horrible interface
● Only for static infrastructure
● Stupid client format (more hacks)
● Perfdata…
● Doesn’t have an API (livestatus hack)
● You always need to hack…
Nagios
When your monitoring sucks...
- Improve the quality of alerts
- Improve the monitoring tools, or even change them
Wait a minute… Before you start solving problems...
UNDERSTAND PROBLEMS AND MEASURE THEM!!!
“To measure is to know”
“If you cannot measure it, you cannot improve it”
William Thomson (Lord Kelvin)
Over-monitoring and alarm fatigue: For whom do the bells toll?
Hospitals in the USA
- Alarm notifications get ignored
- “Yeah, that is not important”
- 72–99% false alerts
- Young parents vs nurses in hospital
- Monitoring means more money
- More is not better
- Patients could die
- Telemetry as a means of preventing, detecting, and improving
Source:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4926996/
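The same "measure it first" idea can be applied to your own alerts before changing any tools. A minimal sketch, assuming a hypothetical alert history in which whoever handled each alert marked it as actionable or not; the field names `check` and `actionable` are illustrative and not taken from any specific tool.

```python
from collections import Counter


def false_alert_ratio(alerts):
    """Return the share of alerts that needed no action (0.0 to 1.0)."""
    if not alerts:
        return 0.0
    false_count = sum(1 for alert in alerts if not alert["actionable"])
    return false_count / len(alerts)


def noisiest_checks(alerts, top=3):
    """Return the checks that produced the most non-actionable alerts."""
    noise = Counter(a["check"] for a in alerts if not a["actionable"])
    return noise.most_common(top)


# Hypothetical history; in practice this would come from the on-call log.
history = [
    {"check": "disk_usage", "actionable": False},
    {"check": "disk_usage", "actionable": False},
    {"check": "http_health", "actionable": True},
    {"check": "cpu_load", "actionable": False},
]

print(f"false alerts: {false_alert_ratio(history):.0%}")
print(noisiest_checks(history))
```

Tracking this ratio over time shows whether alert tuning is actually working, and the per-check ranking points at the checks worth fixing or removing first.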
What to do?
So what should we use for monitoring?
What should monitoring be?
Actionable
Compatible
Essential - only alerts which are needed
Fully Automated
Proactive - should predict failures
Easy for operators
State monitoring: what should it be like?
State (black-box) monitoring currently makes the most sense for VMs and bare-metal machines
What should be monitored with this kind of tool?
● Health endpoints
● Service states (like systemctl status *)
What could be monitored?
● Specific endpoints (for example from a satellite node) with HTTP/TCP checks
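A health-endpoint check of this kind can be sketched as a small plugin. This is an illustration, not the deck's own code: it assumes the endpoint answers HTTP 200 when healthy, and the warning and critical response-time thresholds (1 s and 3 s) are arbitrary. The exit codes 0 to 3 follow the Nagios plugin convention, which Icinga 2 also accepts.

```python
import time
import urllib.error
import urllib.request

# Exit codes follow the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(status_code, elapsed, warn=1.0, crit=3.0):
    """Map an HTTP status and response time (seconds) to (exit_code, message)."""
    if status_code != 200:
        return CRITICAL, f"CRITICAL - HTTP {status_code}"
    if elapsed >= crit:
        return CRITICAL, f"CRITICAL - responded in {elapsed:.2f}s"
    if elapsed >= warn:
        return WARNING, f"WARNING - responded in {elapsed:.2f}s"
    return OK, f"OK - responded in {elapsed:.2f}s"


def check_health(url, timeout=5.0):
    """Request the health endpoint and classify the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError) as exc:
        return CRITICAL, f"CRITICAL - {exc}"
    return classify(status, time.monotonic() - start)


def main(argv):
    """Plugin entry point: print one status line and return the exit code."""
    if len(argv) != 2:
        print("UNKNOWN - usage: check_health.py URL")
        return UNKNOWN
    code, message = check_health(argv[1])
    print(message)
    return code
```

Saved as a hypothetical check_health.py and finished with `sys.exit(main(sys.argv))` (after `import sys`), it could be run by a monitoring server or a satellite node like any other plugin.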
Icinga 2 - started as a Nagios fork but has been rewritten in many places; it has scaling
scenarios (multi-master, with 3 levels of nodes: masters, satellites (e.g. a supervisor per DC),
and clients (check executors)) and features such as an InfluxDB metric exporter, livestatus, etc.
What can we get from Icinga 2?
● Highly available and distributed setup
● Nice, well-documented REST API
● (dynamic inventory)
● Less time needed to implement features
Metrics
Metric tools can be used in two ways:
1. Failure prediction
2. Graphing the data for humans - and for humans that means SIMPLE
The first case is quite simple: rules that detect anomalies, such as more traffic than
usual, and alert if it could have an impact on other clients
The second case is also simple: just graphs for debugging and for better understanding
what’s happening with applications
Not every metric should have an alert (and notifications)!
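The "more traffic than usual" rule can be sketched in a few lines. This is a plain Python illustration of the logic, not how any particular metric tool implements it; the choice of "mean plus three standard deviations" as the definition of "usual" and the sample values are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, current, sigmas=3.0):
    """Return True when `current` exceeds the historical mean by more
    than `sigmas` sample standard deviations."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    baseline = mean(history)
    spread = stdev(history)
    return current > baseline + sigmas * spread


# Hypothetical requests-per-second samples for one client.
usual_traffic = [100, 104, 98, 101, 97, 103, 99, 102]

print(is_anomalous(usual_traffic, 105))  # within the usual range
print(is_anomalous(usual_traffic, 180))  # well above the usual range
```

Only the second reading crosses the threshold, which is the point of the rule: small fluctuations stay on the graph, and only a clear deviation becomes an alert.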
Prometheus
Around 120 ready-to-use dashboards in the Grafana repository (e.g. the MySQL board by
Percona)
Many useful features in one tool: Prometheus has a rich query language, Alertmanager,
support for PagerDuty, etc.
Plenty of exporters (collectors) for standard tools: MySQL, HAProxy, NGINX,
Pagespeed, BIND, Jenkins, scollector
Third-party projects with Prometheus support: GitLab, Kubernetes, etcd, telegraf,
jmx-exporter, collectd
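What an exporter hands to Prometheus is plain text in the Prometheus exposition format. A minimal sketch that renders it by hand, only to show the shape of the data; a real exporter would normally use an official client library, the metric name `app_queue_jobs` is hypothetical, and the help text is not escaped here.

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family; `samples` is a list of (labels_dict, value)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(str(val))}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


print(render_metric(
    "app_queue_jobs",
    "Jobs currently waiting in the queue.",
    "gauge",
    [({"queue": "email"}, 12), ({"queue": "reports"}, 3)],
), end="")
```

Each sample line is a metric name, optional labels in braces, and a value; Prometheus scrapes this text over HTTP and the labels become the dimensions you query on.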
Logs
Servers, applications, network and security devices generate log files.
Errors, problems, and other information are constantly logged and saved for
analysis.
Once an event is detected, the monitoring system sends an alert, either to a
person or to another software/hardware system.
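The "detect an event, then alert" step can be sketched independently of any log tool. This example assumes a hypothetical line layout of `<timestamp> <LEVEL> <service>: <message>` and an arbitrary threshold of two errors per service; both are illustrative.

```python
import re
from collections import Counter

# Assumed line layout: "<timestamp> <LEVEL> <service>: <message>"
LINE_PATTERN = re.compile(r"^(\S+) (ERROR|WARN|INFO) (\S+): (.*)$")


def count_errors(lines):
    """Count ERROR lines per service, skipping lines that do not match."""
    errors = Counter()
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match and match.group(2) == "ERROR":
            errors[match.group(3)] += 1
    return errors


def alerts_for(errors, threshold=2):
    """Return alert messages for services at or above the error threshold."""
    return [
        f"{service}: {count} errors in the inspected window"
        for service, count in sorted(errors.items())
        if count >= threshold
    ]


sample = [
    "2024-01-01T10:00:00Z INFO api: started",
    "2024-01-01T10:00:05Z ERROR api: upstream timeout",
    "2024-01-01T10:00:07Z ERROR api: upstream timeout",
    "2024-01-01T10:00:09Z ERROR worker: job failed",
    "not a structured line",
]

for alert in alerts_for(count_errors(sample)):
    print(alert)
```

The returned messages are where a notification to a person or to another system would be attached; a single error from `worker` stays below the threshold and produces no alert.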
Elasticsearch stack
Monitoring strategy
Icinga 2 for state monitoring on bare metal, VMs and VMs in cloud.
Prometheus for metrics and data from Kubernetes (or other container) clusters.
ELK stack for logs
Is that enough?
What should be the next step?
What is PagerDuty?
User Settings
Notification Rules
Schedules
Escalation Policies
Services
Integrations
Integrations list
Connect with any tool that provides incoming event data.
Extensions
Extensions list
Extend the PagerDuty workflow to your existing tools.
Good practices for alerts
● Notify before an incident happens
● Actionable alarms
● Value of measuring things
● Documentation - not only one-liners in the on-call wiki
● Reduce the number of tools
● Terraform
Let’s say that you’re rich :)
New Relic
STACKDRIVER
● Full-Stack Monitoring, Powered by Google
● For Cloud Platform, AWS, and Hybrid Deployments
● Identify Trends, Prevent Issues
● Reduce Monitoring Overhead
● Improve Signal-to-Noise
● Fix Problems Faster
Stackdriver heatmap
STACKDRIVER MONITORING FEATURES
● Debugger
● Error reporting
● Rapid discovery
● Uptime monitoring
● Integrations
● Smart defaults
● Alerts
● Tracing
● Logging
● Dashboards
● Profiling
MONITORING = PEOPLE
Not only tools...
Team
To make sense now and in the future, changes to the monitoring infrastructure
should be supported by the development and "reacting" teams.
Reacting team:
● People working 24/7 in shifts, watching the boards and reacting to issues
● An incident manager (IM) who makes decisions and looks into tuning the monitoring
● People with “programming” skills responsible for deploying the IM's proposals
(writing new checks, adding some pieces of code)
Plan your work
