KMG Group GmbH, http://www.kmggroup.ch
Magnus Lübeck, Zürich, 2019-11-12
http://kmg.group
Icinga Day Zürich 2019
2
Sysadmin since the ’90s
Unix/Oracle at Volvo
Pre-sales, Sun Microsystems reseller
Oracle DBA at CERN
IT Operations manager at Accarda
IT Operations manager at Kanton LU
Owner of KMG Group GmbH
Built infrastructure and operations at
Swisscom
peaq
Serafe
This is me
3
Where I work
4
What we do
5
Quick overview
People, tools and processes
The four fielder
Telemetry and health
Desire lines
OSS and Free software in modern operations environments
Tool landscape
Icinga’s part in the mechano
Outline
6
The stack – game of tetris
7
dennisadams.net
Metrics
Operational tools
Processes
Standards
MOPS
8
Telemetry is part of good systems design
Measurement points should be a mandatory part of EVERY system
This has been known for many years, across many industries
Metrics - /status, /health
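A minimal sketch of what a /health measurement point could look like, using only the Python standard library. The check names, the payload fields and the port are assumptions made for illustration; the slides do not prescribe a format.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()


def health_payload(checks):
    """Run each named check and summarise the results as a dict."""
    results = {name: bool(check()) for name, check in checks.items()}
    status = "ok" if all(results.values()) else "degraded"
    return {
        "status": status,
        "uptime_seconds": round(time.time() - START_TIME, 1),
        "checks": results,
    }


# Hypothetical checks; a real application would probe its own dependencies.
CHECKS = {
    "database": lambda: True,
    "queue": lambda: True,
}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        payload = health_payload(CHECKS)
        body = json.dumps(payload).encode("utf-8")
        # 200 when healthy, 503 when any check fails, so plain HTTP checks work too.
        self.send_response(200 if payload["status"] == "ok" else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8080):
    """Start the endpoint; the address and port are assumptions."""
    HTTPServer((host, port), HealthHandler).serve_forever()
```

Calling serve() starts the endpoint; a monitoring system can then poll /health and alert on a non-200 answer or on individual entries in "checks".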
9
The use of waveforms to diagnose broken things is far from new.
The triangular form is particularly useful.
Can be used in many ways
Very useful for repetitive patterns.
Metrics - /status, /health
12
A fool with a tool is still a fool.
Get smart people
Use tools
Integrate the tools with your environment.
Tools can cost money, but they do not have to
Operational tools
13
Implement simple processes
Use the right tools, and don’t make the processes complicated.
Processes - desire lines
14
Morningcheck ok
Processes – desire lines
15
Naming conventions
No servers named after porn stars
Baseline installations
Mini OS install
Automation/Infrastructure as code
Ansible, Chef, Puppet
Coding guidelines
Standards
16
Inception on so many levels
Deals with “less than 24/7” SLAs
You can service check your SLA
Shameless plug – SLA check
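As an illustration of why a "less than 24/7" SLA needs its own calculation, here is a sketch that counts downtime only inside the agreed service hours. The service window (Monday to Friday, 07:00 to 19:00), the one-minute resolution and all function names are assumptions; this is not the SLA check mentioned on the slide.

```python
from datetime import datetime, time, timedelta

# Hypothetical service window: Monday-Friday, 07:00-19:00.
SERVICE_DAYS = {0, 1, 2, 3, 4}
SERVICE_START = time(7, 0)
SERVICE_END = time(19, 0)


def in_service_window(moment):
    """True if the moment falls inside the agreed service hours."""
    return (moment.weekday() in SERVICE_DAYS
            and SERVICE_START <= moment.time() < SERVICE_END)


def sla_availability(period_start, period_end, outages,
                     step=timedelta(minutes=1)):
    """Return availability (0..1), counting only time inside the service window.

    outages is a list of (start, end) datetime pairs.
    """
    in_window = 0
    down = 0
    moment = period_start
    while moment < period_end:
        if in_service_window(moment):
            in_window += 1
            if any(start <= moment < end for start, end in outages):
                down += 1
        moment += step
    return 1.0 if in_window == 0 else 1 - down / in_window
```

An outage at 02:00 does not touch the result, while one that starts at 18:30 counts only until the window closes at 19:00. Comparing the returned value with the contracted target turns the SLA itself into something a service check can report on.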
17
The four fielder
[Diagram labels: Technical Monitoring; Telemetry; Operational tools; Inventories; Configuration Management; Admin GUI; Orchestration; IAM; Ticketing; Dashboarding; Documentation; Remote Access; Code repository; Artifact Repository; Application Specific Tools (×4); SLA Monitoring; Audience Spectrum]
18
Remote access
Systems monitoring
Documentation
Identity management
Ticketing
Inventory (not CMDB)
Automation/Orchestration
Telemetry (Technical performance monitoring)
Dashboarding
Technical tools (sysadmin toolbox)
SLA monitoring
Tool landscape
[Diagram: the four fielder, with the same labels as on the previous slide]
19
A customer of mine had
8’500 Open Critical Alerts
15’300 Warnings
Typical “cry wolf” scenario
3 possible/allowed Actions
Solve the problem
Change the threshold (change the metric, template, standard)
Remove the alert
Monitoring theory:
Bad design reduces the value of your monitoring
20
Move the responsibility for delivering telemetry to the application designers and the application owners
Help them learn how to write service checks
A service delivery is not complete unless telemetry and monitoring packages are delivered
Application service check responsibility
devOps or stoneAgeOps?
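To show how small a service check can be, here is a sketch of a disk usage check that follows the common monitoring plugin convention: exit code 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN, and one line of output with performance data after the pipe sign. The thresholds, the path and the function names are assumptions for illustration.

```python
import shutil
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_percent, warn, crit):
    """Map a measured value to a plugin state."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (state, message) for the file system holding path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # The check itself failed, which is different from a bad measurement.
        return UNKNOWN, f"UNKNOWN - cannot read {path}: {exc}"
    used_percent = 100.0 * usage.used / usage.total
    state = evaluate(used_percent, warn, crit)
    message = (f"{LABELS[state]} - {path} is {used_percent:.1f}% full"
               f" | used_percent={used_percent:.1f}%;{warn};{crit}")
    return state, message


def main():
    state, message = check_disk()
    print(message)
    sys.exit(state)
```

An application team can copy this shape, replace check_disk with a probe of their own application, and hand the script over together with the application.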
21
Question from an auditor (ISO 27001 audit):
How do you ensure that all applications work after a patch run?
My answer:
We don’t.
The big audit monster
22
Monitoring – Icinga
Service
Checks
23
Audience
24
One-stop shop: Icinga
Service Checks
Application (×7)
Scheduled Tasks
Notifications
Signage
Raspberry Pi 4 with 2 screens
Dashboard
Smashing
Telemetry and logging
25
Backup slides
26
Manually edit config – use it when you learn Icinga
Good ways to do it
Automate with an Icinga-centric configuration repository – Director
Icinga API – write the integration yourself
Automation via Ansible
Metamonitoring
By using your inventory, you know what you are monitoring
And what you are not monitoring
Icinga client and service registration
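The metamonitoring idea boils down to comparing two lists of hosts. A minimal sketch follows, assuming both lists are already available as plain host names; how they are fetched (for example from the inventory export and from the Icinga API) is left out, and the function name is hypothetical.

```python
def coverage_report(inventory_hosts, monitored_hosts):
    """Compare the inventory with what the monitoring system knows about."""
    inventory = {host.lower() for host in inventory_hosts}
    monitored = {host.lower() for host in monitored_hosts}
    return {
        # In the inventory, but nobody is watching it.
        "unmonitored": sorted(inventory - monitored),
        # Watched, but missing from the inventory.
        "unknown_to_inventory": sorted(monitored - inventory),
        "covered": sorted(inventory & monitored),
    }
```

Run as a service check of its own, a non-empty "unmonitored" list could raise a warning, so gaps in coverage show up in the same place as every other alert.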
27
The layer cake is your monitoring standard grouped by common denominators.
Group service checks in layers (i.e. L0 – L5)
L0 – OS level – (Linux admin)
CPU, disk usage, ssh, ping, fs usage {/, /var, /home}
L1 – Server type – shared OS resources (Linux admin)
iops on db fs, fs usage on /app/ora
…
L5 – Application checks – (Application Managers)
Application specific checks
The Layered Cake
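One way to write the layer cake down is as a small data structure that maps each layer to its owner and its checks. The sketch below uses only the layers spelled out on the slide (L0, L1 and L5); the class and function names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    owner: str
    checks: tuple


LAYERS = {
    "L0": Layer("OS level", "Linux admin",
                ("cpu", "disk usage", "ssh", "ping",
                 "fs usage /", "fs usage /var", "fs usage /home")),
    "L1": Layer("Server type", "Linux admin",
                ("iops on db fs", "fs usage /app/ora")),
    # L2 to L4 are not spelled out on the slide and are left out here.
    "L5": Layer("Application checks", "Application managers",
                ("application specific checks",)),
}


def checks_for(layer_ids):
    """Collect (layer, owner, check) tuples for a host assigned to these layers."""
    return [(layer_id, LAYERS[layer_id].owner, check)
            for layer_id in layer_ids
            for check in LAYERS[layer_id].checks]
```

A database server would then be assigned ["L0", "L1"] plus its application layer, and every check carries the name of the group responsible for it.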
28
The human brain is excellent at identifying harmonies and regularities.
Ingredient number 2: Sawtooth waveform
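A sawtooth in a metric (a value that climbs steadily and then drops back) can also be picked out by a few lines of code. The sketch below marks every sharp drop and returns the spacing between drops; the 50% drop threshold and the function names are assumptions, and the approach is deliberately naive.

```python
def sawtooth_resets(samples, drop_fraction=0.5):
    """Return indices where the value falls by at least drop_fraction of the range.

    samples must be a non-empty sequence of numbers.
    """
    span = max(samples) - min(samples)
    if span == 0:
        return []
    threshold = drop_fraction * span
    return [i for i in range(1, len(samples))
            if samples[i - 1] - samples[i] >= threshold]


def reset_intervals(samples, drop_fraction=0.5):
    """Distances between consecutive resets; a regular pattern gives equal values."""
    resets = sawtooth_resets(samples, drop_fraction)
    return [later - earlier for earlier, later in zip(resets, resets[1:])]
```

If the intervals suddenly stop being equal, or the resets stop altogether, the regular pattern is broken, which is the kind of change the eye catches on a graph at a glance.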


Efficient IT operations using monitoring systems and standardized tools - Icinga Camp Zurich 2019
