Maciej Lasyk, Ganglia & Nagios
Maciej Lasyk
11. Sesja Linuksowa
Wrocław, 2014-04-06
Ganglia & Nagios
Ganglia... what?
Ganglia – clusters (groups) of neurons found outside
the central nervous system
Just a little about monitoring
- the need for monitoring
- measuring availability
- measuring performance
- gathering additional metrics
Monitoring is critical for HA
How to measure availability?
A = Uptime / (Uptime + Downtime)
MTTD (Mean Time to Diagnose)
The average time it takes to diagnose the problem
MTTR (Mean Time to Repair)
The average time it takes to fix a problem
MTTF (Mean Time to Failure)
The average time the service behaves correctly before it fails
MTBF (Mean Time Between Failures)
The average time between consecutive failures of the service
A = MTTF / MTBF = MTTF / (MTTF + MTTD + MTTR)
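A quick worked sketch of the formula above. The hour values are made-up numbers for illustration, not figures from the talk.

```python
def availability(mttf: float, mttd: float, mttr: float) -> float:
    """A = MTTF / MTBF, where MTBF = MTTF + MTTD + MTTR (all in the same unit)."""
    mtbf = mttf + mttd + mttr
    if mtbf <= 0:
        raise ValueError("MTBF must be positive")
    return mttf / mtbf


# Hypothetical service: runs correctly for 720 h on average,
# then needs 0.5 h to diagnose and 1.5 h to repair.
a = availability(mttf=720.0, mttd=0.5, mttr=1.5)
print(f"availability = {a:.4%}")
```

With MTTF fixed, the only ways to raise A are to shorten MTTD and MTTR, and MTTD is the part monitoring influences most directly.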
What should we monitor?
- hardware housing
- devices
- storage
- network
- hosts
- software (very deep hole)
Think dependencies!
When an outage hits us – don't panic!
- Notifications
- Escalations
L1 <-> L2 <-> L3 <-> L4 lol ;)
desktop support / devs / ops / networking / storage / middleware / dc / security
- Clock is ticking – it should be simple
- What if a cell phone is offline or someone is out?
Monitoring: notification issues
- false positives
- major events
- failover notifications?
- tolerance & critical thresholds
Monitoring: reporting
- baseline
- correlation between incidents and
change management
- trending info
- reporting
Monitoring: good practices
- don't NIH!
- DVCS
- testing envs
- think usability!
- passive checks
- automate – don't hardcode
- security
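One way to read "automate – don't hardcode" is to generate Nagios object definitions from an inventory instead of editing them by hand. The sketch below is only an illustration: the inventory, the host names and addresses, and the `generic-host` template name are all assumptions.

```python
from typing import Dict, List

# Hypothetical inventory; in practice this could come from a CMDB,
# a cloud API, or a configuration management tool.
INVENTORY: List[Dict[str, str]] = [
    {"host_name": "web01", "address": "192.0.2.10", "hostgroups": "web"},
    {"host_name": "db01", "address": "192.0.2.20", "hostgroups": "db"},
]


def render_host(host: Dict[str, str], template: str = "generic-host") -> str:
    """Render one Nagios `define host` block that inherits from a template."""
    lines = ["define host {", f"    use {template}"]
    for key in ("host_name", "address", "hostgroups"):
        lines.append(f"    {key} {host[key]}")
    lines.append("}")
    return "\n".join(lines)


def render_all(inventory: List[Dict[str, str]]) -> str:
    """Render every host in the inventory, separated by blank lines."""
    return "\n\n".join(render_host(h) for h in inventory) + "\n"


if __name__ == "__main__":
    print(render_all(INVENTORY), end="")
```

The generated file can live in version control and be validated in a testing environment before it reaches production, which ties in with the DVCS and testing points on the same list.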
Last but not least...
“Quis custodiet ipsos custodes?”
(Who will guard the guards?)
Nagios recap
Host / Services / Contacts
- hosts, hostgroups
- services, service groups
- templates
- time periods
- host and service dependencies
- regular expressions
Checks and states
- frequencies & thresholds
- scheduling downtimes
- outages and flapping
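How thresholds turn a measurement into a state can be sketched as a small plugin-style check. The exit codes 0/1/2/3 and the `label=value;warn;crit` performance data layout follow the usual Nagios plugin conventions. The load-average metric and the default thresholds are assumptions picked for illustration.

```python
import os
from typing import Tuple

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value: float, warn: float, crit: float) -> int:
    """Map a measurement to a state; in this sketch higher values are worse."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def check_load(load1: float, warn: float = 4.0, crit: float = 8.0) -> Tuple[int, str]:
    """Return (exit_code, output line); perfdata follows the '|' separator."""
    state = evaluate(load1, warn, crit)
    text = (f"LOAD {STATE_NAMES[state]} - load1={load1:.2f} "
            f"| load1={load1:.2f};{warn};{crit}")
    return state, text


def main() -> int:
    """Read the 1-minute load average and report it like a plugin would."""
    try:
        code, line = check_load(os.getloadavg()[0])
    except (OSError, AttributeError):
        code, line = UNKNOWN, "LOAD UNKNOWN - cannot read load average"
    print(line)
    return code

# A real plugin would finish with: sys.exit(main())
```

Check frequency and the number of retries before a state becomes "hard" are configured on the Nagios side, not in the plugin.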
Notifications
- periods
- groups
- which states to be notified about?
- escalations / rotations
- custom notification methods
Monitoring remotes
- NRPE daemons
- checks via SSH
Web interface:
- tactical overview
- availability reports
- trends
- network maps
Networking recap
Unicast
Multicast
Broadcast
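The three delivery modes can be told apart from the destination address alone. A minimal sketch using Python's standard `ipaddress` module follows. The example network is an assumption, and 239.2.11.71 is used only because it is the multicast group commonly seen in default gmond configurations.

```python
import ipaddress

LIMITED_BROADCAST = ipaddress.IPv4Address("255.255.255.255")


def classify(destination: str, network: str) -> str:
    """Classify an IPv4 destination as seen from hosts on `network`."""
    addr = ipaddress.IPv4Address(destination)
    net = ipaddress.IPv4Network(network)
    if addr.is_multicast:  # 224.0.0.0/4: one sender, every subscribed receiver
        return "multicast"
    if addr == net.broadcast_address or addr == LIMITED_BROADCAST:
        return "broadcast"  # every host on the segment
    return "unicast"  # exactly one receiver


for dst in ("192.0.2.10", "239.2.11.71", "192.0.2.255"):
    print(dst, "->", classify(dst, "192.0.2.0/24"))
```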
Ganglia – what is it?
Problems of big scale:
20k hosts with a zillion metrics polled every 10 seconds
It is fully redundant (until you break it)
It is very scalable
Regexp searches and ad hoc creation of views :)
Ganglia – architecture
Ganglia – topologies
- Default multicast topology
- Deaf / mute multicast topology
- Unicast topology
- Gmetad topology
- Gmetad HA topology (active-active)
- Gmetad hierarchical topology
Ganglia – RRDcached
Ganglia – sFlow
Ganglia – web
- grid view
- cluster view
- physical view
- host view
- compare hosts
Ganglia – web (events)
Events have a JSON-based API
Think – integration with whatever app :)
Ganglia – web (dashboards)
- Create view -> apply as dashboard
- Create dashboard from XML
- Generate graphs and add to views
Ganglia – web (graphs)
Ganglia – metrics
- base / extended metrics
- own modules
- c / c++
- mod_python
- spoofing
- gmetric
- gmetric4j / java
- Which to choose? gmetric / python / c/c++?
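For the Python route, a gmond module is an ordinary Python file exposing `metric_init`, a callback per metric, and `metric_cleanup`. The sketch below follows that interface. The metric name, the spool directory, and the `path` parameter are hypothetical.

```python
import os

# Hypothetical default; gmond passes the real value via module params.
_params = {"path": "/var/spool/myapp"}


def queue_depth(name):
    """Callback invoked by gmond at each collection; returns the metric value."""
    try:
        return len(os.listdir(_params["path"]))
    except OSError:
        return 0


def metric_init(params):
    """Called once by gmond; returns the list of metric descriptors."""
    _params.update(params)
    return [{
        "name": "myapp_queue_depth",  # hypothetical metric name
        "call_back": queue_depth,
        "time_max": 60,
        "value_type": "uint",
        "units": "files",
        "slope": "both",
        "format": "%u",
        "description": "Files waiting in the spool directory",
        "groups": "myapp",
    }]


def metric_cleanup():
    """Called by gmond on shutdown; nothing to release in this sketch."""
    pass
```

As a rough rule of thumb, gmetric suits ad hoc values pushed from scripts, a Python module suits metrics collected on every gmond cycle, and C/C++ is for cases where collection overhead really matters.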
Ganglia and logfiles?
ganglia-logtailer
- https://bitbucket.org/maplebed/ganglia-logtailer
- parses logfiles (in real time)
- pushes data to ganglia (via gmetric)
- yup – based on specific log formats
- yet still – open source so poke around ;)
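The idea (parse log lines, aggregate, push via gmetric) can be sketched in a few lines. This is not ganglia-logtailer's code: the log format (access-log lines with the status code right after the quoted request) and the metric names are assumptions.

```python
import re
from collections import Counter
from typing import Iterable, List

# Assumed format: ... "GET /path HTTP/1.1" 200 512 ...
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def count_status_classes(lines: Iterable[str]) -> Counter:
    """Count responses per status class (2xx, 3xx, 4xx, 5xx)."""
    counts: Counter = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts


def gmetric_argv(name: str, value: int, units: str = "requests") -> List[str]:
    """Build a gmetric command line; hand it to subprocess.run() to publish."""
    return ["gmetric", f"--name={name}", f"--value={value}",
            "--type=uint32", f"--units={units}"]


sample = [
    '192.0.2.1 - - [06/Apr/2014:10:00:00 +0200] "GET / HTTP/1.1" 200 512',
    '192.0.2.2 - - [06/Apr/2014:10:00:01 +0200] "GET /x HTTP/1.1" 404 0',
]
for status_class, n in sorted(count_status_classes(sample).items()):
    print(" ".join(gmetric_argv(f"http_{status_class}", n)))
```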
So... Nagios + Ganglia!
3 ways of integration:
- ganglia-web/nagios (PHP & bash based)
https://github.com/ganglia/ganglia-web
- ganglia-nagios-bridge (Python & cron based)
https://github.com/ganglia/ganglia-nagios-bridge
- check-ganglia-metric (Python)
https://github.com/ganglia/ganglia_contrib
Nagios + Ganglia: ganglia-web/nagios
https://github.com/ganglia/ganglia-web
Sending Nagios Data to Ganglia
service_perfdata_command
Or replace Nagios checks with Ganglia!
- Check heartbeat.
- Check a single metric on a specific host.
- Check multiple metrics on a specific host.
- Check multiple metrics across a regex-defined
range of hosts
Nagios pulls info from Ganglia via HTTP
Nagios + Ganglia: ganglia-nagios-bridge
- https://github.com/ganglia/ganglia-nagios-bridge
- Python script run e.g. from crontab
- pulls data from Ganglia XML via sockets
- parses XML
- sends data to Nagios
- Nagios handles only passive checks
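For a feel of what a passive check result looks like, here is one generic way to hand a result to Nagios: the `PROCESS_SERVICE_CHECK_RESULT` external command. This is a sketch of the mechanism, not necessarily how ganglia-nagios-bridge submits its results. The host name, service name, and command file path are assumptions.

```python
import time
from typing import Optional


def passive_service_result(host: str, service: str, state: int, output: str,
                           now: Optional[int] = None) -> str:
    """Format a PROCESS_SERVICE_CHECK_RESULT external command line."""
    ts = int(time.time()) if now is None else now
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{state};{output}\n"


def submit(line: str, command_file: str) -> None:
    """Write the command to Nagios' external command file (a named pipe).

    The path depends on the installation (see `command_file` in nagios.cfg).
    """
    with open(command_file, "w") as pipe:
        pipe.write(line)


print(passive_service_result("host.example.com", "cpu_idle", 0,
                             "OK - cpu_idle=95.0"), end="")
```

The service on the Nagios side needs passive checks enabled, and a freshness check is worth considering so that a silent cron job does not go unnoticed.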
Nagios + Ganglia: check_ganglia_metric
- https://pypi.python.org/pypi/check_ganglia_metric/
- basically a Nagios plugin
- pulls data from Ganglia XML via sockets
- check_ganglia_metric.py \
--gmetad_host=gmetad-server.example.com \
--metric_host=host.example.com --metric_name=cpu_idle
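"Pulls data from Ganglia XML via sockets" boils down to connecting to the XML port, reading until the peer closes, and walking the HOST and METRIC elements. The sketch below is not check_ganglia_metric's implementation. Port 8651 is the usual gmetad XML port, assumed here as a default.

```python
import socket
import xml.etree.ElementTree as ET
from typing import Optional


def fetch_xml(host: str, port: int = 8651, timeout: float = 5.0) -> bytes:
    """Read the XML dump that gmetad (or gmond) sends to a client on connect."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # peer closed the connection: dump is complete
                break
            chunks.append(data)
    return b"".join(chunks)


def metric_value(xml_data: bytes, metric_host: str, metric_name: str) -> Optional[str]:
    """Return the VAL attribute of one metric on one host, or None if absent."""
    root = ET.fromstring(xml_data)
    for host in root.iter("HOST"):
        if host.get("NAME") == metric_host:
            for metric in host.iter("METRIC"):
                if metric.get("NAME") == metric_name:
                    return metric.get("VAL")
    return None

# Usage, assuming a reachable gmetad:
#   xml_data = fetch_xml("gmetad-server.example.com")
#   print(metric_value(xml_data, "host.example.com", "cpu_idle"))
```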
Nagios + Ganglia
Which integration should I use?
Seriously – try them yourself and test
Freenode #ganglia
https://lists.sourceforge.net/lists/listinfo/ganglia-general
sources?
- “Monitoring with Ganglia” book
- also nagios.org
- and “Web Operations” book
- plus some experience ;)
Maciej Lasyk
11. Sesja Linuksowa
2014-04-06, Wrocław
http://maciek.lasyk.info/sysop
maciek@lasyk.info
@docent-net
Ganglia & Nagios
Thank you :)
  • 1. Maciej Lasyk, Ganglia & Nagios Maciej Lasyk 11. Sesja Linuksowa Wrocław, 2014-04-06 1/25 Ganglia & Nagios
  • 2. Ganglia.. what? Ganglia – cluster / group of neurons found outside the central nervous system Maciej Lasyk, Ganglia & Nagios 2/25
  • 3. Just a little about monitoring - the need for monitoring Maciej Lasyk, Ganglia & Nagios 3/25
  • 4. Just a little about monitoring - the need for monitoring - measuring availability Maciej Lasyk, Ganglia & Nagios 3/25
  • 5. Just a little about monitoring - the need for monitoring - measuring availability - measuring performance Maciej Lasyk, Ganglia & Nagios 3/25
  • 6. Just a little about monitoring - the need for monitoring - measuring availability - measuring performance - gathering additional metrics Maciej Lasyk, Ganglia & Nagios 3/25
  • 7. Monitoring is critical for HA How to measure availability? Maciej Lasyk, Ganglia & Nagios 4/25
  • 8. Monitoring is critical for HA How to measure availability? A = Uptime / (Uptime + Downtime) Maciej Lasyk, Ganglia & Nagios 4/25
  • 9. Monitoring is critical for HA How to measure availability? A = Uptime / (Uptime + Downtime) MTTD (Mean Time to Diagnose) The average time it takes to diagnose the problem Maciej Lasyk, Ganglia & Nagios 4/25
  • 10. Monitoring is critical for HA How to measure availability? A = Uptime / (Uptime + Downtime) MTTD (Mean Time to Diagnose) The average time it takes to diagnose the problem MTTR (Mean Time to Repair) The average time it takes to fix a problem Maciej Lasyk, Ganglia & Nagios 4/25
  • 11. Monitoring is critical for HA How to measure availability? A = Uptime / (Uptime + Downtime) MTTD (Mean Time to Diagnose) The average time it takes to diagnose the problem MTTR (Mean Time to Repair) The average time it takes to fix a problem MTTF (Mean Time to Failure) The average time there is correct behavior Maciej Lasyk, Ganglia & Nagios 4/25
  • 12. Monitoring is critical for HA How to measure availability? A = Uptime / (Uptime + Downtime) MTTD (Mean Time to Diagnose) The average time it takes to diagnose the problem MTTR (Mean Time to Repair) The average time it takes to fix a problem MTTF (Mean Time to Failure) The average time there is correct behavior MTBF (Mean Time Between Failures) The average time between different failures of the service Maciej Lasyk, Ganglia & Nagios 4/25
  • 13. Monitoring is critical for HA Maciej Lasyk, Ganglia & Nagios 4/25
  • 14. Monitoring is critical for HA Maciej Lasyk, Ganglia & Nagios A = MTTF / MTBF = MTTF / (MTTF + MTTD + MTTR) 4/25
  • 15. What should we monitor? Maciej Lasyk, Ganglia & Nagios - hardware housing - devices - storage - network - hosts - software (very deep hole) 5/25
  • 16. What should we monitor? Maciej Lasyk, Ganglia & Nagios - hardware housing - devices - storage - network - hosts - software (very deep hole) Think dependencies! 5/25
  • 17. When outage hits us – don't panic! Maciej Lasyk, Ganglia & Nagios - Notifications 6/25
  • 18. When outage hits us – don't panic! Maciej Lasyk, Ganglia & Nagios - Notifications - Escalations L1 <-> L2 <-> L3 <-> L4 lol ;) desktop support / devs / ops / networking / / storage / middleware / dc / security 6/25
  • 19. When outage hits us – don't panic! Maciej Lasyk, Ganglia & Nagios - Notifications - Escalations L1 <-> L2 <-> L3 <-> L4 lol ;) desktop support / devs / ops / networking / / storage / middleware / dc / security - Clock is ticking – it should be simple 6/25
  • 20. When outage hits us – don't panic! Maciej Lasyk, Ganglia & Nagios - Notifications - Escalations L1 <-> L2 <-> L3 <-> L4 lol ;) desktop support / devs / ops / networking / / storage / middleware / dc / security - Clock is ticking – it should be simple - What if cell is offline or someone is out? 6/25
  • 21. Monitoring: notifications issues Maciej Lasyk, Ganglia & Nagios - false positives 7/25
  • 22. Maciej Lasyk, Ganglia & Nagios - false positives - major events Monitoring: notifications issues 7/25
  • 23. Maciej Lasyk, Ganglia & Nagios - false positives - major events - failover notifications? Monitoring: notifications issues 7/25
  • 24. Maciej Lasyk, Ganglia & Nagios - false positives - major events - failover notifications? - tolerance & critical thresholds Monitoring: notifications issues 7/25
  • 25. Monitoring: reporting Maciej Lasyk, Ganglia & Nagios - baseline 8/25
  • 26. Maciej Lasyk, Ganglia & Nagios - baseline - correlation between incidents and change management Monitoring: reporting 8/25
  • 27. Maciej Lasyk, Ganglia & Nagios - baseline - correlation between incidents and change management - trending info Monitoring: reporting 8/25
  • 28. Maciej Lasyk, Ganglia & Nagios - baseline - correlation between incidents and change management - trending info - reporting Monitoring: reporting 8/25
  • 29. Monitoring: good practices Maciej Lasyk, Ganglia & Nagios - don't NIH! 9/25
  • 30. Maciej Lasyk, Ganglia & Nagios - don't NIH! - DVCS Monitoring: good practices 9/25
  • 31. Maciej Lasyk, Ganglia & Nagios - don't NIH! - DVCS - testing envs Monitoring: good practices 9/25
  • 32. Maciej Lasyk, Ganglia & Nagios - don't NIH! - DVCS - testing envs - think usability! Monitoring: good practices 9/25
  • 33. Maciej Lasyk, Ganglia & Nagios - don't NIH! - DVCS - testing envs - think usability! - passive checks Monitoring: good practices 9/25
  • 34. Maciej Lasyk, Ganglia & Nagios - don't NIH! - DVCS - testing envs - think usability! - passive checks - automate – don't hardcode Monitoring: good practices 9/25
  • 35. Maciej Lasyk, Ganglia & Nagios - don't NIH! - DVCS - testing envs - think usability! - passive checks - automate – don't hardcode - security Monitoring: good practices 9/25
  • 36. Maciej Lasyk, Ganglia & Nagios Last but not least... “Quis custodiet ipsos custodes?” (Who will guard the guards?) Monitoring: good practices 9/25
  • 37. Maciej Lasyk, Ganglia & Nagios Nagios recap Host / Services / Contacts - hosts, hostgroups 10/25
  • 38. Maciej Lasyk, Ganglia & Nagios Nagios recap Host / Services / Contacts - hosts, hostgroups - services, service groups 10/25
  • 39. Maciej Lasyk, Ganglia & Nagios Nagios recap Host / Services / Contacts - hosts, hostgroups - services, service groups - templates 10/25
  • 40. Maciej Lasyk, Ganglia & Nagios Nagios recap Host / Services / Contacts - hosts, hostgroups - services, service groups - templates - time periods 10/25
  • 41. Maciej Lasyk, Ganglia & Nagios Nagios recap Host / Services / Contacts - hosts, hostgroups - services, service groups - templates - time periods - host and services dependencies 10/25
  • 42. Maciej Lasyk, Ganglia & Nagios Nagios recap Host / Services / Contacts - hosts, hostgroups - services, service groups - templates - time periods - host and services dependencies - regular expressions 10/25
  • 43. Maciej Lasyk, Ganglia & Nagios Nagios recap 10/25
  • 44. Maciej Lasyk, Ganglia & Nagios Nagios recap 10/25
  • 45. Maciej Lasyk, Ganglia & Nagios Nagios recap Checks and states - frequencies & thresholds 10/25
  • 46. Maciej Lasyk, Ganglia & Nagios Nagios recap Checks and states - frequencies & thresholds - scheduling downtimes 10/25
  • 47. Maciej Lasyk, Ganglia & Nagios Nagios recap Checks and states - frequencies & thresholds - scheduling downtimes - outages and flapping 10/25
  • 48. Maciej Lasyk, Ganglia & Nagios Nagios recap Notifications - periods 10/25
  • 49. Maciej Lasyk, Ganglia & Nagios Nagios recap Notifications - periods - groups 10/25
  • 50. Maciej Lasyk, Ganglia & Nagios Nagios recap Notifications - periods - groups - which states to be notified about? 10/25
  • 51. Maciej Lasyk, Ganglia & Nagios Nagios recap Notifications - periods - groups - which states to be notified about? - escalations / rotations 10/25
  • 52. Maciej Lasyk, Ganglia & Nagios Nagios recap Notifications - periods - groups - which states to be notified about? - escalations / rotations - custom notifications method 10/25
  • 53. Maciej Lasyk, Ganglia & Nagios Nagios recap Monitoring remotes - NRPE daemons - checks via SSH 10/25
  • 54. Maciej Lasyk, Ganglia & Nagios Nagios recap Web interface – tactical overview 10/25
  • 55. Maciej Lasyk, Ganglia & Nagios Nagios recap Web interface – availability reports 10/25
  • 56. Maciej Lasyk, Ganglia & Nagios Nagios recap Web interface – trends 10/25
  • 57. Maciej Lasyk, Ganglia & Nagios Nagios recap Web interface – network maps 10/25
  • 58. Maciej Lasyk, Ganglia & Nagios Networking recap Unicast 11/25
  • 59. Maciej Lasyk, Ganglia & Nagios Networking recap Multicast 11/25
  • 60. Maciej Lasyk, Ganglia & Nagios Networking recap Broadcast 11/25
  • 61. Maciej Lasyk, Ganglia & Nagios Ganglia – what is it? Problems of big scale: 20k hosts with zylion metrics probed every 10 seconds It is fully redundant (until you spoil it) It is very scalable Regexp searches and creating of views – adhoc :) 12/25
  • 62. Maciej Lasyk, Ganglia & Nagios Ganglia – architecture 13/25
  • 63. Maciej Lasyk, Ganglia & Nagios Ganglia – architecture 13/25
  • 64. Maciej Lasyk, Ganglia & Nagios Ganglia – topologies Default multicast topology 14/25
  • 65. Maciej Lasyk, Ganglia & Nagios Ganglia – topologies Deaf / mute multicast topology 14/25
  • 66. Maciej Lasyk, Ganglia & Nagios Ganglia – topologies Unicast topology 14/25
  • 67. Maciej Lasyk, Ganglia & Nagios Ganglia – topologies Gmetad topology 14/25
  • 68. Maciej Lasyk, Ganglia & Nagios Ganglia – topologies Gmetad HA topology (active - active) 14/25
  • 69. Maciej Lasyk, Ganglia & Nagios Ganglia – topologies Gmetad hierarchical topology 14/25
  • 70. Maciej Lasyk, Ganglia & Nagios Ganglia – RRDcached 15/25
  • 71. Maciej Lasyk, Ganglia & Nagios Ganglia – sFlow 16/25
  • 72. Maciej Lasyk, Ganglia & Nagios Ganglia – web (grid view) 17/25
  • 73. Maciej Lasyk, Ganglia & Nagios Ganglia – web (cluster view) 17/25
  • 74. Maciej Lasyk, Ganglia & Nagios Ganglia – web (physical view) 17/25
  • 75. Maciej Lasyk, Ganglia & Nagios Ganglia – web (host view) 17/25
  • 76. Maciej Lasyk, Ganglia & Nagios Ganglia – web (compare hosts) 17/25
  • 77. Maciej Lasyk, Ganglia & Nagios Ganglia – web (events) - Events have a JSON-based API - Think: integration with whatever app :) 17/25
  • 78. Maciej Lasyk, Ganglia & Nagios Ganglia – web (dashboards) - Create view -> apply as dashboard - Create dashboard from XML - Generate graphs and add to views 17/25
  • 79. Maciej Lasyk, Ganglia & Nagios Ganglia – web (graphs) 17/25
  • 80. Maciej Lasyk, Ganglia & Nagios Ganglia – metrics - base / extended metrics - own modules - c / c++ - mod_python - spoofing - gmetric - gmetric4j / java - Which to choose? gmetric / python / c/c++? 18/25
  • 88. Maciej Lasyk, Ganglia & Nagios Ganglia and logfiles? ganglia-logtailer - https://bitbucket.org/maplebed/ganglia-logtailer - parses logfiles (in real time) - pushes data to Ganglia (via gmetric) - yup – based on specific log formats - yet still – open source, so poke around ;) 19/25
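The parse-then-push idea on this slide can be sketched roughly as below. This is not ganglia-logtailer's own code: the log format, the `http_` metric prefix and both helper functions are hypothetical, made up for illustration. The only real interface assumed is the `gmetric` command line with its `--name`, `--value`, `--type` and `--units` options. The sketch only builds the commands and prints them; it does not run them.

```python
import re
import shlex

# Hypothetical log format (an assumption): '<ip> "<request>" <status> <bytes>'
LINE_RE = re.compile(r'^\S+ "[^"]*" (?P<status>\d{3}) \d+$')


def count_statuses(lines):
    """Count responses per HTTP status class (2xx, 4xx, 5xx, ...)."""
    counts = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # ignore lines that do not fit the assumed format
        key = match.group("status")[0] + "xx"
        counts[key] = counts.get(key, 0) + 1
    return counts


def gmetric_commands(counts, prefix="http_"):
    """Build one gmetric argument list per counter (nothing is executed)."""
    commands = []
    for key in sorted(counts):
        commands.append([
            "gmetric",
            "--name=" + prefix + key,
            "--value=" + str(counts[key]),
            "--type=uint32",
            "--units=requests",
        ])
    return commands


sample = [
    '192.0.2.1 "GET / HTTP/1.1" 200 512',
    '192.0.2.2 "GET /missing HTTP/1.1" 200 128',
    '192.0.2.3 "POST /api HTTP/1.1" 503 0',
    'not a log line',
]

for command in gmetric_commands(count_statuses(sample)):
    print(" ".join(shlex.quote(part) for part in command))
```

To actually publish the values, each argument list could be handed to `subprocess.run` on a host where gmond is configured; a real log tailer would also have to keep track of its read position between runs.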
  • 89. Maciej Lasyk, Ganglia & Nagios So... Nagios + Ganglia! 3 ways of integration: - ganglia-web/nagios (PHP & bash based) https://github.com/ganglia/ganglia-web - ganglia-nagios-bridge (Python & cron based) https://github.com/ganglia/ganglia-nagios-bridge - check-ganglia-metric (Python) https://github.com/ganglia/ganglia_contrib 20/25
  • 90. Maciej Lasyk, Ganglia & Nagios Nagios + Ganglia: ganglia-web/nagios https://github.com/ganglia/ganglia-web - Sending Nagios data to Ganglia: service_perfdata_command - Or replace Nagios checks with Ganglia! - Check heartbeat - Check a single metric on a specific host - Check multiple metrics on a specific host - Check multiple metrics across a regex-defined range of hosts 21/25
  • 91. Maciej Lasyk, Ganglia & Nagios Nagios + Ganglia: ganglia-web/nagios Nagios pulls info from Ganglia via HTTP 21/25
  • 92. Maciej Lasyk, Ganglia & Nagios Nagios + Ganglia: ganglia-nagios-bridge - https://github.com/ganglia/ganglia-nagios-bridge - Python script run e.g. from crontab - pulls data from Ganglia XML via sockets - parses XML - sends data to Nagios - Nagios receives the results only as passive checks 22/25
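The pull, parse and submit flow on this slide can be sketched as follows. This is not the ganglia-nagios-bridge code itself: the sample XML, the threshold and both helper functions are assumptions made for illustration. The two things taken as given are the Ganglia XML layout (`HOST` elements containing `METRIC` elements with `NAME` and `VAL` attributes) and the Nagios external command `PROCESS_SERVICE_CHECK_RESULT`, which is how passive service results are submitted.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of a gmetad XML dump (illustrative values).
SAMPLE_XML = """<GANGLIA_XML VERSION="3.6.0" SOURCE="gmetad">
<GRID NAME="demo">
<CLUSTER NAME="web">
<HOST NAME="host.example.com" IP="192.0.2.10">
<METRIC NAME="cpu_idle" VAL="4.5" TYPE="float" UNITS="%"/>
<METRIC NAME="load_one" VAL="0.3" TYPE="float" UNITS=""/>
</HOST>
</CLUSTER>
</GRID>
</GANGLIA_XML>"""


def metric_values(xml_text, metric_name):
    """Return {host name: value} for one metric found in a Ganglia XML dump."""
    root = ET.fromstring(xml_text)
    values = {}
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == metric_name:
                values[host.get("NAME")] = float(metric.get("VAL"))
    return values


def passive_result(timestamp, host, service, value, crit_below):
    """Format one passive service check result as a Nagios external command."""
    code = 2 if value < crit_below else 0  # 2 = CRITICAL, 0 = OK
    label = "CRITICAL" if code == 2 else "OK"
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s - %s=%s" % (
        timestamp, host, service, code, label, service, value)


for host, value in metric_values(SAMPLE_XML, "cpu_idle").items():
    # Hypothetical threshold: idle CPU below 10% counts as critical.
    print(passive_result(1396771200, host, "cpu_idle", value, crit_below=10.0))
```

In a real setup each formatted line would be appended to the Nagios external command file, and the matching service would need passive checks enabled; the path of that file depends on the installation.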
  • 93. Maciej Lasyk, Ganglia & Nagios Nagios + Ganglia: check_ganglia_metric - https://pypi.python.org/pypi/check_ganglia_metric/ - basically a Nagios plugin - pulls data from Ganglia XML via sockets - check_ganglia_metric.py --gmetad_host=gmetad-server.example.com --metric_host=host.example.com --metric_name=cpu_idle 23/25
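What "basically a Nagios plugin" implies can be sketched like this: read the XML dump from a gmetad socket, compare a value with thresholds, then report through the plugin exit status (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). This is not the check_ganglia_metric implementation. The function names and thresholds are hypothetical, and port 8651 is assumed to be gmetad's XML port, which is its usual default but may differ per installation.

```python
import socket

# Exit statuses defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def fetch_ganglia_xml(host, port=8651, timeout=5.0):
    """Read the whole XML dump that gmetad writes when a client connects."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmetad closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def evaluate(metric_name, value, warn_below, crit_below):
    """Map a metric value to (exit status, one-line plugin output).

    Lower values are treated as worse, which suits a metric like cpu_idle.
    """
    if value is None:
        code = UNKNOWN  # metric not found or not parseable
    elif value < crit_below:
        code = CRITICAL
    elif value < warn_below:
        code = WARNING
    else:
        code = OK
    return code, "%s - %s=%s" % (STATUS_NAMES[code], metric_name, value)


# Usage with a hard-coded value; a real plugin would parse the value out of
# fetch_ganglia_xml(...) and finish with sys.exit(code).
code, message = evaluate("cpu_idle", 93.2, warn_below=20.0, crit_below=10.0)
print(message)
```

Nagios reads the first line of output as the status text and the exit status as the service state, so the script only needs to print the message and exit with the returned code.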
  • 95. Maciej Lasyk, Ganglia & Nagios Nagios + Ganglia Which integration should I use? Seriously – try them yourself and test 24/25
  • 96. Maciej Lasyk, Ganglia & Nagios Freenode #ganglia https://lists.sourceforge.net/lists/listinfo/ganglia-general 24.5/25
  • 97. Maciej Lasyk, Ganglia & Nagios Sources? - “Monitoring with Ganglia” book - also nagios.org - and “Web Operations” book - plus some experience ;) 25/25
  • 98. Maciej Lasyk, Ganglia & Nagios Thank you :) Maciej Lasyk 11. Sesja Linuksowa 2014-04-06, Wrocław http://maciek.lasyk.info/sysop maciek@lasyk.info @docent-net 25/25