Monitoring with Nagios and Ganglia

Maciej Lasyk, Ganglia & Nagios
Maciej Lasyk
11. Sesja Linuksowa
Wrocław, 2014-04-06
Ganglia & Nagios

Ganglia... what?
Ganglia – clusters / groups of neurons found outside the central nervous system

Just a little about monitoring
- the need for monitoring
- measuring availability
- measuring performance
- gathering additional metrics

Monitoring is critical for HA
How to measure availability?

A = Uptime / (Uptime + Downtime)

MTTD (Mean Time to Diagnose)
The average time it takes to diagnose the problem

MTTR (Mean Time to Repair)
The average time it takes to fix a problem

MTTF (Mean Time to Failure)
The average time the service behaves correctly before it fails

MTBF (Mean Time Between Failures)
The average time between consecutive failures of the service

A = MTTF / MTBF = MTTF / (MTTF + MTTD + MTTR)
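
The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and all inputs are assumed to be in the same time unit (e.g. hours).

```python
def availability_from_uptime(uptime: float, downtime: float) -> float:
    """A = Uptime / (Uptime + Downtime)."""
    total = uptime + downtime
    if total <= 0:
        raise ValueError("uptime + downtime must be positive")
    return uptime / total


def availability(mttf: float, mttd: float, mttr: float) -> float:
    """A = MTTF / MTBF, where MTBF = MTTF + MTTD + MTTR."""
    mtbf = mttf + mttd + mttr
    if mtbf <= 0:
        raise ValueError("MTBF must be positive")
    return mttf / mtbf
```

For example, with hypothetical values of 718 h of correct behavior, 0.5 h to diagnose and 1.5 h to repair, `availability(718.0, 0.5, 1.5)` evaluates 718 / 720.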

What should we monitor?
- hardware housing
- devices
- storage
- network
- hosts
- software (very deep hole)
Think dependencies!

When an outage hits us – don't panic!
- Notifications
- Escalations
L1 <-> L2 <-> L3 <-> L4 lol ;)
desktop support / devs / ops / networking / storage / middleware / dc / security
- Clock is ticking – it should be simple
- What if a cell phone is offline or someone is out?

Monitoring: notification issues
- false positives
- major events
- failover notifications?
- tolerance & critical thresholds

Monitoring: reporting
- baseline
- correlation between incidents and change management
- trending info
- reporting

Monitoring: good practices
- don't NIH (Not Invented Here)!
- DVCS
- test environments
- think usability!
- passive checks
- automate – don't hardcode
- security

Last but not least...
“Quis custodiet ipsos custodes?”
(Who will guard the guards?)

Nagios recap
Host / Services / Contacts
- hosts, hostgroups
- services, service groups
- templates
- time periods
- host and service dependencies
- regular expressions

Nagios recap
Checks and states
- frequencies & thresholds
- scheduling downtimes
- outages and flapping

Nagios recap
Notifications
- periods
- groups
- which states to be notified about?
- escalations / rotations
- custom notification methods

Nagios recap
Monitoring remote hosts
- NRPE daemons
- checks via SSH

Nagios recap
Web interface
- tactical overview
- availability reports
- trends
- network maps

Networking recap
- Unicast
- Multicast
- Broadcast
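
To make the three delivery modes above concrete, here is a small sketch that classifies an IPv4 destination address using only Python's standard `ipaddress` module. The function name and the sample network are made up for this example; 239.2.11.71 is used because, to my knowledge, it is the default gmond multicast group.

```python
import ipaddress


def classify_destination(address: str, network: str) -> str:
    """Return 'multicast', 'broadcast' or 'unicast' for an IPv4 destination.

    `network` is the local subnet in CIDR notation; it is needed to
    recognize the subnet's directed broadcast address.
    """
    ip = ipaddress.ip_address(address)
    net = ipaddress.ip_network(network)
    if ip.is_multicast:  # 224.0.0.0/4 for IPv4
        return "multicast"
    limited_broadcast = ipaddress.ip_address("255.255.255.255")
    if ip == net.broadcast_address or ip == limited_broadcast:
        return "broadcast"
    return "unicast"
```

For instance, `classify_destination("239.2.11.71", "192.168.1.0/24")` falls into the multicast branch, while `192.168.1.255` is the broadcast address of that subnet.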

Ganglia – what is it?
Problems of big scale:
20k hosts with a zillion metrics probed every 10 seconds
It is fully redundant (until you break it)
It is very scalable
Regexp searches and ad hoc creation of views :)

Ganglia – architecture

Ganglia – topologies
- Default multicast topology
- Deaf / mute multicast topology
- Unicast topology
- Gmetad topology
- Gmetad HA topology (active-active)
- Gmetad hierarchical topology

Ganglia – RRDcached

Ganglia – sFlow

Ganglia – web
- grid view
- cluster view
- physical view
- host view
- compare hosts

Ganglia – web (events)
Events have a JSON-based API
Think: integration with any app :)

Ganglia – web (dashboards)
- Create view -> apply as dashboard
- Create dashboard from XML
- Generate graphs and add to views

Ganglia – web (graphs)

Ganglia – metrics
- base / extended metrics
- own modules
- C / C++
- mod_python
- spoofing
- gmetric
- gmetric4j / Java
- Which to choose? gmetric / Python / C/C++?
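
As a rough illustration of the mod_python option above, here is a minimal sketch of a gmond Python metric module. It follows the usual module layout (a `metric_init` function returning metric descriptors, a per-metric callback, and `metric_cleanup`); the metric name `example_tmp_entries`, the group name and the measured quantity are invented for this example.

```python
import os
import tempfile


def tmp_entry_count(name):
    """Callback: gmond passes the metric name; we return the current value."""
    return len(os.listdir(tempfile.gettempdir()))


def metric_init(params):
    """Called once when the module is loaded; returns the metric descriptors."""
    return [{
        "name": "example_tmp_entries",   # hypothetical metric name
        "call_back": tmp_entry_count,
        "time_max": 90,                   # max seconds between reports
        "value_type": "uint",
        "units": "entries",
        "slope": "both",
        "format": "%u",
        "description": "Number of entries in the temp directory",
        "groups": "example",              # hypothetical metric group
    }]


def metric_cleanup():
    """Called when gmond shuts down; nothing to release in this sketch."""
    pass
```

The module can be exercised outside gmond by calling `metric_init({})` and then invoking each descriptor's `call_back` with its name, which makes testing before deployment easy.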

Ganglia and logfiles?
ganglia-logtailer
- https://bitbucket.org/maplebed/ganglia-logtailer
- parses logfiles (in real time)
- pushes data to Ganglia (via gmetric)
- yup – based on specific log formats
- yet still – open source so poke around ;)
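
The idea behind the logfile approach (parse lines, aggregate, push via gmetric) can be sketched as below. This is not ganglia-logtailer's own code or class interface; the log format (Apache-style access log), the function names and the `http_` metric prefix are assumptions for the example. The sketch only builds the `gmetric` argument lists and leaves running them (e.g. with `subprocess.run`) to the caller.

```python
import re
from collections import Counter

# Status code directly after the quoted request, e.g. ... "GET / HTTP/1.1" 200 512
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def count_status_classes(lines):
    """Count responses per status class (2xx, 4xx, 5xx, ...)."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts


def gmetric_commands(counts, prefix="http_"):
    """Build one gmetric command (as an argument list) per status class."""
    return [
        ["gmetric", f"--name={prefix}{cls}", f"--value={n}",
         "--type=uint32", "--units=requests"]
        for cls, n in sorted(counts.items())
    ]
```

Keeping parsing and submission separate makes it straightforward to swap in a parser for a different log format.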

So... Nagios + Ganglia!
3 ways of integration:
- ganglia-web/nagios (PHP & bash based)
https://github.com/ganglia/ganglia-web
- ganglia-nagios-bridge (Python & cron based)
https://github.com/ganglia/ganglia-nagios-bridge
- check_ganglia_metric (Python)
https://github.com/ganglia/ganglia_contrib

Nagios + Ganglia: ganglia-web/nagios
https://github.com/ganglia/ganglia-web
Sending Nagios data to Ganglia: service_perfdata_command
Or replace Nagios checks with Ganglia!
- Check heartbeat.
- Check a single metric on a specific host.
- Check multiple metrics on a specific host.
- Check multiple metrics across a regex-defined range of hosts
Nagios pulls info from Ganglia via HTTP
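
For the "sending Nagios data to Ganglia" direction, a command hooked into service_perfdata_command has to turn Nagios performance data into metrics. The following is a minimal sketch of that step, assuming the standard perfdata layout `label=value[UOM];warn;crit;min;max`; the function names and the `service_label` naming scheme are made up for the example and are not taken from ganglia-web.

```python
import re

# 'quoted label' or bare label, then =value, then an optional unit of measure
PERF_RE = re.compile(r"(?:'([^']+)'|([^\s=']+))=(-?\d+(?:\.\d+)?)([^;\s]*)")


def parse_perfdata(perfdata):
    """Return a list of (label, value, unit) tuples from a perfdata string."""
    metrics = []
    for match in PERF_RE.finditer(perfdata):
        label = match.group(1) or match.group(2)
        metrics.append((label, float(match.group(3)), match.group(4)))
    return metrics


def gmetric_args(service, label, value, unit):
    """Build a gmetric argument list for one perfdata value."""
    name = f"{service}_{label}".replace(" ", "_").replace("/", "_")
    return ["gmetric", f"--name={name}", f"--value={value}",
            "--type=double", f"--units={unit or 'count'}"]
```

Thresholds and min/max fields are ignored on purpose here, since Ganglia only needs the value and its unit.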

Nagios + Ganglia: ganglia-nagios-bridge
- https://github.com/ganglia/ganglia-nagios-bridge
- Python script run e.g. from crontab
- pulls data from Ganglia XML via sockets
- parses XML
- sends data to Nagios
- Nagios only processes passive check results
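
The flow described above (pull XML over a socket, parse it, hand results to Nagios as passive checks) can be sketched as follows. This is not the ganglia-nagios-bridge code: the function names are invented, port 8651 is assumed to be the gmetad XML port, and the result is formatted as a Nagios external command line (PROCESS_SERVICE_CHECK_RESULT), which may differ from how the real bridge delivers its results.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_xml(host, port=8651, timeout=5.0):
    """Read the full XML dump that gmetad writes to a connecting client."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def metrics_from_xml(xml_data):
    """Yield (host, metric, value) for every METRIC element under a HOST."""
    root = ET.fromstring(xml_data)
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            yield host.get("NAME"), metric.get("NAME"), metric.get("VAL")


def passive_result(timestamp, host, service, return_code, output):
    """Format one passive service check result as an external command."""
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}")
```

The lines produced by `passive_result` would then be written to the Nagios external command file, whose path depends on the installation.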

Nagios + Ganglia: check_ganglia_metric
- https://pypi.python.org/pypi/check_ganglia_metric/
- basically a Nagios plugin
- pulls data from Ganglia XML via sockets
- check_ganglia_metric.py
--gmetad_host=gmetad-server.example.com
--metric_host=host.example.com --metric_name=cpu_idle
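
What such a plugin does after fetching the metric is compare it with thresholds and report a Nagios state. The sketch below shows only that step; it is not the check_ganglia_metric implementation, and the function names and the `lower_is_worse` switch are assumptions for the example. The return values follow the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_metric(value, warning, critical, lower_is_worse=False):
    """Map a metric value to a Nagios state.

    With lower_is_worse=True (e.g. cpu_idle) the thresholds are lower
    bounds; otherwise (e.g. load) they are upper bounds.
    """
    if value is None:
        return UNKNOWN  # metric missing or stale
    if lower_is_worse:
        if value <= critical:
            return CRITICAL
        if value <= warning:
            return WARNING
    else:
        if value >= critical:
            return CRITICAL
        if value >= warning:
            return WARNING
    return OK


def plugin_output(metric_name, value, state):
    """First line of plugin output, with the value repeated as perfdata."""
    return f"{metric_name.upper()} {STATE_NAMES[state]} - {metric_name}={value} | {metric_name}={value}"
```

A real plugin would print this line and exit with the state as its process exit code, which is how Nagios learns the result.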

Nagios + Ganglia
Which integration should I use?
Seriously – try them yourself and test

Freenode #ganglia
https://lists.sourceforge.net/lists/listinfo/ganglia-general

sources?
- “Monitoring with Ganglia” book
- also nagios.org
- and “Web Operations” book
- plus some experience ;)

Maciej Lasyk
11. Sesja Linuksowa
2014-04-06, Wrocław
http://maciek.lasyk.info/sysop
maciek@lasyk.info
@docent-net
Ganglia & Nagios
Thank you :)
