OSMC 2009 | Implementing a large monitoring infrastructure with Nagios and Ganglia by Spike Morelli

From 1 to 10K with Ganglia and Nagios
Spike Morelli aka Space Linden

About Second Life
3D Virtual World
Not a game

About Second Life
• Built by Residents
– Textured
– Scripted
– Animated
– Owned

About Second Life
Education
Business
Art & Design

About Second Life
~1M unique users/60 days
~77+ million concurrent scripts
~30K simulators
~2K square km = ~10x Nuremberg

Our Infrastructure
• 6K nodes simulation grid (simulator, dataservice, memcache, s3asset_proxy, region presence)
• 600 infrastructure nodes (databases, im<->email, region presence, groupchat, messaging system, logging system, websites, caches, data warehousing...)
• 3 data centres running, with a 4th being set up and another planned for Europe
• Integration of EC2 instances for development grids
• Size matters but services' complexity matters much more
• A 3D world is more complicated to run than a web farm

Monitoring?
What do you think of when people say 'Monitoring'?

Outage
• Big enough to get its own name tag...
• … and change the game

Outage
• How long to find the root cause?
• Do you really understand what's going on?
• Is that a symptom or a cause?
• Is something different from yesterday? From one hour ago? From five minutes ago?
• The bigger your infrastructure is, the harder it is to answer those questions, and the higher your chances are that something small will pass undetected, but have a devastating ripple effect which will be caught

Statistical Data
“They've done studies, you know. 60% of the time it works every time.”

Statistical Data
• INSTRUMENT EVERYTHING^W A LOT
• Too much data can be confusing and misleading
• Monitoring is not just about detecting failures
– Trends
– Capacity planning
– Bottleneck identification
– Identifying bugs in your code (e.g. memory usage doubled since the last commit)
• When we say monitoring we think about systems, but software engineering has a LOT to benefit from it

Ganglia
• Ganglia to the rescue
– Almost zero configuration required for new nodes
– Designed to scale from the ground up
– Agent based
– Can leverage multicast or multiple unicast targets for data redundancy
– Lightweight and smart use of network resources (could be better!)
– Plug Nagios performance data into Ganglia via gmetric
– Can support resolutions down to the second with good performance
– Supports custom RRDs for higher data resolution
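
The "Nagios performance data into Ganglia via gmetric" bullet can be sketched roughly as follows. This is a minimal illustration, not the setup used at Linden Lab: the function name `perfdata_to_gmetric` and the `service_label` naming scheme are hypothetical, and it assumes the standard Nagios plugin perfdata layout (`label=value[UOM];warn;crit;min;max`) and the `gmetric` options `--name`, `--value`, `--type` and `--units`. It only builds the command lines and does not run them.

```python
import re

# One perfdata item: label=value[UOM];warn;crit;min;max (label may be single-quoted)
PERF_ITEM = re.compile(
    r"(?P<label>'[^']+'|[^\s=]+)=(?P<value>-?\d+(?:\.\d+)?)(?P<uom>[^;\s]*)"
)


def perfdata_to_gmetric(service, perfdata):
    """Turn a Nagios perfdata string into a list of gmetric argument lists.

    The metric naming scheme (service_label) is an assumption for this sketch.
    """
    commands = []
    for match in PERF_ITEM.finditer(perfdata):
        label = match.group("label").strip("'")
        name = f"{service}_{label}".replace(" ", "_")
        # Report everything as a double to avoid integer range surprises.
        cmd = ["gmetric", "--name", name,
               "--value", match.group("value"), "--type", "double"]
        if match.group("uom"):
            cmd += ["--units", match.group("uom")]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    for cmd in perfdata_to_gmetric("http", "time=0.012s;1.0;2.0;0; size=512B;;;0"):
        print(" ".join(cmd))
```

Each returned list could be handed to `subprocess.run` on a host running a Ganglia agent, so that the value of a Nagios check becomes a graphed metric instead of being discarded after the OK/CRITICAL decision.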

Ganglia Infrastructure
7K nodes * ~40 metrics = 280K metrics collected and stored
(and plenty of capacity left)

RRD
• Defaults are insufficient (hour, day, week, month, year)
• rrdcached (in trunk, more work needed, some security issues)
• Tmpfs + cron for backup + sync at boot
• Application Buffer-Cache Management for Performance: Running the World's Largest MRTG
http://www.usenix.org/event/lisa07/tech/full_papers/plonka/plonka_html/index.html
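
When the default archives are not enough, custom round-robin archives have to be sized by hand. The sketch below shows the arithmetic behind an `RRA:CF:xff:steps:rows` definition as accepted by `rrdtool create`. The 15-second base step and the retention plan are assumptions chosen for illustration, not values from this talk, and the size estimate only counts 8-byte stored values, ignoring file headers.

```python
def rra(cf, step, resolution, retention, xff=0.5):
    """Build one RRA definition string and return it with its row count.

    step:       base step of the RRD in seconds
    resolution: seconds covered by one stored row (multiple of step)
    retention:  how many seconds of history the archive should hold
    """
    if resolution % step:
        raise ValueError("resolution must be a multiple of the base step")
    steps = resolution // step              # base steps consolidated per row
    rows = -(-retention // resolution)      # ceiling division
    return f"RRA:{cf}:{xff}:{steps}:{rows}", rows


STEP = 15  # assumed base step in seconds
DAY = 86400
# Hypothetical plan: (resolution in seconds, retention in seconds)
PLAN = [(15, DAY), (60, 7 * DAY), (3600, 365 * DAY)]

definitions = []
total_rows = 0
for resolution, retention in PLAN:
    text, rows = rra("AVERAGE", STEP, resolution, retention)
    definitions.append(text)
    total_rows += rows

# Rough data size per single-value RRD: one 8-byte double per row.
approx_bytes_per_metric = total_rows * 8

if __name__ == "__main__":
    print("\n".join(definitions))
    print(approx_bytes_per_metric, "bytes per metric (data only)")
```

Multiplying the per-metric estimate by the number of metrics gives a rough budget for the tmpfs that holds the RRD files, which is worth checking before raising resolution across the whole grid.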

Monitoring
• If you are thinking that there is a lot of duplication between Nagios and Ganglia, you're right, but you should...
• ...change the way you're thinking about monitoring!
• Don't think checks, think metrics
• Don't think about whether a service is up, but rather how it's doing
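
One way to read "think metrics, not checks" is to compare what a metric is doing now against its own recent history instead of asking a yes/no question. The sketch below is a hypothetical illustration: the function name `how_is_it_doing`, the window size and the 50% tolerance are assumptions, and the samples are expected to be a chronological list of numeric values such as those read back from an RRD.

```python
from statistics import mean


def how_is_it_doing(samples, now_window=5, tolerance=0.5):
    """Compare the latest samples against the older baseline.

    samples:    chronological list of metric values, oldest first
    now_window: how many trailing samples count as "now" (assumed value)
    tolerance:  relative change that counts as a notable shift (assumed value)
    """
    if len(samples) <= now_window:
        raise ValueError("need more history than the current window")
    baseline = mean(samples[:-now_window])
    current = mean(samples[-now_window:])
    if baseline == 0:
        change = 0.0 if current == 0 else float("inf")
    else:
        change = (current - baseline) / baseline
    return {
        "baseline": baseline,
        "current": current,
        "change": change,
        "shifted": abs(change) > tolerance,
    }


if __name__ == "__main__":
    # Memory usage that doubles in the most recent samples.
    history = [100] * 20 + [200] * 5
    print(how_is_it_doing(history))
```

A plain up/down check would report the service in this example as healthy the whole time; the metric view shows that its behaviour changed, which is the kind of answer the "is something different from an hour ago?" question needs.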

WYSINWTS
What You See Is Not What They See

WYSINWTS
• External Nagios
• External generic http (or tcp) proxy
• CDN-based 3rd-party monitoring service

Needed Improvements
• Dashboards and data analysis tools
• TCO (total cost of ownership)
• Monitoring tools for developers
• CEP (complex event processing)

Code!
• http://bitbucket.org/maplebed/ganglia-logtailer
• http://bitbucket.org/maplebed/ganglios

Thank You!
• Questions?
• Ideas?
• Contact me at space@lindenlab.com
• Thanks for listening!
– Please feel free to say hi and chat with me about these topics during the conference!
