From 1 to 10K with
Ganglia and Nagios
Spike Morelli aka Space Linden
About Second Life
3D Virtual World
Not a game
• Built by Residents
– Textured
– Scripted
– Animated
– Owned
Education
Business
Art & Design
~1M unique users/60 days
~77+ million concurrent scripts
~30K simulators
~2K square km = ~10x Nuremberg
Our Infrastructure
• 6K nodes simulation grid (simulator, dataservice, memcache, s3asset_proxy, region presence)
• 600 infrastructure nodes (databases, im<->email, region presence, groupchat, messaging system, logging system, websites, caches, data warehousing...)
• 3 DataCentres running with a 4th being set up and another planned for Europe
• Integration of EC2 instances for development grids
• Size matters but services' complexity matters much more
• A 3D world is more complicated to run than a web farm
Monitoring?
What do you think when people say 'Monitoring'?
Outage
• Big enough to get its own name tag...
• … and change the game
• How long to find the root cause?
• Do you really understand what's going on?
• Is that a symptom or a cause?
• Is something different from yesterday? From one hour ago? From five minutes ago?
• The bigger your infrastructure is, the harder it is to answer those questions, and the higher the chances that something small will pass undetected but have a devastating ripple effect which will be caught
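The "is something different from yesterday?" question can be made mechanical once metrics are stored over time. The following is a minimal sketch in Python, assuming samples are available as sorted (timestamp, value) pairs; the function names and the choice of offsets are hypothetical and not taken from the talk.

```python
from bisect import bisect_right


def value_at(samples, ts):
    """Return the most recent value at or before ts, or None if there is none.

    samples: list of (timestamp_seconds, value) pairs sorted by timestamp.
    """
    times = [t for t, _ in samples]
    i = bisect_right(times, ts)
    return samples[i - 1][1] if i else None


def changes_vs_baselines(samples, now, offsets=(300, 3600, 86400)):
    """Relative change of the current value vs. 5 minutes, 1 hour and 1 day ago."""
    current = value_at(samples, now)
    result = {}
    for off in offsets:
        past = value_at(samples, now - off)
        if current is None or past in (None, 0):
            result[off] = None  # no usable baseline for this offset
        else:
            result[off] = (current - past) / past
    return result
```

A value of 1.0 for the 86400 key means the metric has doubled since the same time yesterday; what counts as "different enough" is left to the caller.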
Statistical Data
“They've done studies, you know. 60% of the time it works every time.”
• INSTRUMENT EVERYTHING^W A LOT
• Too much data can be confusing and misleading
• Monitoring is not just about detecting failures
– Trends
– Capacity planning
– Bottleneck identification
– Identifying bugs in your code (e.g. memory usage doubled since the last commit)
• When we say monitoring we think about systems, but software engineering has a LOT to gain from it
Ganglia
• Ganglia to the rescue
– Almost zero configuration required for new nodes
– Designed to scale from the ground up
– Agent based
– Can leverage multicast or multiple unicast targets for data redundancy
– Lightweight and smart use of network resources (could be better!)
– Plug Nagios performance data into Ganglia via gmetric
– Can support resolutions down to the second with good performance
– Supports custom RRDs for higher data resolution
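Feeding Nagios performance data into Ganglia via gmetric could look roughly like the sketch below. It parses the standard Nagios perfdata format ('label'=value[UOM];warn;crit;min;max) and builds one gmetric command line per item. The function names, the metric naming scheme and the choice of the float type are assumptions for illustration; the commands are only built, not executed.

```python
import re

# One perfdata item: 'label'=value[UOM];[warn];[crit];[min];[max]
PERF_RE = re.compile(r"(?:'([^']+)'|([^\s=']+))=(-?[\d.]+)([^;\s]*)")


def parse_perfdata(perfdata):
    """Yield (label, value, unit) for each item in a Nagios perfdata string."""
    for m in PERF_RE.finditer(perfdata):
        label = m.group(1) or m.group(2)
        yield label, float(m.group(3)), m.group(4)


def gmetric_commands(check_name, perfdata):
    """Build one gmetric argv list per perfdata item (hypothetical naming scheme)."""
    cmds = []
    for label, value, unit in parse_perfdata(perfdata):
        name = f"{check_name}_{label}".replace(" ", "_")
        cmd = ["gmetric", "--name", name, "--value", str(value), "--type", "float"]
        if unit:
            cmd += ["--units", unit]
        cmds.append(cmd)
    return cmds
```

Each returned list can be handed to `subprocess.run` on a host where gmond is running; warning and critical thresholds are deliberately dropped here, since only the measured value is published as a metric.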
Ganglia Infrastructure
7K nodes * ~40 metrics = 280K metrics collected and stored
(and plenty of capacity left)
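To get a feel for what 280K metrics means for storage, the arithmetic can be written out as below. The node and metric counts come from the slide; the 15-second collection interval is a hypothetical value chosen for illustration, since the talk does not state one.

```python
nodes = 7_000
metrics_per_node = 40          # approximate figure from the slide
total_metrics = nodes * metrics_per_node


def updates_per_second(total, interval_s):
    """Average metric updates per second if each metric is written once per interval."""
    return total / interval_s


assumed_interval_s = 15        # hypothetical, not stated in the talk
rate = updates_per_second(total_metrics, assumed_interval_s)
```

Under that assumption the collector has to absorb on the order of tens of thousands of small writes per second, which is the kind of load that makes RRD write performance a concern.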
RRD
• Defaults are insufficient (hour, day, week, month, year)
• rrdcached (in trunk, more work needed, some security issues)
• Tmpfs + cron for backup + sync at boot
• Application Buffer-Cache Management for Performance: Running the World's Largest MRTG
http://www.usenix.org/event/lisa07/tech/full_papers/plonka/plonka_html/index.html
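Going beyond the default archives means defining custom RRAs. The sketch below computes RRA definitions from a desired resolution and retention and assembles an `rrdtool create` command line. The data source name, the heartbeat multiplier and the example retention periods are assumptions for illustration, not the values used in the talk.

```python
def rra(cf, resolution_s, retention_s, step_s, xff=0.5):
    """Build an RRA spec that keeps retention_s of data at one row per resolution_s."""
    if resolution_s % step_s:
        raise ValueError("resolution must be a multiple of the step")
    steps_per_row = resolution_s // step_s
    rows = retention_s // resolution_s
    return f"RRA:{cf}:{xff}:{steps_per_row}:{rows}"


def create_command(path, step_s, archives):
    """Return an `rrdtool create` argv list with a single GAUGE data source.

    archives: iterable of (resolution_seconds, retention_seconds) pairs.
    """
    heartbeat = step_s * 8     # hypothetical choice
    cmd = ["rrdtool", "create", path, "--step", str(step_s),
           f"DS:value:GAUGE:{heartbeat}:U:U"]
    cmd += [rra("AVERAGE", res, keep, step_s) for res, keep in archives]
    return cmd


DAY = 86400
# Example: 15 s resolution for a day, 1 min resolution for a week.
example = create_command("load_one.rrd", 15, [(15, DAY), (60, 7 * DAY)])
```

Higher resolution multiplies the row count and therefore the file size per metric, so retention at the finest resolution is usually kept short.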
Monitoring
• If you are thinking that there is a lot of duplication
between Nagios and Ganglia you're right, but you
should...
• ...change the way you're thinking about monitoring!
• Don't think checks, think metrics
• Don't think about whether a service is up, but rather how it's doing
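One way to read "think metrics, not checks" is that the alert becomes a thin layer over a value that is already being collected. The sketch below derives a Nagios-style status and plugin output line from a metric value and two thresholds. The exit codes are the standard Nagios plugin codes; the function names and the "higher is worse" threshold direction are assumptions.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3   # standard Nagios plugin exit codes


def status_from_metric(value, warn, crit):
    """Map a metric value to a Nagios status, assuming higher values are worse."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def plugin_output(name, value, warn, crit):
    """Return (exit_code, output_line) in the usual 'text | perfdata' layout."""
    status = status_from_metric(value, warn, crit)
    label = ("OK", "WARNING", "CRITICAL", "UNKNOWN")[status]
    if value is None:
        return status, f"{label} - {name} has no recent value"
    return status, f"{label} - {name}={value} | {name}={value};{warn};{crit}"
```

The same stored metric then serves both purposes: the graph shows how the service is doing over time, and the thresholds decide when someone gets paged.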
Fault Detection with Nagios
WYSINWTS
What You See Is Not What They See
WYSINWTS
• External Nagios
• External generic http (or tcp) proxy
• CDN-based 3rd party monitoring service
Needed Improvements
• Dashboards and data analysis tools
• TCO
• Monitoring tools for developers
• CEP
Code!
• http://bitbucket.org/maplebed/ganglia-logtailer
• http://bitbucket.org/maplebed/ganglios
Thank You!
• Questions?
• Ideas?
• Contact me at space@lindenlab.com
• Thanks for listening!
– Please feel free to say hi and chat with me about these
topics during the conference!

OSMC 2009 | Implementing a large monitoring infrastructure with Nagios and Ganglia by Spike Morelli
