Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

Nagios in the Agile / DevOps /
Continuous Deployment World
Kishore Jalleda
Director of Operations
IMVU, Inc
kjalleda@imvu.com

About IMVU

Avatar-based Social Entertainment destination
$50+ Million Annual Revenue
100+ Million Registered Users
10+ Million Items in Virtual Catalog

IMVU Engineering and Continuous Deployment

►Doing the Impossible 50 times a day
►Continuous deployment (CD) is real
►IMVU has been one of the pioneers of CD
►DevOps culture is big
►No approval needed to ship to 1% of customers

Check out our engineering blog
http://engineering.imvu.com/

What does this mean ?

►Things change quickly
►New features add up instantly
►Can break frequently
►Failures can cascade rapidly
►Things can fall through the cracks
►Many things change at the same time
►Etc

Overview

►Nagios Core 3.2.0
►800+ Hosts
►18000+ Service Checks
►Single Nagios Instance
►8 cores, 8GB RAM

Server Lifecycle Management

Purchase & Asset Management → DHCP, Preseed, DNS → CFEngine → Opspush → Nagios, Cacti, Istatd → CFEngine → Production → Decommission

[ Operations ] Continuous
Integration and Deployment

IMVU Asset Database ( AssetDB )

►Built internally by IMVU
►Simple but powerful concept
►Source of truth for everything asset related
►Has information on
   ►Class ( mysql, standard-http-server, redis )
   ►Role ( customer shard, clientdynweb )
   ►Tag ( available, no-update )
   ►Attributes ( cpu-cores, memory-size, mysql-role )
   ►Much more …

Auto generation of Nagios configuration files

# generate_nagios_conf.pl
( most configurations auto generated from AssetDB )
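
A minimal Python sketch of this idea. The talk's generator is a Perl script whose contents are not shown, so the asset records, the field names, the "only monitor assets tagged available" rule, and the `generic-host` template below are all assumptions for illustration. Only the `define host` / `define hostgroup` object syntax is standard Nagios.

```python
from collections import defaultdict

# Hypothetical AssetDB rows; the real AssetDB schema is not shown in the talk.
ASSETS = [
    {"name": "db01", "address": "10.0.0.11", "class": "mysql", "tags": ["available"]},
    {"name": "web01", "address": "10.0.0.21", "class": "standard-http-server", "tags": ["available"]},
    {"name": "web02", "address": "10.0.0.22", "class": "standard-http-server", "tags": ["no-update"]},
]


def render_host(asset):
    """Render one Nagios host object from an asset record."""
    return (
        "define host {\n"
        "    use generic-host\n"  # assumed host template
        f"    host_name {asset['name']}\n"
        f"    alias {asset['name']}\n"
        f"    address {asset['address']}\n"
        "}\n"
    )


def render_hostgroups(assets):
    """Render one hostgroup per asset class, with its member hosts."""
    members = defaultdict(list)
    for asset in assets:
        members[asset["class"]].append(asset["name"])
    blocks = []
    for group in sorted(members):
        blocks.append(
            "define hostgroup {\n"
            f"    hostgroup_name {group}\n"
            f"    alias {group}\n"
            f"    members {','.join(sorted(members[group]))}\n"
            "}\n"
        )
    return "\n".join(blocks)


def generate(assets):
    """Return the text of hosts.cfg and hostgroups.cfg as a tuple."""
    # Assumption: only assets tagged "available" are monitored.
    monitored = [a for a in assets if "available" in a["tags"]]
    hosts_cfg = "\n".join(render_host(a) for a in monitored)
    return hosts_cfg, render_hostgroups(monitored)
```

Writing the two returned strings to hosts.cfg and hostgroups.cfg gives files that can then be committed and validated like any other change.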

Ops Buildbot ( builds, builders/buildslaves )

# svn commit hosts.cfg hostgroups.cfg

Opspush ( Operations Push System )

# opspush --comment "xxxxxx" --role nagios

opspush → check status of “last build”
► green → run “cfagent -v” on the box
► red → --oncall-override ?
   ► yes → run “cfagent -v” on the box
   ► No → exit
► Related flag: --use-last-green-rev
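
The gate in this flow can be sketched in a few lines of Python. This is not Opspush's code, which the talk does not show. The function name and return strings are hypothetical, and the `--use-last-green-rev` path is left out because the slide does not make its behavior clear.

```python
def opspush_decision(last_build_status, oncall_override=False):
    """Return the next step for a push, following the flow on this slide.

    last_build_status: "green" or "red" (status of the last build).
    oncall_override: True when the operator passed --oncall-override.
    """
    if last_build_status == "green":
        return "run cfagent -v"
    if last_build_status == "red" and oncall_override:
        # Assumption: an on-call override lets the push proceed on a red build.
        return "run cfagent -v"
    # Red build without an override, or an unrecognized status: do not push.
    return "exit"
```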

Product Development

► Ideation, UI Design, Usability Testing, etc
► Tech Design
► Monitoring and Alerting Coverage.. ( Nagios )
► Production
► Maintenance

Tech Designs & New Nagios Alert Requests

Nagios Alert Request Template

Big Data / De-Sharding

► Data freshness is critical to help make the right
business decisions
► Nagios used for ETL/DW status and error
checking
► Nagios and Ops embeds can help empower
your Data Infrastructure team

Things will FAIL

How we try to prevent and catch failures

► Local Acceptance Tests
► Hypo Builds
► Buildbot (CI)
► Automated Cluster Immunity
► Manual QA using roll-out
► Nagios
► 3rd party like webmetrics, customers, etc

Cluster Immune System

Automated push monitoring and rollback !
► Push to X% of servers → Monitor Critical Metrics
► Good → Push to rest → Monitor Critical Metrics
► Good → w00t!, my change is Live
► Bad ( at either stage ) → Auto Rollback

Don’t just rely on Standard Metrics
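
A rough Python sketch of the push, monitor, and rollback loop. It is not IMVU's implementation. The `deploy`, `rollback`, and `metrics_ok` callables are hypothetical hooks supplied by the caller, and `canary_fraction` stands in for the slide's "X%".

```python
def immune_push(servers, canary_fraction, deploy, rollback, metrics_ok):
    """Push to a fraction of servers, then to the rest, rolling back on bad metrics.

    deploy(server) and rollback(server) apply or revert the change on one server.
    metrics_ok() returns True while the critical metrics look healthy.
    Returns "live" or "rolled back".
    """
    canary_count = max(1, int(len(servers) * canary_fraction))
    canary, rest = servers[:canary_count], servers[canary_count:]

    deployed = []
    for stage in (canary, rest):
        for server in stage:
            deploy(server)
            deployed.append(server)
        # Check critical metrics after each stage, not only after the canary.
        if not metrics_ok():
            for server in reversed(deployed):
                rollback(server)
            return "rolled back"
    return "live"
```

The value of the check depends on what `metrics_ok` looks at. Business and application level metrics belong there alongside the standard system ones.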

Demystifying P1s ( Priority 1 )

P1: Priority 1 issue impacting live operations
Phases
► Identification (Nagios )
► Communication and Declaration
► Resolution
► Postmortem / 5 Whys / Root Cause Analysis
► P1 follow up

5 Why / Postmortem (PM) / Root Cause Analysis

► 5 Why process
► Amazing culture of running blameless
postmortems
► New Nagios checks are the most common
action items
► A lot of monitoring and alerting on business
and application level metrics was originally the
outcome of PMs

Example “5 Whys” Process

Monitor Business & Application Level Metrics

Monitor Response Times

Load Average is a meaningless number :)

Continuous Monitoring ( Istatd )

► Developed by IMVU
► Sub 10 sec resolution of data
► API to get average, SD, min, max, sample count
for each data point in a graph
► Ability to stack multiple graphs on the fly
► Long retention times
► Releasing as open source this week !!!
https://github.com/imvu-open/istatd/wiki
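
To illustrate what per-data-point statistics at 10 second resolution mean, here is a small Python sketch. It is not Istatd's API or storage format. The function name and the `(timestamp, value)` sample shape are assumptions, and SD is taken here as the population standard deviation.

```python
import statistics
from collections import defaultdict


def bucket_stats(samples, bucket_seconds=10):
    """Group (timestamp, value) samples into fixed buckets and summarize each.

    Returns {bucket_start: {"avg", "sd", "min", "max", "count"}}.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        start = int(timestamp // bucket_seconds) * bucket_seconds
        buckets[start].append(value)

    return {
        start: {
            "avg": statistics.fmean(values),
            "sd": statistics.pstdev(values),  # population standard deviation
            "min": min(values),
            "max": max(values),
            "count": len(values),
        }
        for start, values in sorted(buckets.items())
    }
```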

Istatd: 10 Second Resolution of Data

Istatd: Stacking graphs on the fly

Have a “Strategy” for Monitoring
and Alerting

Our (Nagios) Strategy

► Human element of Monitoring and Alerting ( Nagios )
► Nagios & Test Driven Development ( TDD )
► Decouple ( Nagios )
► Aggregated Checks

Human Element of Monitoring and Alerting

► Have zero tolerance towards False Positives.
You do not want your ops staff to walk into the
office next AM looking like zombies ;)
► Do not let people develop immunity to pages as
very soon real issues will be ignored
► “All pages are actionable” policy: if there is no
action, it should not be paging
► Automatic re-enabling of alerting/notifications that
were improperly silenced
► Ownership and accountability of issues/alerts

Daily Triage of Nagios Alerts and Interrupts

Nagios & Test Driven Development (TDD)

► Write tests for your Nagios Infrastructure
► Adopted heavily by Ops ( important to keep pace
with eng; DevOps culture is awesome :) )
► High degree of confidence in pushing changes
► Things will eventually change ( OS, libraries,
logic, people, Nagios version, etc ). Tests will
make the change much smoother.
► Functional testing can still be a challenge
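
As one possible shape for such tests, the Python sketch below unit-tests the threshold logic of a check. The `check_threshold` function is a hypothetical example, not one of IMVU's checks. The 0/1/2/3 values are the standard Nagios plugin return codes.

```python
import unittest

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_threshold(value, warn, crit):
    """Map a measured value to a Nagios status and a one-line message."""
    if value is None:
        return UNKNOWN, "UNKNOWN - no data"
    if value >= crit:
        return CRITICAL, f"CRITICAL - {value} >= {crit}"
    if value >= warn:
        return WARNING, f"WARNING - {value} >= {warn}"
    return OK, f"OK - {value}"


class CheckThresholdTest(unittest.TestCase):
    def test_below_warn_is_ok(self):
        self.assertEqual(check_threshold(5, warn=10, crit=20)[0], OK)

    def test_at_warn_is_warning(self):
        self.assertEqual(check_threshold(10, warn=10, crit=20)[0], WARNING)

    def test_at_crit_is_critical(self):
        self.assertEqual(check_threshold(20, warn=10, crit=20)[0], CRITICAL)

    def test_missing_data_is_unknown(self):
        self.assertEqual(check_threshold(None, warn=10, crit=20)[0], UNKNOWN)
```

Saved as a module, the tests run with `python -m unittest`, so a build can go red before a broken check reaches production.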

Sample Nagios Test Output

Decouple Nagios

We do it using “Fact, Worker, Reporter & Aggregator” Model

Worker → fact → Redis
Redis → fact → Reporter → status
Redis → fact → Aggregator → status
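
A minimal Python sketch of the worker and reporter halves of this model. The talk does not show the real code, so every name here is hypothetical. An in-memory `FactStore` stands in for Redis to keep the example self-contained, and treating stale facts as UNKNOWN is an added assumption.

```python
import time

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


class FactStore:
    """In-memory stand-in for Redis: maps (host, fact name) to (value, timestamp)."""

    def __init__(self):
        self._facts = {}

    def publish(self, host, name, value, timestamp):
        self._facts[(host, name)] = (value, timestamp)

    def fetch(self, host, name):
        return self._facts.get((host, name))


def worker(store, host, measurements, now=None):
    """Runs on each host: publish raw facts, with no thresholds applied."""
    now = time.time() if now is None else now
    for name, value in measurements.items():
        store.publish(host, name, value, now)


def reporter(store, host, name, warn, crit, max_age=60, now=None):
    """Runs next to Nagios: turn a stored fact into a status code."""
    now = time.time() if now is None else now
    fact = store.fetch(host, name)
    if fact is None:
        return UNKNOWN
    value, timestamp = fact
    if now - timestamp > max_age:
        return UNKNOWN  # assumption: a stale fact is treated as unknown
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK
```

Because thresholds live only in the reporter, changing them does not require touching the monitored hosts.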

Why Decouple ?

► For scalability and efficiency
► Our model performed better than NRPE
► Lets you make changes ( like thresholds ) in
one place instead of on something like 1000
machines ( if using NRPE )
► Lets you do aggregated checks, which is again
a very simple but powerful concept that reduces
paging levels by a ton
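
An aggregated check can be sketched as follows in Python. The function name and the 10% / 25% thresholds are assumptions for illustration, not values from the talk. The idea is that one service pages when too large a share of a cluster is unhealthy, instead of every host paging on its own.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def aggregate(statuses, warn_fraction=0.10, crit_fraction=0.25):
    """Collapse per-host statuses into a single status for the whole cluster.

    statuses: dict mapping host name to that host's status code.
    A host counts as failing when its status is anything other than OK.
    """
    if not statuses:
        return UNKNOWN
    failing = sum(1 for status in statuses.values() if status != OK)
    fraction = failing / len(statuses)
    if fraction >= crit_fraction:
        return CRITICAL
    if fraction >= warn_fraction:
        return WARNING
    return OK
```

With these thresholds, one bad host out of twenty stays quiet, three raise a warning, and five are critical.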

Closing Remarks

► Monitoring and Alerting (M&A) is mission critical for
any business, invest properly and smartly in it
► Don’t limit the usage of Nagios to just Ops. The secret
to widespread adoption is to make things frictionless
► Bathroom breaks can take 5-10 minutes, so don’t fret
too much about Nagios performance
► Build some form of predictive monitoring and alerting
to catch and alert on change in trends
► Invest in configuration automation, validation and
compliance
► Finally, Nagios has been like a Honda, very reliable !!!

Thank You !!!

kjalleda@imvu.com
We are Hiring: imvu.com/jobs
Engineering Blog: http://engineering.imvu.com/
