Monitoring at a SaaS Startup: Tradeoffs and Tools
Bridget Kromhout
8thbridge.com
small social commerce startup
acquired in the last week by Fluid, Inc.
small devteam
I am the ops team
twisty maze of little shell scripts
bespoke artisanal monitoring
difficult to modify;
doesn’t scale
http://www.pcgameshardware.de/screenshots/1280x1024/2007/07/CA01.jpg
New Relic
pros:
nice graphs
application-level view
good error analysis
cons:
slow to update
many false-positive alerts
high prices (better now)
Motivating Change
http://99designs.com/illustrations/contests/illustration-pagerduty-161025/entries
Nagios: as hideous as you remember
https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/
“Horrendous interface”
“Well, it’s more “old” than anything else. At least everything is in the same place as you left it because it’s been the same since 1912.”
“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”
-- @murphy_slaw (via @lozzd)
HBase: monitor all the ports?!?
hbck: the HBase consistency checker
nagios -> bash script -> parsing output of hbck
http://www.ymc.ch/en/how-to-monitor-hbase-health-by-nagios
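The nagios -> script -> hbck chain can also be sketched in Python rather than bash. This is a minimal sketch, not the check from the talk: it assumes the hbck report contains a line of the form "<N> inconsistencies detected.", and the function name and status messages are made up for illustration. The exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin codes.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_hbck(output):
    """Map captured hbck text to a (Nagios exit code, status line) pair."""
    # Assumption: the report contains "<N> inconsistencies detected."
    match = re.search(r"(\d+) inconsistencies detected", output)
    if match is None:
        # No summary line at all: report UNKNOWN instead of guessing.
        return UNKNOWN, "HBCK UNKNOWN - no inconsistency summary found"
    count = int(match.group(1))
    if count == 0:
        return OK, "HBCK OK - 0 inconsistencies"
    return CRITICAL, "HBCK CRITICAL - %d inconsistencies" % count
```

A real check would capture the output of the hbck command, print the status line, and exit with the returned code so Nagios can pick it up.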
adding alert after alert after...
http://modiinhub.com/wp-content/uploads/2014/02/logo-mongodb-tagline.png
MMS (MongoDB Monitoring Service)
“cyber” monday:
1988 called; wants its word back.
the rewards of hubris
MMS showed the issue
but we weren't alerting on it
didn't understand the global write lock
“If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving yet, just in case it decides to make a run for it.” -- @indec
http://codeascraft.com/2011/02/15/measure-anything-measure-everything/
Graphite & StatsD
➔ Graphite
◆ Store and visualize time-series data
◆ http://graphite.readthedocs.org/
➔ StatsD
◆ Measure everything! (Timers, counters, events, …)
◆ https://github.com/etsy/statsd/
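Sending a timer or counter to StatsD takes very little code, which is part of why "measure everything" is practical. The sketch below formats metrics in the StatsD line protocol (`<bucket>:<value>|<type>`, with `c` for counters and `ms` for timers) and sends them over UDP to the default port 8125. The bucket names and helper names are hypothetical, chosen only for illustration.

```python
import socket


def statsd_line(bucket, value, metric_type):
    """Format one metric in the StatsD line protocol: <bucket>:<value>|<type>."""
    return "%s:%s|%s" % (bucket, value, metric_type)


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget a single metric over UDP (8125 is StatsD's default port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


# Hypothetical bucket names, for illustration only.
counter = statsd_line("checkout.completed", 1, "c")      # counter: add 1
timer = statsd_line("checkout.render_time", 320, "ms")   # timer: 320 ms
```

Calling `send_metric(counter)` is enough to record one event. Because the transport is UDP, the application does not block or fail if the StatsD daemon is down.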
Where we were
➔ Graphite 0.9.9 (wanted 0.9.12)
◆ over 2 years old
◆ missing new features (consolidateBy!)
➔ StatsD was newish, but…
◆ hand-rolled
◆ running in a screen session
◆ on a special snowflake box
Community cookbooks?
➔ Graphite ones good, but…
◆ focus on Apache (we use nginx)
◆ we haven’t moved to Chef 11 (gasp!)
➔ StatsD
◆ https://github.com/librato/statsd-cookbook
◆ launches daemons via upstart
◆ generates config file based on attributes
Graphite cookbook (Part 1)
➔ Install in a virtualenv (django, uwsgi, nginx)
➔ Dependencies recommended
◆ https://github.com/graphite-project/graphite-web/blob/master/requirements.txt
➔ libcairo2-dev package on Ubuntu 12.04 LTS
➔ install graphite’s 3 parts via pip
Graphite cookbook (Part 2)
➔ graphite-web
◆ Django app, renders graphs
➔ whisper
◆ fixed-size database for storing time-series data
◆ like RRD
➔ carbon
◆ carbon-cache.py - stores data
◆ carbon-aggregator.py - buffers, then stores
◆ carbon-relay.py - for sharding/replication
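"Fixed-size" means a whisper file's size is determined entirely by its retention settings, not by how much data has arrived. The sketch below estimates that size. It assumes the on-disk layout of recent whisper versions (a 16-byte header, 12 bytes of archive info per archive, 12 bytes per datapoint); the retention values are hypothetical.

```python
# Sizes assumed from whisper's struct formats ("!2LfL", "!3L", "!Ld").
METADATA_SIZE = 16      # file header
ARCHIVE_INFO_SIZE = 12  # one entry per archive
POINT_SIZE = 12         # 4-byte timestamp + 8-byte double


def whisper_file_size(archives):
    """Return the byte size of a whisper file.

    `archives` is a list of (seconds_per_point, retention_seconds) pairs.
    """
    total_points = sum(retention // step for step, retention in archives)
    return METADATA_SIZE + ARCHIVE_INFO_SIZE * len(archives) + POINT_SIZE * total_points


# Hypothetical retention: 10s resolution for 1 day, 60s resolution for 30 days.
archives = [(10, 86400), (60, 30 * 86400)]
size = whisper_file_size(archives)
```

With these settings every metric occupies 622,120 bytes from the moment it is created, so multiplying by the expected metric count gives a rough disk budget.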
when in doubt: tcpdump is your friend
http://blog.johngoulah.com/2012/10/looking-under-the-covers-of-statsd/
carbon-aggravator (between 0.9.10 & 0.9.12)
# If set true, metric received will be forwarded to DESTINATIONS in addition to
# the output of the aggregation rules. If set false the carbon-aggregator will
# only ever send the output of aggregation.
FORWARD_ALL = True
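The two modes described in that comment can be shown with a toy model. This is not carbon's code, only an illustration of what the comment says: with the flag on, raw metrics reach the destinations alongside the aggregates; with it off, only the aggregates do. The function and metric names are hypothetical.

```python
def aggregator_output(received, aggregated, forward_all):
    """Return the metric names a carbon-aggregator would send on.

    `received` holds raw metric names that arrived, `aggregated` holds the
    names produced by aggregation rules. Toy model, not carbon's code.
    """
    sent = list(aggregated)
    if forward_all:
        # Raw metrics pass through in addition to the aggregates.
        sent = list(received) + sent
    return sent


# Hypothetical metric names.
raw = ["web1.requests", "web2.requests"]
agg = ["all.requests"]
```

If per-host graphs go missing after an upgrade while the aggregate graphs keep working, this setting is one place worth checking.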
Carbonate
whisper-fill.py
backfill datapoints between whisper files
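The idea behind backfilling can be sketched without whisper at all. The example below is an illustration of the concept, not Carbonate's implementation: it models each file as a dict of timestamp to value, with None marking a gap, and copies a point across only where the destination has nothing. All names and data are hypothetical.

```python
def backfill(src, dst):
    """Copy points from src into dst wherever dst has no value.

    Both arguments map timestamp -> value, with None marking a gap.
    Returns a new dict; existing dst values are never overwritten.
    """
    filled = dict(dst)
    for timestamp, value in src.items():
        if filled.get(timestamp) is None and value is not None:
            filled[timestamp] = value
    return filled


# Hypothetical data: dst lost two points that src still has.
src = {100: 1.0, 110: 2.0, 120: 3.0}
dst = {100: 1.0, 110: None, 120: None}
```

Here `backfill(src, dst)` restores the points at 110 and 120 and leaves the point at 100 untouched, which is the property that makes a backfill safe to run against a live file.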
2am: sudden drop-off
8am: look at graphs: ?!?!
10am: and we’re back.
What’s next?
❏ finds real problems
❏ actionable alerting
❏ usable by all
❏ …?
the ideal monitoring solution...
http://www.quickmeme.com/img/f5/f512ff9bee084263df5571d3c81388019dcb063173e1dbcfa2babac9274576b6.jpg
What we’re actually using now
StatsD
Application-level error analysis
Alarms for autoscaling
Timers & counters
Log & host-level
Hadoop & HBase
visualization
MongoDB
Graphs
Time-series data graphing
client-side plugins
External uptime checks
oncall rotation/alerting
Threshold-based alarms
Dashboard
Discuss!
Twitter: @bridgetkromhout
Email: bridget@kromhout.org
