Winning the metrics battle (finally)
Simon Hildrew, Infrastructure Developer, The Guardian
Nick Satterly, Monitoring Engineer, The Guardian
The metrics battlefield
Total metrics
1,400   2,800   50,000   180,000
http://www.flickr.com/photos/ghostsigns/6676069121
5 minutes
every 15 seconds
http://www.flickr.com/photos/millynet/134071210
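A rough sketch of what the two figures above imply, assuming they are the old and new collection intervals per metric (the slides do not state this explicitly). The function name and constants are illustrative only.

```python
# Assumption: "5 minutes" and "every 15 seconds" are sampling intervals
# for a single metric. The names below are hypothetical.
OLD_INTERVAL_S = 5 * 60   # one sample every 5 minutes
NEW_INTERVAL_S = 15       # one sample every 15 seconds

SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(interval_s: int) -> int:
    """Number of data points one metric produces in a day."""
    return SECONDS_PER_DAY // interval_s


# How many times more data points the shorter interval produces.
ratio = OLD_INTERVAL_S // NEW_INTERVAL_S

if __name__ == "__main__":
    print("old:", samples_per_day(OLD_INTERVAL_S), "samples/day")
    print("new:", samples_per_day(NEW_INTERVAL_S), "samples/day")
    print("ratio:", ratio)
```

Under that assumption, each metric goes from 288 to 5,760 data points a day, a twentyfold increase in resolution.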
developer dashboards
Physical screens   Screensaver hacks
[Chart: vertical axis 0 to 20; labels: dev, hack]
business dashboards
metrics + dashboards = culture change
http://www.flickr.com/photos/chrisjames_taylor/5454315456
our approach
         Side project    ➡   Prioritise
Incremental upgrade      ➡   Understand the real problem
Use off the shelf tool   ➡   Question the tools
  Pragmatic solution     ➡   Be ambitious
      Done in a year     ➡   Keep learning
Prioritise
drowning in work




http://www.flickr.com/photos/iampeas/246738971
a dedicated monitoring and
     metrics engineer
Understand the
 real problem
Urgent issue -
current tool end of life
The story so far...
metrics were not helping us
 solve production outages
ballooning number of
     applications
but... difficult to instrument applications
T.T. Fix = T.T. Detect + T.T. Diagnose + T.T. Resolve
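The time-to-fix breakdown above can be sketched as a small calculation. The class name, field names, and incident durations below are hypothetical and are not figures from the talk; they only show how the three terms add up and how to see which phase dominates.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Incident:
    """Durations, in minutes, of the three phases of one outage."""
    detect_min: float
    diagnose_min: float
    resolve_min: float

    @property
    def fix_min(self) -> float:
        # T.T. Fix = T.T. Detect + T.T. Diagnose + T.T. Resolve
        return self.detect_min + self.diagnose_min + self.resolve_min

    def shares(self) -> Dict[str, float]:
        """Fraction of the total time spent in each phase."""
        total = self.fix_min
        return {
            "detect": self.detect_min / total,
            "diagnose": self.diagnose_min / total,
            "resolve": self.resolve_min / total,
        }


# Hypothetical incident: numbers invented for illustration.
example = Incident(detect_min=10, diagnose_min=45, resolve_min=5)

if __name__ == "__main__":
    print("time to fix:", example.fix_min, "minutes")
    for phase, share in example.shares().items():
        print(f"{phase}: {share:.0%}")
```

Splitting the total this way makes it easier to see whether better metrics would help most with detection or with diagnosis.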
inaccessible tools




             http://www.flickr.com/photos/kdashy/2678539087
inconsistent data



http://www.flickr.com/photos/sybrenstuvel/2468506922
hypothesising & arguing
 easier than measuring


               http://www.flickr.com/photos/nouqraz/200049988
The ‘right’ thing
• measure everything
• measure frequently
• measure each data point once
• input and output must be open
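As one illustration of an open input, Graphite (part of the stack listed later in this deck) accepts metrics over a plaintext protocol: one line per data point in the form `path value timestamp`, sent over TCP to the Carbon receiver, which listens on port 2003 by default. The sketch below is not taken from the talk; the metric name, host, and helper functions are assumptions made for the example.

```python
import socket
import time
from typing import Optional


def format_metric(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Build one line of Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(line: str, host: str = "localhost", port: int = 2003,
                timeout: float = 2.0) -> None:
    """Send a formatted line to a Carbon plaintext receiver.

    The host and port are assumptions; 2003 is Carbon's default
    plaintext port, but a given installation may differ.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Hypothetical metric name, for illustration only.
    line = format_metric("frontend.requests.count", 42)
    print(line, end="")
    # send_metric(line)  # uncomment when a Carbon receiver is reachable
```

Because the format is a single line of text, any application or script that can open a socket can record a data point, which is the kind of openness the list above asks for.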
Question the tools
Brute force?




http://www.flickr.com/photos/epublicist/3546059144
The safe option?




http://www.flickr.com/photos/alicebartlett/2361209195
Unintuitive?




http://www.flickr.com/photos/merlijnhoek/2841785343
Imposing a flawed model?
http://www.flickr.com/photos/evansville/8953838/
Too difficult / no progress?
http://www.flickr.com/photos/ginja_andy/4165849136/
Nagios


•   the “IBM” of monitoring tools

•   compromise over quantity and frequency of checks

•   < insert your criticism of Nagios here >
Zabbix


•   metric collection tightly coupled to monitoring tool

•   confusing UI with poor visualisation

•   needed brute force to make limited API work
The ‘right’ thing
• measure everything
• measure frequently
• measure each data point once
• input and output must be open
Winning the metrics battle
don’t compromise
Be ambitious
http://www.flickr.com/photos/mugley/2961131550




                                 Throw work away
Draw your dream
http://www.flickr.com/photos/sk8geek/7358702704




                             Get as far as you can
[Diagram]
Top row: screens, users
Existing pieces: Etsy dashboard, graphite, FITB, ganglia
Undecided pieces: db? alerting? api? SNMP? syslog?
Also shown: message queue
Bottom row: network, hosts, applications
Develop missing pieces




              http://www.flickr.com/photos/kalexanderson/5969012589
[Diagram]
Top row: screens, users
Components: Etsy dashboard, graphite, FITB, ganglia, ganglia-api, ganglia alerts, syslog alerts, SNMP alerts, message queue, mongodb, alerta, elasticsearch
Bottom row: network, hosts, applications
Guardian Management
https://github.com/guardian/guardian-management
Ganglia API
https://github.com/guardian/ganglia-api
Alerta
https://github.com/guardian/alerta
Current stack
• Ganglia
• FITB
• Graphite
• Etsy dashboards
• Guardian management
  https://github.com/guardian/guardian-management
• Guardian ganglia-api
  https://github.com/guardian/ganglia-api
• Guardian alerta
  https://github.com/guardian/alerta
Keep learning
we are not there yet
Watch the cultural changes
detecting
diagnosis
diagnosis
performance testing
confirmation
#monitoringsucks
➡ Prioritise
➡ Understand the real problem
➡ Question the tools
➡ Be ambitious
➡ Keep learning
tools can change culture
Thank you
http://github.com/guardian
http://gu.com/p/3ap5f
Simon Hildrew, @sihil, simon.hildrew@guardian.co.uk
Nick Satterly, @nicksatterly, nick.satterly@guardian.co.uk
