Monitoring:
Doing it the right way
Saving your sanity, making the world a better place.
John-Daniel Trask
Raygun
Who is this person?
Funny voice.
Weird name.
What’s his deal?
What we’re covering
Doing monitoring the right way.
Getting started, but also helping identify potential issues with your current monitoring.
Getting started with monitoring
Why monitor things?
• You’re not employed to write code.
Business value?
• I got a CS degree mate, not an MBA
Framework
1. Do a best-efforts analysis of what to monitor
• Bad things
• Good things
• Limit to a sprint or two of effort, you won’t get it perfect.
2. Perform post mortems to identify gaps in your monitoring
3. Update/improve monitoring based on findings
4. GOTO 2
Getting started
1. Something is better than nothing.
2. You can go a long way with some simple tools
Doing monitoring right
Metrics & Monitoring
• Metrics are a given value or measure.
• Monitoring encapsulates everything.
Metric: error rate over time
Full monitoring: full story about an error
Monitoring vs. Observability
• Is there a difference?
User
Server
Application
Know what to measure
• You could track almost anything
• Crash reporting
• JavaScript log aggregation
• Metrics server (statsd)
• Alerting and pager tools
• Dashboarding tools
• Usage monitoring
• Real User Monitoring
• Structured and unstructured logging
• Uptime monitoring
• Network monitoring
• Application performance monitoring
• Wire-level monitoring
• Server monitoring
• Canary logging
• Log aggregation service
• Distributed tracing
• Intrusion detection monitoring
• Employee device monitoring
• Cloud metrics from cloud provider
• Security monitoring
• Custom event tracking
• Advanced visualizing tooling
• Deployment tracking
• Infrastructure change monitoring
• User navigation and click tracking monitoring
• Infrastructure spend monitoring
The obvious
• Errors & error rate
• Server performance
• Requests per second per service
• Database call times
What about the less obvious?
• Back to basics: business value means users!
Amazon example
• When is the page loaded?
What about the less obvious?
• Cost to serve each customer
• Feature use tracking to double down on what customers do the most
• Good things
• Any you’d add?
Getting the most from monitoring
Connect the dots
• Connect all your data together
• Connect teams
Information Radiators
• A fancy way of saying TV
Averages are lies
• Yet so many monitoring tools focus on them
On average, everyone here is worth $900m.
Quantiles
• Median
• P90
• P99
• P99.9
P25 P75
Why are quantiles hard?
• You need to store everything
Common monitoring mistakes
Common mistakes
• Only measuring your servers
Common mistakes
• Only measuring the server
Common mistakes
• Saving money by flying blind
Common mistakes
• Bad sampling of data
Common mistakes
• Building it yourself
Common mistakes
• Making it difficult to add to new systems
Common mistakes
• Making it difficult to consume the data
Common mistakes
• Just buying/installing a tool doesn’t help
Common mistakes
• Not getting out of the building
Common mistakes
• NEW: Compliance!
Common mistakes
• Anyone have a mistake they’d love to share?
References & Links
• Observability vs. Monitoring: https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
• Coda Hale, Metrics, Metrics Everywhere: https://www.youtube.com/watch?v=czes-oa0yik
• Google Site Reliability Book: https://landing.google.com/sre/
• Developers are your GDPR risk: https://jdtrask.com/post/software-developers-are-your-biggest-gdpr-risk.html
• Netflix tech blog: https://medium.com/netflix-techblog/
Questions?
Thank you for coming!
@traskjd
@raygunio
Raygun.com (I also have some swag)


Editor's Notes

  • #3: I’m John-Daniel Trask, or JD to everyone. My first name is two names. I’ve loved code since the age of 9, with more than 25 years of coding away any chance I got. I’m a 10-year Microsoft MVP, a distinguished alumnus, and was awarded Wellingtonian of the Year in science and technology. I have VM snapshots of various machines, and thought it amusing that I was writing monitoring tools when I was in my early teens (“Console”, which would track everything). I have been running businesses since high school and university. At high school I sold “browser privacy tools” to classmates… In 2013 we launched Raygun, a software crash reporting product. In 2015, a Real User Monitoring product. And in April we announced our innovative approach to APM. We’re processing billions of data points while I’m standing here. A lot of my learnings are from our own experience in monitoring, but also from conversations with customers.
  • #4: Reminder, in case you’re in the wrong room or can’t remember what this talk was going to be about. The target is more for folks getting started, but I aim to provide value even to the folks already focusing on monitoring in their org. The slides will be posted online. Easiest way to get them once posted: follow me on Twitter: @traskjd. This is about monitoring your software, not everything else (e.g. osquery for monitoring your team machines etc.).
  • #5: How should we be thinking about monitoring? Here’s how to get started, how to think about monitoring and even if you have monitoring in place, hopefully this challenges your thinking about what monitoring is really about.
  • #6: Coda Hale: You’re not employed to code, you’re employed to create business value.
  • #7: What is business value? - Adding a new feature that customers want - Improving an existing feature to please customers - Reducing bugs that annoy customers - Making our software faster so it doesn’t annoy our customers - Making our site look better (could be worse!) to please customers. What is the common thread? Customers. I talk about ‘we write code for human beings’, yet most of us rarely think about the user, or worse – hold them in disdain.
  • #8: This is a basic getting started framework. Fact is, there’s so much stuff out there to help. Look at Raygun, we do 3 things now – CR, RUM, APM. We still get asked about logs, custom metrics, uptime monitoring, security reporting, statsd endpoints, wire-level monitoring.
  • #9: Big one for Raygun was StatsD.
  • #10: This was what got us excited – so easy to start instrumenting our code.
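A rough sketch of why StatsD makes instrumentation feel cheap: a metric is one short plain-text line sent over UDP, so a client fits in a few lines. The host, the port (8125 is the conventional StatsD default), and the metric names below are assumptions for illustration, not Raygun's actual setup.

```python
import socket


def format_metric(name: str, value: float, metric_type: str) -> bytes:
    """Build one StatsD line, e.g. b'checkout.errors:1|c'."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget over UDP; instrumentation should never crash the app."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, (host, port))
    except OSError:
        pass  # deliberately swallow network errors in this sketch


# Hypothetical metric names: a counter ("c") and a timer in milliseconds ("ms").
send_metric(format_metric("checkout.errors", 1, "c"))
send_metric(format_metric("checkout.db_call", 42, "ms"))
```

Because UDP is connectionless, the calls return immediately even when no StatsD server is listening, which is a large part of why adding a metric does not feel like a real cost.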
  • #11: Metrics are great for spotting trends, or issues, but they don’t tell you the why or how. The “what’s broken” indicates the symptom; the “why” indicates a (possibly intermediate) cause. “What” versus “why” is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.
  • #13: While here’s the full story, the data behind the metric. Helping me as a developer figure out the HOW and the WHY, so I can resolve the issue.
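To make the metric-versus-full-story distinction concrete, here is a minimal sketch. The `RequestEvent` type, its fields, and the sample error text are all hypothetical; the point is only that the same events yield both a number over time and the detail behind it.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RequestEvent:
    minute: int       # minutes since the start of the window
    ok: bool
    detail: str = ""  # hypothetical: error message plus user/browser context


def error_rate_by_minute(events):
    """Return {minute: fraction of requests that failed in that minute}."""
    totals, failures = Counter(), Counter()
    for event in events:
        totals[event.minute] += 1
        if not event.ok:
            failures[event.minute] += 1
    return {minute: failures[minute] / totals[minute] for minute in sorted(totals)}


events = [
    RequestEvent(0, True),
    RequestEvent(0, True),
    RequestEvent(1, True),
    RequestEvent(1, False, "TimeoutError in payment step, user=1234, browser=Firefox"),
]

rates = error_rate_by_minute(events)              # the metric: what and when
stories = [e.detail for e in events if not e.ok]  # the full story: why and how
```

The `rates` dictionary is enough to spot a spike; only the retained `detail` lets a developer work out what to fix.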
  • #14: There’s a discussion going on about these two, whereby the basics seem to be that observability is a superset of monitoring. Twitter defined observability as: monitoring; alerting/visualization; distributed systems tracing infrastructure; log aggregation/analytics. However, I count all of that as monitoring. https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
  • #15: Something at each level. Doesn’t need to be perfect, but shouldn’t lie to you (more on this later!). Why have I ordered it this way? The user is the most important. If they aren’t happy, we aren’t getting paid. Best to track that most. The application helps understand things that are likely to impact the user. Then server monitoring. But isn’t server monitoring super important? It is, but oftentimes its value is in correlating to user monitoring. For example, measure the user’s load experience; if it’s slow, look at the server data being correlated with it. Maybe it’s a sign of maxed out
  • #16: Next slide
  • #17: Look at this, here’s just some stuff we could be doing…. So let’s get real. It’s why my framework is to only do some at the start and then build it up over time. Trying to handle everything will waste a lot of time, money and won’t help. You’ll still find issues (kind of like 100% code coverage in unit tests – you still have bugs)
  • #18: I’m biased, but errors are very easy to add and a high-value thing to track. They are literally where you crap all over your customer. We see this: “we don’t use this anymore”, but they have 68,000 users a month getting errors… I wonder what the CEO would think about the team not bothering with 68,000 customers being let down each month. It also gives you the ammunition you need to ask for time to pay down technical debt, which is common, but engineers typically get asked to keep doing feature development.
  • #19: While the items that I listed impact users, we also want to be creative and think about the non-obvious.
  • #21: Forget about the “well technically”, which is common for us engineers. Think about the business value, the end user. That changes what we measure!
  • #22: There are lots of things that aren’t immediately obvious. However, they can create enormous business value. Cost to serve is a huge one for many earlier-stage organizations. If you’re spending more to provide the service than the customer pays, you won’t be around very long. This is a number typically managed by VPs or higher, but helping them is never a bad idea. It also leads to helping understand the cost to scale. I’m sure there are some examples in the audience? What’s a thing you monitored and were surprised by?
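As a back-of-the-envelope illustration of cost to serve, the arithmetic is just spend divided by customers, compared with what each customer pays. Every figure below is made up for the example.

```python
def cost_to_serve(monthly_infra_cost: float, active_customers: int) -> float:
    """Average infrastructure cost per active customer for the month."""
    return monthly_infra_cost / active_customers


# Hypothetical figures for illustration only.
per_customer = cost_to_serve(monthly_infra_cost=12_000.0, active_customers=400)
price_per_customer = 25.0
margin = price_per_customer - per_customer

print(f"cost to serve: {per_customer:.2f} per customer")
print(f"margin:        {margin:.2f} per customer")
```

A negative margin here is the "won't be around very long" case. Note this is itself an average, so a per-customer breakdown by actual usage would be the natural next step.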
  • #23: Getting the most out of your investment
  • #24: Connect your data together. The key is often being able to easily correlate data across different monitors. For example, seeing a response time start exploding and rapidly identifying if there’s an activity issue on your web server, the underlying database, one of the caches, etc. Connect your teams. One of the biggest wins we see is making monitoring more than just an engineering or SRE concern. Being able to lift error reports into Jira is one example – it connects product and project managers and helps them work how they like to, but in collaboration with engineering.
  • #25: TVs. Just like I believe whiteboards are better than almost any digital equivalent, getting dashboards of live data on the wall is amazing. Suddenly key metrics become part of the water cooler chat. Jump to next slide.
  • #27: Averages are lies. Why do so many tools in this area use them? Because it’s super cheap. But a cheap lie doesn’t make it a good lie.
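The "everyone here is worth $900m" slide can be reproduced with a few lines. The room below is an assumption chosen to land near that figure: 99 people worth $100k each and one person worth $90bn.

```python
from statistics import mean, median

# Hypothetical room: 99 people worth $100k each, plus one person worth $90bn.
net_worths = [100_000] * 99 + [90_000_000_000]

print(f"mean:   ${mean(net_worths):,.0f}")
print(f"median: ${median(net_worths):,.0f}")
```

The mean comes out at roughly $900m, a value that describes nobody in the room, while the median stays at $100k.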
  • #29: Quantiles help us understand distribution
  • #30: Bell Curve - What we’re taught distributions look like. - This shows the median and the 25% and 75% - This is kind of bullshit. Think back to the Gates example, it ain’t a bell curve distribution. It’s almost always the same in software.
  • #31: Actual distribution - This is more common - Sometimes you may even see a lump near the end - Understanding outliers is key to better monitoring
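A small sketch of reading quantiles off a long-tailed distribution, using the nearest-rank method (one of several percentile definitions; monitoring tools may use others). The timings are invented: mostly fast requests with a slow tail.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


# Hypothetical response times in ms: mostly fast, with a slow tail.
timings = sorted([50] * 900 + [200] * 90 + [2_000] * 9 + [30_000])

for p in (50, 90, 99, 99.9):
    print(f"P{p}: {percentile(timings, p)} ms")
print(f"max:  {timings[-1]} ms")
print(f"mean: {sum(timings) / len(timings):.1f} ms")
```

With this data the mean (111 ms) is more than double the median (50 ms) and matches no request anyone actually experienced, while the upper percentiles and the maximum expose the tail.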
  • #32: Why does more tooling not support this? You need to store A LOT of data, and you need to then look at the % points after sorting it. This gets very slow. Example: 100m events, which is not actually a lot. 8 bits in a byte, 64-bit numbers: you’re loading 762MB of data into memory, sorting it and taking single values at positions. Even if 32-bit, it’s a lot of data, but remember – 100m events is not that much when it comes to machine data!
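The 762MB figure can be checked directly. This only counts the raw values (megabytes here meaning 1024 × 1024 bytes); any per-value overhead in a real runtime would come on top.

```python
EVENTS = 100_000_000
BYTES_PER_MIB = 1024 * 1024

for bits in (64, 32):
    total_bytes = EVENTS * bits // 8
    print(f"{bits}-bit values: {total_bytes / BYTES_PER_MIB:,.1f} MiB")
```

That is about 763 MiB for 64-bit values and about 381 MiB for 32-bit values, before any sorting work has even started.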
  • #34: Getting the most out of your investment
  • #35: What happens on your server is not what happens to the user. Ensure you track the customer experience. Note about RUM and what we see with today’s very heavy JS frameworks.
  • #36: Noticing a trend here? I’m big on making sure we always focus on the user.
  • #38: Not uncommon to see tech teams try and avoid the costs associated with monitoring. They might only monitor some things, or only a few servers. This causes problems. Also, asking for money is easy if you are connecting it to the business value. Noticing a pattern here? 
  • #39: Sampling has a place, but be wary around your tools. Example: ecommerce provider with 1 server, costing 10% of all sales. Another CR tool was sampling but buried that note in their docs, so customer couldn’t see the issue
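A deliberately contrived sketch of how naive sampling can hide exactly the problem you care about. It is not a reconstruction of the customer story above: the server names, the round-robin routing, and the keep-every-10th rule are all assumptions picked to make the failure mode obvious.

```python
SERVERS = [f"web-{n:02d}" for n in range(10)]
BROKEN = "web-07"  # hypothetical: every request routed to this server fails

# 10,000 requests spread round-robin across the ten servers.
requests = [(i, SERVERS[i % len(SERVERS)]) for i in range(10_000)]


def failed(server: str) -> bool:
    return server == BROKEN


true_rate = sum(failed(s) for _, s in requests) / len(requests)

# Naive systematic sampling: keep every 10th request.
sample = [(i, s) for i, s in requests if i % 10 == 0]
sampled_rate = sum(failed(s) for _, s in sample) / len(sample)

print(f"true error rate:    {true_rate:.0%}")
print(f"sampled error rate: {sampled_rate:.0%}")
```

One request in ten fails, yet the sample only ever sees `web-00` and reports no errors at all. Real tools sample in less obviously broken ways, but the question to ask of any tool is the same: what is being dropped, and does the dashboard tell you?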
  • #40: Always, ALWAYS takes longer than you expect. Not a sales pitch, but if I’ve spent $10m building a product, tell me how you’re going to do it yourself in six months? I want to hire you. Also, statistics can be very hard. Also, it introduces the concern that maybe the bug is in the monitoring tools. There are great open source projects also, but consider the TCO of now managing that internally. DOES BUILDING IT YOURSELF CREATE BUSINESS VALUE? No. Unless you are Netflix etc.
  • #41: Make it easy to surface statistics, monitor data, etc. If it’s difficult, it likely won’t be added when the time pressure is on. Similar impact as with Unit Tests, oftentimes it won’t be done unless somebody else has already laid all the groundwork with mocks, fakes etc. Make it so easy that it’s not considered a real cost to add (see: impact of StatsD)
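One way to make adding a measurement close to free is to reduce it to a single line at the call site. The sketch below uses a decorator; the `recorded` list is a stand-in for whatever metrics client a team really uses, and the metric and function names are hypothetical.

```python
import functools
import time

recorded = []  # stand-in for a real metrics client


def timed(metric_name):
    """Decorator that records how long each call takes, in milliseconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs whether the call returned or raised.
                elapsed_ms = (time.perf_counter() - start) * 1000
                recorded.append((metric_name, elapsed_ms))
        return wrapper
    return decorator


@timed("orders.load")  # hypothetical metric name
def load_orders(customer_id):
    return [f"order-{customer_id}-{n}" for n in range(3)]


orders = load_orders(42)
```

Once the groundwork exists, instrumenting a new function costs one decorator line, which is the kind of friction level that survives time pressure.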
  • #42: Raygun story of CTO’s pet project: error tracking, that almost nobody in the business can use. Did some magical things, shame only one person in this company of thousands actually could use the thing… Other story: one customer had to employ a full time person to teach the team how to use dashboards! wtfbqq
  • #44: We see this all the time, and it’s frustrating. Raygun story: The highest value thing we can do, is hold training sessions with the team. Story of Board Meetings (rare, but should be common). Just installing it is kind of like buying your pain killers but never actually using them when in pain.
  • #45: Remember how almost everything goes back to fellow humans? Look, I know it’s awesome coding away. Raygun Story: Events, taking engineers rather than sales people. 180 degree change. See the impact, feel the pain. Next-level engineer.
  • #46: Welcome to GDPR. Where all your ‘I will build this or cobble it together myself’ could cost your company 4% of revenue when you’re audited. Youch! Yet, I keep seeing this, and I think it’s the biggest threat to businesses in relation to compliance.
  • #47: SUM UP WHAT WE COVERED