Systems
Introspection
A story about the tools we use to determine if the system is
running properly
Me
@andrewhowdencom
Software engineer @
Sitewards
I built much of this stuff
Anything useful will be posted to Twitter.
Please Interrupt.
Please Contribute!
Thanks <3
Behrouz · Anton
Problem #1: Node is
dead
It starts with an alert
What is “Down” even?
It’s down
ALERT InstanceDown
IF (probe_success{job="blackbox"} == 0) or (probe_success{job="auth_blackbox"} == 0)
FOR 5m
ANNOTATIONS {summary="Address {{ $labels.instance }} appears to be down. See https://wiki.sitewards.net/index.php/Instance_Down"}
“probe_success” means:
- The site responded within 5 seconds
- With a 200 status code
- Over a valid SSL certificate
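The three conditions above can be sketched as a single check. This is a minimal illustration, not the blackbox exporter's implementation; `ProbeResult` and its field names are hypothetical, and only the 5 second limit and the 200 status code come from the slide.

```python
from dataclasses import dataclass

PROBE_TIMEOUT_SECONDS = 5.0  # from the slide: "responded within 5 seconds"


@dataclass
class ProbeResult:
    """Hypothetical container for what a probe observed."""
    status_code: int        # HTTP status returned by the site
    elapsed_seconds: float  # time until the response arrived
    tls_valid: bool         # whether the certificate validated


def probe_success(result: ProbeResult) -> int:
    """Return 1 when all three conditions hold, else 0 (like the metric)."""
    ok = (
        result.elapsed_seconds <= PROBE_TIMEOUT_SECONDS
        and result.status_code == 200
        and result.tls_valid
    )
    return 1 if ok else 0


# A fast, healthy response versus one that took too long.
print(probe_success(ProbeResult(200, 0.4, True)))
print(probe_success(ProbeResult(200, 7.2, True)))
```

The alert then only has to compare this value with 0 for five minutes; everything about "what is down" lives in the probe.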
Application level failure
What else is happening?
I see you, full disk
Problem Solved
$ sudo ncdu
$ sudo rm -rf /var/lib/docker
- Easy to understand
- Open to all developers
- Gives historical context of
problems
- Allows setting defined alerts for
known error conditions
Finding full disks
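For a feel of what the disk metric boils down to, here is a small sketch using only the Python standard library. It is not how the node exporter works internally; the 90% threshold and the function names are assumptions made for illustration.

```python
import shutil


def disk_used_fraction(total: int, free: int) -> float:
    """Fraction of the filesystem in use, between 0.0 and 1.0."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - free) / total


def is_nearly_full(path: str = "/", threshold: float = 0.9) -> bool:
    """True when the filesystem holding `path` is at or above the threshold.

    The 0.9 default is an assumed value, not one taken from the talk.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return disk_used_fraction(usage.total, usage.free) >= threshold


if __name__ == "__main__":
    print(is_nearly_full("/"))
```

A one-off check like this has no history. Storing the same number as a time series is what lets you see the disk filling up before it is full.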
Collects the metrics,
stores them in its
database
Fires alerts (via
AlertManager)
Prometheus
Makes pretty
dashboards
Grafana
Sends alerts to my
phone. Pretty cheap.
Pushover
Time series monitoring stack
Prometheus
Master
Prometheus
+
Node
Exporter
Grafana
Pushover
Lessons (Time series data)
- It takes time to learn to use time series data
- It’s worthwhile
- Dashboards are a good intro
- The Prometheus query tool is where you end up
Lessons (Alerts)
Alerts need to be:
- Specific: they need to say exactly what’s wrong
- Actionable: if you’re ignoring them, they’re not an alert, they’re a shitty log
- Relevant: if you’re the wrong person for the alert, find the right person
Alert fatigue kills (systems)!
Get time
series
data!!
Problem #2: Critical
form is broken
It starts with a phone call
“Uhm the diesel
checkout seems to be
broken”
What’s the next step?
Logs
{
"time": "2018-11-21T19:23:39+00:00",
"type": "magento",
"environment": "production",
"host": "www1.nope.de",
"service": "magento",
"pid": "",
"request_id": "",
"payload": {
"version": "1.0.0",
"severity": "CRIT",
"store_code": "default",
"request_url": "",
"remote_address": "91.137.96.100",
"file": "web/app/code/core/Mage/Core/Block/Template.php",
"line_number": "243",
"message": "Not valid template file:frontend/base/default/template/nope/customer/account/dashboard/hello.phtml"
}
}
Querying logs by error level
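Once logs look like the entry above, querying by error level is a filter on one field. The sketch below does this locally over JSON lines; it is not the Google Cloud Logging query interface, and the sample entries are made up for illustration (only the `payload.severity` field name comes from the log shown).

```python
import json


def filter_by_severity(lines, severity):
    """Yield parsed log entries whose payload.severity equals `severity`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        payload = entry.get("payload") if isinstance(entry, dict) else None
        if isinstance(payload, dict) and payload.get("severity") == severity:
            yield entry


# Illustrative sample: two JSON entries and one plain text line.
sample = [
    '{"service": "magento", "payload": {"severity": "CRIT", "message": "Not valid template file"}}',
    '{"service": "magento", "payload": {"severity": "INFO", "message": "ok"}}',
    "12:58:00 Hello, World",
]

for entry in filter_by_severity(sample, "CRIT"):
    print(entry["payload"]["message"])
```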
I see you, faulty template file*
Problem Solved
(Rolled back release)
- Does not require access to
production
- Can be searched en masse by
log attributes
- Faults can be automatically
detected
Finding Faults
Sends logs to Google
Cloud
fluentbit
Analyses Logs
Google Cloud
Logging
Collects all logs on
machine to a central
location via Syslog
interface
systemd-journald
Requires structured (JSON) logging
12:58:00 “Hello, World”
+ Easy to read
+ What you’re used to
- Hard to parse
- Hard to handle when there’s
30,000 logs
- Hard to analyse
text/plain
{"time": "12:58:00", "message": "Hello, World"}
+ Easy to analyse/handle
+ Can be read with `jq`
+ Well supported
- Hard to read in less without jq
application/json
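Moving from the text/plain line to the application/json one is mostly a formatter change. Here is a minimal sketch with Python's standard `logging` module; the field names `time`, `severity` and `message` mirror the examples in this deck, while the `JsonFormatter` class and the `demo` logger name are hypothetical.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "severity": record.levelname,
            "message": record.getMessage(),
        }
        return json.dumps(entry)


handler = logging.StreamHandler()  # writes to stderr by default
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("demo")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid a second, plain text copy from the root logger

logger.info("Hello, World")
```

The application code still calls `logger.info(...)` as before; only the output format changes, which is what makes the logs easy to forward and analyse.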
Lessons
- Finding a structured log format to use consistently is hard
- I now adopt Google’s
- It can take a while to get used to JSON logging
Log in
JSON and
forward!!
Problem #3:
Cascading failure
due to 3rd party
It starts with an alert
Application level failure (Again)
What else is happening?
Logs are showing … MySQL
connections?
But connections are sleepy
Ah, found you
Problem Badly
solved
Dropped
`default_socket_timeout` to 5s
- Took an hour to find
- No easy way of separating out a
multitude of issues
- Issues hiding behind issues
Transaction Tracing
Finding Long Requests
Collects and reviews
transaction traces
jaeger
Stores transaction
traces
Cassandra
A PHP extension that
automatically
instruments and
propagates tracing
context
molton
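To make "instruments and propagates tracing context" concrete, here is a toy tracer. It is not the API of jaeger or of the PHP extension; `Tracer`, `Span` and the span names are all hypothetical. It shows the two ideas that matter: every span records how long it took, and child spans inherit the trace ID of their parent.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    name: str
    trace_id: str             # shared by every span in one request
    span_id: str              # unique to this span
    parent_id: Optional[str]  # None for the root span
    start: float = 0.0
    duration: float = 0.0


class Tracer:
    """Single-threaded toy tracer; real tracers also handle concurrency."""

    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str):
        parent = self._stack[-1] if self._stack else None
        current = Span(
            name=name,
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            start=time.monotonic(),
        )
        self._stack.append(current)
        try:
            yield current
        finally:
            current.duration = time.monotonic() - current.start
            self._stack.pop()
            self.finished.append(current)


tracer = Tracer()
with tracer.span("checkout"):
    with tracer.span("third_party_api"):
        time.sleep(0.01)  # stand-in for a slow external call

for s in tracer.finished:
    print(f"{s.name}: {s.duration:.3f}s (parent={s.parent_id})")
```

With spans like these, a slow third party shows up as one long child span, which is the hour of searching from Problem #3 reduced to reading one trace.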
Lessons
- New Relic does this
- You don’t need it until you really do
- It becomes more relevant for distributed systems
- If you’re making an API call, that’s a “distributed” system
We will do this soon!
In Summary
Instrumentation is good!
Time Series Data · Structured &
Aggregated Logging · Transaction
Traces
I didn’t get what you
just said
Brian Brazil and Fabxc are very nice maintainers. There’s also
a mailing list. Lastly, there’s a tonne of stuff on YouTube.
What questions do you have?
Find all information at:
https://git.io/TODO
Give feedback at:
https://goo.gl/forms/NOPE
