Systems Introspection

A story about the tools we use to determine if the system is
running properly

Me
@andrewhowdencom
Software engineer @
Sitewards
I built much of this stuff
Anything useful will be posted to Twitter.

It’s down
ALERT InstanceDown
  IF (probe_success{job="blackbox"} == 0) or (probe_success{job="auth_blackbox"} == 0)
  FOR 5m
  ANNOTATIONS {summary="Address {{ $labels.instance }} appears to be down. See https://wiki.sitewards.net/index.php/Instance_Down"}
“probe_success” means:
- The site responded within 5 seconds
- With a 200 status code
- Over a valid SSL certificate
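
The three conditions above can be sketched as a small check. This is only an illustration of what the probe asserts, not how the blackbox exporter is implemented; the function name `probe_success` and the injectable `opener` parameter are assumptions made for this example. Note that `timeout` here bounds each blocking socket operation rather than the request as a whole.

```python
import urllib.request


def probe_success(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return 1 if the probe passes, 0 otherwise (mirroring the metric's values)."""
    try:
        # For https:// URLs, urlopen verifies the certificate chain and the
        # hostname by default, so an invalid certificate raises an error.
        with opener(url, timeout=timeout) as response:
            return 1 if response.status == 200 else 0
    except (OSError, ValueError):
        # URLError, socket timeouts and ssl.SSLError are OSError subclasses;
        # ValueError covers a malformed URL.
        return 0
```

The `opener` argument exists only so the check can be exercised without touching the network.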

Problem Solved
$ sudo ncdu
$ sudo rm -rf /var/lib/docker
- Easy to understand
- Open to all developers
- Gives historical context of problems
- Allows setting defined alerts for known error conditions

Finding full disks
- Prometheus: Collects the metrics and stores them in its database. Fires alerts (via Alertmanager).
- Grafana: Makes pretty dashboards.
- Pushover: Sends alerts to my phone. Pretty cheap.
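
The kind of condition a disk-space alert evaluates can be sketched as below. This is a standalone illustration, not the rule used in the stack above; the function names and the 10% threshold are assumptions chosen for the example.

```python
import shutil


def free_fraction(total_bytes, free_bytes):
    """Fraction of the filesystem still free, between 0.0 and 1.0."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def disk_alert(total_bytes, free_bytes, min_free=0.10):
    """True when free space has dropped below the (assumed) 10% threshold."""
    return free_fraction(total_bytes, free_bytes) < min_free


if __name__ == "__main__":
    usage = shutil.disk_usage("/")  # named tuple with total, used, free
    print(disk_alert(usage.total, usage.free))
```

A monitoring system adds what this one-off check lacks: the value is sampled continuously, so the history leading up to the full disk is kept.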

Time series monitoring stack
[Diagram: Prometheus Master, Prometheus + Node Exporter, Grafana, Pushover]

Lessons (Time series data)
- It takes time to learn to use time series data
- It’s worthwhile
- Dashboards are a good intro
- The Prometheus query tool is where you end up

Lessons (Alerts)
Alerts need to be:
- Specific: they need to say exactly what’s wrong
- Actionable: if you’re ignoring them, they’re not an alert, they’re a shitty log
- Relevant: if you’re the wrong person for the alert, find the right person
Alert fatigue kills (systems)!

Problem #2: Critical form is broken

It starts with a phone call
“Uhm the diesel checkout seems to be broken”

Logs
{
  "time": "2018-11-21T19:23:39+00:00",
  "type": "magento",
  "environment": "production",
  "host": "www1.nope.de",
  "service": "magento",
  "pid": "",
  "request_id": "",
  "payload": {
    "version": "1.0.0",
    "severity": "CRIT",
    "store_code": "default",
    "request_url": "",
    "remote_address": "91.137.96.100",
    "file": "web/app/code/core/Mage/Core/Block/Template.php",
    "line_number": "243",
    "message": "Not valid template file:frontend/base/default/template/nope/customer/account/dashboard/hello.phtml"
  }
}

I see you, faulty template file*

Problem Solved
(Rolled back release)
- Does not require access to production
- Can be searched en masse by log attributes
- Faults can be automatically detected
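
Searching by log attributes could look roughly like the sketch below, which filters JSON log lines shaped like the entry shown earlier. The function `search_logs` and the sample lines are made up for this example; a hosted log service does the same kind of matching at much larger scale.

```python
import json

# Hypothetical sample lines, loosely shaped like the log entry shown earlier.
SAMPLE_LINES = [
    '{"environment": "production", "payload": {"severity": "CRIT", "message": "Not valid template file"}}',
    '{"environment": "production", "payload": {"severity": "INFO", "message": "ok"}}',
    "this line is not JSON",
]


def search_logs(lines, **wanted):
    """Yield parsed entries whose top-level or payload fields match `wanted`."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        payload = entry.get("payload")
        # Flatten one level so payload fields can be matched directly.
        fields = {**entry, **(payload if isinstance(payload, dict) else {})}
        if all(fields.get(key) == value for key, value in wanted.items()):
            yield entry


if __name__ == "__main__":
    for hit in search_logs(SAMPLE_LINES, severity="CRIT"):
        print(hit["payload"]["message"])
```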

Finding Faults
- fluentbit: Sends logs to Google Cloud.
- Google Cloud Logging: Analyses logs.
- systemd-journald: Collects all logs on the machine to a central location via the Syslog interface.

Requires structured (JSON) logging
text/plain
12:58:00 “Hello, World”
+ Easy to read
+ What you’re used to
- Hard to parse
- Hard to handle when there’s 30,000 logs
- Hard to analyse
application/json
{"time": "12:58:00", "message": "Hello, World"}
+ Easy to analyse/handle
+ Can be read with `jq`
+ Well supported
- Hard to read in less without jq
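
Emitting one JSON object per log line can be sketched with the standard `logging` module. The field names (`time`, `severity`, `message`) and the helper `build_logger` are assumptions for this example, not a fixed schema.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return json.dumps({
            "time": created.isoformat(),
            "severity": record.levelname,
            "message": record.getMessage(),
        })


def build_logger(stream, name="example"):
    """Return a logger that writes JSON lines to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate plain-text output
    return logger


if __name__ == "__main__":
    build_logger(sys.stdout).info("Hello, World")
```

Because every line is valid JSON, the output can be piped straight into `jq` or shipped to a log aggregator without a custom parser.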

Lessons
- Finding a structured log format to use consistently is hard
- I now adopt Google’s
- It can take a while to get used to JSON logging

Problem #3: Cascading failure due to 3rd party

Application level failure (Again)

Logs are showing … MySQL connections?

Problem Badly solved
Dropped `default_socket_timeout` to 5s
- Took an hour to find
- No easy way of separating out a multitude of issues
- Issues hiding behind issues

Finding Long Requests
- jaeger: Collects and reviews transaction traces.
- Cassandra: Stores transaction traces.
- molton: A PHP extension that automatically instruments code and propagates tracing context.
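
What "instrumenting" and "propagating tracing context" mean can be sketched with a toy in-memory tracer. This is not the API of jaeger or of the PHP extension above; the `Tracer` class, the span fields and the header names are all hypothetical, and the sketch is not thread-safe.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Toy tracer; real tracers ship finished spans to a collector."""

    def __init__(self):
        self.finished = []  # completed spans, in finish order
        self._stack = []    # currently open spans, innermost last

    @contextmanager
    def span(self, name, trace_id=None):
        parent = self._stack[-1] if self._stack else None
        span = {
            "name": name,
            # Child spans inherit the trace id of their parent.
            "trace_id": parent["trace_id"] if parent else (trace_id or uuid.uuid4().hex),
            "span_id": uuid.uuid4().hex[:16],
            "parent_id": parent["span_id"] if parent else None,
        }
        self._stack.append(span)
        start = time.perf_counter()
        try:
            yield span
        finally:
            span["duration_s"] = time.perf_counter() - start
            self._stack.pop()
            self.finished.append(span)

    def outgoing_headers(self):
        """Context to attach to an outbound call so the callee can join the trace."""
        current = self._stack[-1]
        return {
            "x-trace-id": current["trace_id"],          # hypothetical header name
            "x-parent-span-id": current["span_id"],     # hypothetical header name
        }
```

Wrapping a request in one span and each third-party call in a child span shows which call accounts for the time, which is the question that took an hour to answer from logs alone.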

Lessons
- New Relic does this
- You don’t need it until you really do
- It becomes more relevant for distributed systems
- If you’re making an API call, that’s a “distributed” system

In Summary
Instrumentation is good!
Time Series Data · Structured & Aggregated Logging · Transaction Traces

I didn’t get what you just said
Brian Brazil and Fabxc are very nice maintainers. There’s also a mailing list. Lastly, there’s a tonne of stuff on YouTube.

What questions do you have?
Find all information at:
https://git.io/TODO
Give feedback at:
https://goo.gl/forms/NOPE
