Monitor everything

Don’t Limit Monitoring to
Infrastructure

- [ ] Introduction
- [ ] Operations Overview
- [ ] Failure Models
- [ ] Demo
- [ ] Best Practices & Recap

docker run who-is-brian
Brian Christner
• Co-Founder 56K.Cloud / SRE
• Docker Captain
• Passionate about Monitoring and
anything with .io domains

OK, give me a business
example

www.your-now.com
A joint collaboration between BMW & Daimler AG

- [ X ] Introduction
- [ ] Operations Overview
- [ ] What to Monitor
- [ ] Demo

• Everything is Automated
• Reduce Costs
• No support calls / support tickets
Ops Paradise

Users Care about 3 Things
• Availability - Is my System Online Yes/No
• Latency - Does it take a long time to access applications x, y, z
• Reliability - Can the user rely on using the application

Brain Based Tools
• We can track 8 objects on
average
• 4 Moving Objects
• Build Dashboards & Tools
accordingly

SRE treats Operations as if it
were a Software Problem
“Hope is not a strategy.”
Traditional SRE saying
www.google.com/sre
SRE (Site Reliability Engineering)

TL;DR - SRE
Latency
Traffic
Errors
Saturation
4 Golden Signals

(Request) Rate: the number of requests, per second,
your services are serving.
(Request) Errors: the number of failed requests per
second.
(Request) Duration: distributions of the amount of
time each request takes.
R.E.D (Microservice Level)
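
As a rough illustration of the Rate, Errors, and Duration definitions above, the sketch below summarizes one window of request records. The `Request` type, the `red_summary` name, and the choice to count HTTP status codes of 500 and above as failures are assumptions made for this example, not something the slides specify.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code returned


def red_summary(requests, window_s):
    """Summarize Rate, Errors, and Duration for one observation window."""
    total = len(requests)
    # Assumption: a status of 500 or above counts as a failed request.
    failed = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 94 is p95.
    # It needs at least two data points, so a single sample is repeated instead.
    cuts = quantiles(durations, n=100) if total >= 2 else durations * 99
    return {
        "rate_per_s": total / window_s,
        "errors_per_s": failed / window_s,
        "p50_s": cuts[49] if cuts else None,
        "p95_s": cuts[94] if cuts else None,
    }
```

Duration is reported as percentiles rather than a single average because the slide describes it as a distribution; an average would hide the slow tail.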

Resource: all physical server functional components (CPUs,
disks, busses, ...)
Utilization: the average time that the resource was busy
servicing work
Saturation: the degree to which the resource has extra
work which it can't service, often queued
Errors: the count of error events
U.S.E (Low Level / Infrastructure)
For every resource, check Utilization, Saturation, and Errors
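
The "for every resource" check can be sketched as a small function that looks at one sample per resource. The `ResourceSample` fields, the `use_check` name, and the 80% utilization limit are hypothetical values chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    name: str          # e.g. "cpu0" or "disk sda" (illustrative labels)
    busy_s: float      # seconds the resource spent servicing work
    interval_s: float  # length of the sampling interval, in seconds
    queued: int        # work items waiting because the resource was busy
    errors: int        # error events counted during the interval


def use_check(sample, util_limit=0.8):
    """Return (utilization, findings) for one resource sample."""
    utilization = sample.busy_s / sample.interval_s
    findings = []
    if utilization > util_limit:  # assumed limit, tune per resource
        findings.append("utilization")
    if sample.queued > 0:         # extra work that could not be serviced
        findings.append("saturation")
    if sample.errors > 0:
        findings.append("errors")
    return utilization, findings
```

Running the same three checks against every resource keeps the review systematic: an empty findings list means the resource can be ruled out quickly.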

- [ X ] Operations Overview
- [ ] What to Monitor
- [ ] Demo

Understanding Failure Models
[Diagram: container platform stack]
• Config Mgt, Monitoring, Logging, CI/CD, ..more.., Images, Networking, Volumes
• CONTAINER PLATFORM: Developer Services, Registry Services, Access Policies, App Lifecycle Management, Automation & Extensibility, Platform Security
• Networking, Orchestration, Storage
• Container Engine
• Operating Systems
• Physical, Virtualization, Public Cloud

Host / Hardware
• CPU
• Memory
• Liveness
• File Descriptors
• Storage Capacity
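
Two of the host-level signals listed here, storage capacity and file descriptors, can be read with the Python standard library alone. This is a minimal sketch for illustration: the function names are made up, the file descriptor count only covers the current process, and it relies on the Linux `/proc` filesystem. In practice an exporter such as Node-Exporter would normally collect these.

```python
import os
import shutil


def storage_used_fraction(path="/"):
    """Fraction of the filesystem holding `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def open_fd_count():
    """Open file descriptors of this process, or None where /proc is missing."""
    fd_dir = "/proc/self/fd"  # Linux-specific location
    if not os.path.isdir(fd_dir):
        return None
    return len(os.listdir(fd_dir))
```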

Networking
• Reachability
• Link Utilization
• File Descriptors
• Storage Capacity
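
Reachability is the simplest of these checks to sketch: try to open a TCP connection and report whether it succeeded. The `is_reachable` name and the two-second default timeout are assumptions for this example; a successful connect only shows that the port accepts connections, not that the service behind it is healthy.

```python
import socket


def is_reachable(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```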

Orchestration
• State
• Deployment Rates
• Capacity
• Scheduling Events

Applications
• CPU
• Memory
• Liveness
• File Descriptors
• Storage Capacity

• Total Downtime: Just under 4
minutes
• 502 error messages total: 12 000
• People affected by the 502 error
who did not get their bargain: 400
Website Down?
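
To put the outage figures above in perspective, the arithmetic below converts them into an availability number and an average error rate. The 30-day month is an assumption made for the calculation, and because the downtime was "just under" 4 minutes, using a full 4 minutes gives a lower bound on both results.

```python
def availability(downtime_s, period_s):
    """Fraction of the period during which the site was up."""
    return 1.0 - downtime_s / period_s


def mean_error_rate(errors, downtime_s):
    """Average number of errors served per second of downtime."""
    return errors / downtime_s


MONTH_S = 30 * 24 * 60 * 60  # assumed 30-day month, in seconds
DOWNTIME_S = 4 * 60          # upper bound: "just under 4 minutes"

print(f"{availability(DOWNTIME_S, MONTH_S):.4%}")  # 99.9907%
print(mean_error_rate(12_000, DOWNTIME_S))          # 50.0
```

Even at better than 99.99% availability for the month, the outage produced roughly 50 error responses per second while it lasted, which is why the user-facing numbers matter more than the uptime percentage.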

- [ X ] What to Monitor
- [ ] Demo

[Diagram: demo monitoring stack on Docker]
• cAdvisor: container metrics
• Node-Exporter: host metrics
• Scraped endpoints
• Push alerts

- [ X ] What to Monitor
- [ X ] Demo

Best Practices
• Start small & increment
• Don’t over-alert yourself
• Set Resource Limits
• Aim for actionable Information
• Run separate from Workload
• Test for Failures
• Know your Failure Models
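
One common way to avoid over-alerting and keep alerts actionable is to fire only when a condition has held for several consecutive samples, so a single spike does not page anyone. The sketch below is a hypothetical helper written for this example; Prometheus alerting rules offer a comparable idea through their `for` field.

```python
def should_alert(samples, threshold, for_samples):
    """True only if the last `for_samples` values all exceed the threshold."""
    if for_samples <= 0 or len(samples) < for_samples:
        return False
    return all(value > threshold for value in samples[-for_samples:])
```

For example, with a threshold of 0.9 and `for_samples=3`, the series `[0.2, 0.95, 0.3]` stays quiet while `[0.91, 0.95, 0.97]` raises an alert.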

Resources
• 56K.Cloud - https://56K.Cloud
• Prometheus - https://github.com/vegasbrianc/prometheus
• Monitoring Labs - github.com/56kcloud/Training/
• Docker Resource Link - https://awesome-docker.netlify.com
• GitLab Dashboards - https://monitor.gitlab.net

Thank You
Brian Christner
brian@56K.cloud
@idomyowntricks
