Monitor Everything
Don’t Limit Monitoring to Infrastructure
- [ ] Introduction
- [ ] Operations Overview
- [ ] Failure Models
- [ ] Demo
- [ ] Best Practices & Recap
docker run who-is-brian
Brian Christner
• Co-Founder 56K.Cloud / SRE
• Docker Captain
• Passionate about Monitoring and anything with .io domains
Zugerberg, Zug, Switzerland
OK, give me a business example
2019
2 Billion++
every Week
Casinos
Industrial machines
Monitor Early
www.your-now.com
A joint collaboration between BMW & Daimler AG
- [ X ] Introduction
- [ ] Operations Overview
- [ ] What to Monitor
- [ ] Demo
- [ ] Best Practices & Recap
Ops Paradise
• Everything is Automated
• Reduce Costs
• No support calls / support tickets
Ops Firefighting
Operational Models
Users Care about 3 Things
• Availability - Is my system online? Yes/No
• Latency - Does it take a long time to access applications x, y, z?
• Reliability - Can the user rely on the application?
Brain Based Tools
• We can track 8 objects on average
• 4 Moving Objects
• Build Dashboards & Tools accordingly
Design vs Experience
SRE
SRE (Site Reliability Engineering)
SRE treats Operations as if it were a Software Problem
“Hope is not a strategy.” (Traditional SRE saying)
www.google.com/sre
TL;DR - SRE
4 Golden Signals
• Latency
• Traffic
• Errors
• Saturation
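A minimal sketch of how the four signals could be summarized for one observation window. The `Request` record, the `golden_signals` helper, and the `in_flight` / `capacity` inputs are hypothetical names chosen for illustration; the median stands in for whatever latency percentile a real setup would track, and only 5xx responses are counted as errors.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code returned


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize one observation window as the four golden signals.

    in_flight / capacity is an assumed stand-in for saturation,
    e.g. busy workers divided by the size of the worker pool.
    """
    saturation = in_flight / capacity
    if not requests:
        return {"latency_s": 0.0, "traffic_rps": 0.0,
                "error_ratio": 0.0, "saturation": saturation}
    failed = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_s": median([r.duration_s for r in requests]),
        "traffic_rps": len(requests) / window_s,
        "error_ratio": failed / len(requests),
        "saturation": saturation,
    }
```

Keeping the output to four numbers per service fits the earlier point about how few objects a person can track on a dashboard.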
R.E.D (Microservice Level)
(Request) Rate: the number of requests per second your services are serving.
(Request) Errors: the number of failed requests per second.
(Request) Duration: the distribution of the amount of time each request takes.
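One way to collect R.E.D. figures is to wrap each service call and record a count, a failure count, and a duration. The sketch below uses only the Python standard library; `RedMetrics`, `observe`, and `summary` are hypothetical names, not part of any monitoring library, and any raised exception is treated as a failed request.

```python
import time
from collections import defaultdict


class RedMetrics:
    """Tracks Rate, Errors, and Duration per service (illustrative only)."""

    def __init__(self):
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)
        self.durations = defaultdict(list)

    def observe(self, service, func, *args, **kwargs):
        """Call func and record the outcome under the given service name."""
        start = time.perf_counter()
        self.requests[service] += 1
        try:
            return func(*args, **kwargs)
        except Exception:
            self.errors[service] += 1
            raise
        finally:
            # Runs on success and failure, so every request has a duration.
            self.durations[service].append(time.perf_counter() - start)

    def summary(self, service, window_s):
        """Return rate and errors per second plus the sorted durations."""
        return {
            "rate_per_s": self.requests[service] / window_s,
            "errors_per_s": self.errors[service] / window_s,
            "durations_s": sorted(self.durations[service]),
        }
```

Durations are kept as a full list because R.E.D. asks for a distribution, not an average; a production setup would more likely use histogram buckets.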
U.S.E (Low Level / Infrastructure)
For every resource, check Utilization, Saturation, and Errors
Resource: all physical server functional components (CPUs, disks, busses, ...)
Utilization: the average time that the resource was busy servicing work
Saturation: the degree to which the resource has extra work which it can't service, often queued
Errors: the count of error events
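The "for every resource" rule can be sketched as a loop over readings. `ResourceReading`, `use_check`, and the 80% utilization threshold are assumptions made for this example; how the readings are gathered is left out.

```python
from dataclasses import dataclass


@dataclass
class ResourceReading:
    name: str
    utilization: float  # fraction of time the resource was busy, 0.0 to 1.0
    saturation: int     # amount of queued work the resource could not service
    errors: int         # count of error events


def use_check(readings, max_utilization=0.8):
    """Return (resource, signal) pairs that deserve a closer look.

    The 0.8 default is an arbitrary illustrative threshold.
    """
    findings = []
    for r in readings:
        if r.utilization > max_utilization:
            findings.append((r.name, "utilization"))
        if r.saturation > 0:
            findings.append((r.name, "saturation"))
        if r.errors > 0:
            findings.append((r.name, "errors"))
    return findings
```

For example, a CPU at 95% utilization with three queued tasks and a disk with two error events would yield three findings, while an idle network interface yields none.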
- [ X ] Introduction
- [ X ] Operations Overview
- [ ] What to Monitor
- [ ] Demo
- [ ] Best Practices & Recap
Understanding Failure Models
(Diagram, repeated on each of the following four slides: CONTAINER PLATFORM, with the labels Developer Services, Registry Services, Access Policies, App Lifecycle Management, Automation & Extensibility, Networking, Orchestration, Storage, Container Engine, Platform Security; Operating Systems; Physical, Virtualization, Public Cloud; Config Mgt, Monitoring, Logging, CI/CD, Images, Networking, Volumes, ..more..)
Host / Hardware
• CPU
• Memory
• Liveness
• File Descriptors
• Storage Capacity
Networking
• Reachability
• Link Utilization
• File Descriptors
• Storage Capacity
Orchestration
• State
• Deployment Rates
• Capacity
• Scheduling Events
Applications
• CPU
• Memory
• Liveness
• File Descriptors
• Storage Capacity
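The per-layer signals from the failure model slides can be kept as a checklist and compared against what is actually being collected. This is only a sketch: the `FAILURE_MODEL` table copies the slide labels, while `coverage_gaps` and the shape of its `monitored` argument are hypothetical.

```python
# Signals per layer, copied from the failure model slides.
FAILURE_MODEL = {
    "Host / Hardware": ["CPU", "Memory", "Liveness",
                        "File Descriptors", "Storage Capacity"],
    "Networking": ["Reachability", "Link Utilization",
                   "File Descriptors", "Storage Capacity"],
    "Orchestration": ["State", "Deployment Rates",
                      "Capacity", "Scheduling Events"],
    "Applications": ["CPU", "Memory", "Liveness",
                     "File Descriptors", "Storage Capacity"],
}


def coverage_gaps(monitored):
    """Return, per layer, the signals that nothing is monitoring yet.

    `monitored` maps a layer name to the set of signals already collected.
    """
    gaps = {}
    for layer, signals in FAILURE_MODEL.items():
        covered = monitored.get(layer, set())
        missing = [s for s in signals if s not in covered]
        if missing:
            gaps[layer] = missing
    return gaps
```

A team that only watches host metrics would see every other layer listed as a gap, which is the case the talk warns against.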
Website Down?
• Total Downtime: Just under 4 minutes
• 502 error messages total: 12 000
• People affected by the 502 error who did not get their bargain: 400
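A quick back-of-the-envelope calculation puts these numbers in perspective. It rounds the outage up to a full 4 minutes and assumes a 30-day reporting window; both are assumptions for illustration, and `incident_impact` is a hypothetical helper.

```python
def incident_impact(downtime_s, failed_requests, window_days=30):
    """Relate a short outage to error rate and to windowed availability."""
    window_s = window_days * 24 * 3600
    return {
        "errors_per_s": failed_requests / downtime_s,
        "availability_pct": 100 * (1 - downtime_s / window_s),
    }


# 4 minutes of downtime and 12 000 failed requests, as on the slide.
impact = incident_impact(downtime_s=4 * 60, failed_requests=12_000)
print(impact["errors_per_s"])                # 50.0
print(round(impact["availability_pct"], 2))  # 99.99
```

Because the real outage was slightly shorter than 4 minutes, both figures are lower bounds. Availability over the window still rounds to 99.99%, yet users saw at least 50 errors per second and 400 of them missed their bargain, so an uptime percentage alone hides the impact.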
- [ X ] Introduction
- [ X ] Operations Overview
- [ X ] What to Monitor
- [ ] Demo
- [ ] Best Practices & Recap
DEMO
(Architecture diagram: a Docker host running containers. cAdvisor provides container metrics and Node-Exporter provides host metrics; the endpoints are scraped, and alerts are pushed.)
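cAdvisor and Node-Exporter both expose their metrics as plain text in the Prometheus exposition format, so a scrape is just a fetch followed by line-by-line parsing. The sketch below parses a small hand-written sample instead of a live endpoint. The sample values are made up, `parse_metrics` is a hypothetical helper, and the parser is deliberately simplified: it ignores timestamps and does not handle escaped quotes inside label values.

```python
import re

# Hand-written sample in the Prometheus text format; values are invented.
SAMPLE = """\
# TYPE node_load1 gauge
node_load1 0.42
# TYPE container_memory_usage_bytes gauge
container_memory_usage_bytes{name="web"} 1048576
container_memory_usage_bytes{name="db"} 4194304
"""

# metric_name, optional {labels}, then the sample value.
LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples
```

In the demo setup the scraping side does this work on a schedule; the sketch only shows what arrives over the wire from each endpoint.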
- [ X ] Introduction
- [ X ] Operations Overview
- [ X ] What to Monitor
- [ X ] Demo
- [ ] Best Practices & Recap
Improve Incrementally
Best Practices
• Start small & increment
• Don’t over-alert yourself
• Set Resource Limits
• Aim for actionable Information
• Run separate from Workload
• Test for Failures
• Know your Failure Models
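One common way to avoid alert fatigue while keeping alerts actionable is to fire only when a condition persists, not on a single spike. A minimal sketch, assuming periodic samples of one metric; `should_alert`, the threshold, and the sample count are hypothetical choices for illustration.

```python
def should_alert(samples, threshold, for_samples):
    """Return True only if the last `for_samples` readings all exceed threshold.

    A single spike is ignored; a sustained breach fires.
    """
    if for_samples <= 0 or len(samples) < for_samples:
        return False
    return all(value > threshold for value in samples[-for_samples:])
```

With a threshold of 0.9 and `for_samples=2`, the series `[0.5, 0.95, 0.6]` stays quiet, while `[0.5, 0.95, 0.97]` fires.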
Resources
• 56K.Cloud - https://56K.Cloud
• Prometheus - https://github.com/vegasbrianc/prometheus
• Monitoring Labs - github.com/56kcloud/Training/
• Docker Resource Link - https://awesome-docker.netlify.com
• GitLab Dashboards - https://monitor.gitlab.net
Thank You
Brian Christner
brian@56K.cloud
@idomyowntricks

Editor's Notes

  • #15: Air quality Problems in casino Top Players Sporting events weather
  • #23: What we won’t hear from customers:
      - I hope that we have more maintenance windows
      - I really wish my application wasn’t so fast
      - My application is far too stable
  • #46:
      - Know your Failure Modes
      - Structured Logs
      - Test for Failures
      - Optimize for MTTR and not Uptime
      - Alerts should be actionable with little digging
      - Catch the Symptom and not the problem