A practical introduction to observability
Nikolay Stoitsev
Engineering Manager @ Halo DX
● Monitoring
● Logging
● Distributed Tracing
Monitoring
Monitoring system components
● Application (several instances)
● Monitoring System and Time Series Database: Prometheus, Graphite, m3db
● Dashboard: Prometheus UI, Grafana
Counter
Counter increase
Timer
Labels
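The three building blocks above (counter, timer, labels) can be sketched in a few lines. This is a minimal in-memory illustration, not a real client library: `MetricsRegistry`, `increment`, and `timer` are hypothetical names, and `search.duration` is an assumed metric name; `search.success` and the `type` label are borrowed from the cardinality slide.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class MetricsRegistry:
    """Hypothetical in-memory registry; a real setup would ship values to a metrics system."""

    def __init__(self):
        self.counters = defaultdict(int)   # (name, labels) -> running total
        self.timings = defaultdict(list)   # (name, labels) -> observed durations in seconds

    @staticmethod
    def _key(name, labels):
        # Labels are part of the series identity, so sort them for a stable key.
        return (name, tuple(sorted(labels.items())))

    def increment(self, name, amount=1, **labels):
        """Counter: a value that only ever goes up."""
        self.counters[self._key(name, labels)] += amount

    @contextmanager
    def timer(self, name, **labels):
        """Timer: records how long the wrapped block took."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[self._key(name, labels)].append(time.perf_counter() - start)


metrics = MetricsRegistry()
with metrics.timer("search.duration", type="Patient"):
    metrics.increment("search.success", type="Patient")
metrics.increment("search.success", type="Exam")
metrics.increment("search.success", type="Patient")
```

Each distinct combination of name and labels becomes its own series, which is why the choice of labels matters.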
What to watch out for?
Cardinality
● search.success, app_version=1, type=Patient
● search.success, app_version=1, type=Exam
● search.success, app_version=2, type=Patient
● search.success, app_version=2, type=Exam
#1. Don’t add high-cardinality tags
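The four series listed above come from 2 app versions times 2 types. A rough sketch of how the count grows, assuming a hypothetical `user_id` label with 10,000 distinct values is added:

```python
# Label values from the slide above.
label_values = {
    "app_version": ["1", "2"],
    "type": ["Patient", "Exam"],
}


def series_count(labels):
    """Number of distinct time series = product of distinct values per label."""
    total = 1
    for values in labels.values():
        total *= len(values)
    return total


low = series_count(label_values)

# Hypothetical high-cardinality label: one value per user.
with_user_id = dict(label_values, user_id=[str(i) for i in range(10_000)])
high = series_count(with_user_id)

print(low)   # 4
print(high)  # 40000
```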
Metrics are not accurate
● The DB engine trades accuracy for faster operations
● Accuracy is lost when operations are performed at a different time resolution
● Accuracy is lost when metrics are archived for long-term storage
#2. Don’t rely on metrics infrastructure for BI
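One way to see the loss of accuracy is to change the time resolution by hand. The sketch below assumes per-second latency samples that are archived as one averaged point per minute; the numbers and the averaging policy are illustrative, not taken from any particular database.

```python
from statistics import mean

# Hypothetical per-second latencies in ms for one minute: steady values plus one spike.
per_second = [100] * 59 + [6000]

# Assumed archive policy: keep a single averaged point per minute.
per_minute = mean(per_second)

print(max(per_second))       # 6000
print(round(per_minute, 1))  # 198.3
```

After downsampling, the 6000 ms spike can no longer be recovered from the stored value.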
Don’t use average values
● Averages hide the outliers
● They don’t represent typical behavior
Use percentiles
● p90 represents the worst experience 90% of the time
● Can measure p90, p95, p99
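A small comparison of the average against percentiles, as a sketch. The latency values are made up, and `percentile` is a hypothetical helper using the nearest-rank method; metrics systems may estimate percentiles differently.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * p / 100))
    return ordered[rank - 1]


# Hypothetical response times in ms: 85 fast requests and 15 slow ones.
latencies = [50] * 85 + [2000] * 15

avg = mean(latencies)
p50 = percentile(latencies, 50)
p90 = percentile(latencies, 90)

print(avg)  # 342.5
print(p50)  # 50
print(p90)  # 2000
```

The average of 342.5 ms matches no real request, while p50 shows the typical case and p90 exposes the slow tail.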
Histograms
● Show the whole distribution
● Configurable buckets
#3. Use percentiles or histograms
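A histogram with configurable buckets might be sketched as follows. The bucket bounds and latency samples are assumptions for illustration, and `histogram` is a hypothetical helper rather than any library's API.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in ms; the last bucket catches everything else.
BUCKETS = [100, 250, 500, 1000, float("inf")]


def histogram(values, bounds):
    """Count how many values fall into each bucket (value <= upper bound)."""
    counts = [0] * len(bounds)
    for value in values:
        counts[bisect_left(bounds, value)] += 1
    return counts


latencies = [40, 80, 120, 300, 90, 2000, 450, 60, 700, 95]
counts = histogram(latencies, BUCKETS)

print(counts)  # [5, 1, 2, 1, 1]
```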
Example alert
Alert Levels
● Send Slack/Teams message
● Send alert to on-call
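The two levels above can be expressed as a simple routing rule. This is a sketch under stated assumptions: the function name, the use of an error rate as the input, and both thresholds are hypothetical.

```python
def alert_level(error_rate, message_above=0.01, page_above=0.05):
    """Map an error rate to an alert level; both thresholds are assumed values."""
    if error_rate > page_above:
        return "page"      # higher level: send alert to on-call
    if error_rate > message_above:
        return "message"   # lower level: send a Slack/Teams message
    return "ok"


print(alert_level(0.001))  # ok
print(alert_level(0.02))   # message
print(alert_level(0.10))   # page
```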
Alerting tool is usually built into the metrics system
Alerts should be
● urgent
● important
● actionable
● real
Alerts should represent either ongoing or imminent problems
What to watch out for?
#1. Better to remove an alert when it’s noisy
#2. Use success rate
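One plausible reading of this rule: an absolute error count looks the same at peak traffic and at night, while a success rate scales with load. A minimal sketch, with hypothetical function names, threshold, and traffic numbers:

```python
def success_rate(success_count, total_count):
    """Fraction of successful requests; None when there was no traffic."""
    if total_count == 0:
        return None
    return success_count / total_count


def should_alert(success_count, total_count, min_rate=0.99):
    """Alert only when the success rate drops below the assumed threshold."""
    rate = success_rate(success_count, total_count)
    return rate is not None and rate < min_rate


# 50 failures in both cases, but very different user impact.
peak = should_alert(success_count=99_950, total_count=100_000)
night = should_alert(success_count=150, total_count=200)

print(peak)   # False
print(night)  # True
```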
Symptom-based monitoring
● Number of 5xx HTTP response codes
● Response time
● Email sending is not working
● Users can’t log in
Cause-based monitoring
● Free disk space on database server
● Memory utilisation
● Free file descriptors
Many causes may trigger a symptom
User impact is most important
#3. Focus on symptom-based alerts
Cause-based alerts are also necessary
Picking alerts to start with
● Components: Front-end, Load Balancer, Back-end, DB
● Count rate of successful log-ins
● Count request success rate
Logging
Logging system
● Application (several instances), each with a Log Collector
● Log Collector and Log Aggregation: Logstash, Fluentd
● Database: Elasticsearch, Loki
● Dashboard: Kibana
Log messages
Finding logs
Can search by:
● content of log message
message : *notification*
● all logs from a service
kubernetes.labels.app/name.keyword : "api-gateway"
● many more thanks to flexible query schema
What to watch out for?
#1. Use an appropriate log level: info, warn, error
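With Python's standard `logging` module, picking a level per message might look like the sketch below. The logger name `api-gateway` is taken from the query example earlier in the deck; the messages are made up, and the in-memory stream only keeps the example self-contained.

```python
import io
import logging

stream = io.StringIO()  # in-memory sink instead of a real log collector
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

log = logging.getLogger("api-gateway")
log.setLevel(logging.INFO)   # debug messages are dropped at this level
log.addHandler(handler)
log.propagate = False

log.debug("cache lookup took 3 ms")                 # too detailed for production
log.info("notification sent")                       # normal operation
log.warning("retrying notification, attempt 2")     # unexpected but handled
log.error("notification failed after 3 attempts")   # needs attention

lines = stream.getvalue().splitlines()
print(lines)
```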
Structured logging
● Append useful key=value pairs
● Can group (aggregate) by the keys
● Can sort by aggregations
#2. Use structured logging
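The points above can be illustrated with JSON log lines, one common way to carry key=value pairs. `log_event` and all field names and values here are hypothetical; in practice a log database would do the grouping and sorting.

```python
import json
from collections import Counter


def log_event(level, message, **fields):
    """Render one log line as JSON so every key can be searched and aggregated."""
    return json.dumps({"level": level, "message": message, **fields}, sort_keys=True)


lines = [
    log_event("error", "search failed", service="api-gateway", type="Patient", status=500),
    log_event("error", "search failed", service="api-gateway", type="Exam", status=500),
    log_event("error", "search failed", service="api-gateway", type="Patient", status=503),
]

# Group (aggregate) by a key, then sort by the aggregation.
by_type = Counter(json.loads(line)["type"] for line in lines)
top = by_type.most_common()

print(top)  # [('Patient', 2), ('Exam', 1)]
```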
Too many logs
● Pipeline: Application (several instances), Log Scraper, Log Aggregation, Real Time Search Engine, Dashboard
● Reduce log retention period
● Cold Storage with a Query UI
#3. Use a proper retention period or cold storage
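A retention rule combining both options could be sketched like this. The 7-day and 90-day periods, the tier names, and `storage_tier` itself are assumptions for illustration; real log stores configure this through their own lifecycle settings.

```python
from datetime import datetime, timedelta

# Assumed policy: 7 days searchable, up to 90 days in cold storage, then deleted.
HOT_RETENTION = timedelta(days=7)
COLD_RETENTION = timedelta(days=90)


def storage_tier(log_time, now):
    """Decide where a log entry of the given age should live."""
    age = now - log_time
    if age <= HOT_RETENTION:
        return "search-engine"
    if age <= COLD_RETENTION:
        return "cold-storage"
    return "delete"


now = datetime(2024, 1, 31)
tiers = [storage_tier(now - timedelta(days=d), now) for d in (1, 30, 120)]

print(tiers)  # ['search-engine', 'cold-storage', 'delete']
```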
Distributed tracing
https://www.youtube.com/watch?v=rM1z7Q1TxR0
End-to-end summary
1. Configure automated alerts
2. Use metrics and tracing to pinpoint the problem
3. Use structured logging to find the root cause of the problem easily
4. Fix problems and make sure all metrics are always back to normal
Thank you! Q&A
Nikolay Stoitsev
Engineering Manager at Halo DX
Photo by Pixabay, Şahin Sezer Dinçer, Andrea Piacquadio, Ian Beckley from Pexels
