Observability – the good, the bad, and the
ugly
Aleksandr Tavgen Playtech
About me
More than 19 years of
professional experience
FinTech and Data Science
background
From Developer to Site Reliability Engineer
Solved and automated operational
problems at scale
Overall problem
• A zoo of monitoring solutions in large enterprises, often distributed
around the world
• M&A transactions or distributed teams make central management
impossible or ineffective
• For small enterprises or startups the key question is finding the
best solution
• A lot of companies have failed this way
• A lot of anti-patterns have developed
Managing a
Zoo
• A lot of independent teams
• Everyone has some sort of
solution
• It is hard to get an overall picture
of operations
• It is hard to orchestrate and
make changes
QUITE OFTEN A ZOO LOOKS LIKE THIS
Common Anti-
patterns
It is tempting to keep everything
recorded just in case
The number of metrics in monitoring
grows exponentially
Nobody understands such a huge
pile of metrics
Engineering complexity grows as
well
Uber case – 9 billion metrics / 1000+ instances for monitoring
IF YOU NEED 9 BILLION METRICS, YOU ARE PROBABLY WRONG
Dashboards problem
• A proliferating number of metrics leads to unusable
dashboards
• How can one observe 9 billion metrics?
• Quite often it looks like spaghetti
• It is common to pursue this anti-pattern for approx. 1.5 years
• GitLab Dashboards are a good example
Actually not
• Dashboards are very useful when
you know where and when to watch
• Our brain can recognize and process
visual patterns more effectively
• But only when you know what you
are looking for and when
Queries
vs.
Dashboards
Querying your data requires more cognitive
effort than a quick look at dashboards
Metrics are a low-resolution view of your
system’s dynamics
Metrics should not replace logs
It is not necessary to have millions of them
What are
Incidents
• Something that has an impact
on the operational/business
level
• Incidents are expensive
• Incidents come with
credibility costs
COST OF AN
HOUR OF
DOWNTIME
2017-2018
https://www.statista.com/statistics/753938/worldwide-enterprise-server-hourly-downtime-cost/
• Change
• Network Failure
• Bug
• Human Factor
• Unspecified
• Hardware Failure
Causes of outage
Outage dynamics
Timeline of
Outage
Detection
Investigation
Escalation
Fixing
What is it all about?
• Any reduction of
outage/incident timeline
results in significant positive
financial impact
• It is about credibility as well
• And your DevOps teams
feel less pain and toil on
their way
Focus on KPI metrics
Metrics
• It is almost impossible to operate on
billions of metrics
• Even under normal system behavior, there
will always be outliers in real production
data
• Therefore, not all outliers should be
flagged as anomalous incidents
• Etsy Kale project case
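One simple guard against flagging every production outlier (an illustrative sketch, not the Etsy Kale approach) is to require that an outlier persists for several consecutive samples before treating it as an incident:

```python
def persistent_outliers(flags, k=3):
    """Only treat an outlier as an incident when it persists for k
    consecutive samples; single spikes in production data are normal."""
    run = 0
    incidents = []
    for i, flagged in enumerate(flags):
        run = run + 1 if flagged else 0
        if run == k:
            incidents.append(i - k + 1)   # index where the incident started
    return incidents

# One isolated spike, then a sustained anomaly:
flags = [False, True, False, False, True, True, True, True]
print(persistent_outliers(flags))  # [4]
```

The isolated spike at index 1 is ignored; only the sustained run starting at index 4 is reported.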
Paradigm Shift
• The main paradigm shift comes from the fields of infrastructure and
architecture
• Cloud architectures, microservices, Kubernetes, and immutable
infrastructure have changed the way companies build and operate
systems
• Virtualization, containerization, and orchestration frameworks abstract
away the infrastructure level
• Moving towards abstraction from the underlying hardware and
networking means that we must focus on ensuring that our
applications work as intended in the context of our business
processes.
KPI monitoring
• KPI metrics are related to the core business
operations
• It could be logins, active sessions, any domain
specific operations
• Heavily seasonal
• Static thresholds can’t help here
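For heavily seasonal KPIs, a baseline built from the same time slot in previous cycles works where static thresholds fail. A minimal sketch; the function name, window, and z-threshold are illustrative, not from the actual system:

```python
from statistics import mean, stdev

def seasonal_anomaly(history, current, z=3.0):
    """Compare `current` against the same slot in previous cycles.

    history: values observed at this slot (e.g. Monday 10:00) in past weeks.
    Returns True when `current` deviates more than z standard deviations
    from the seasonal baseline.
    """
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) / spread > z

# Logins seen at this hour over the last six weeks:
past = [980, 1010, 995, 1005, 990, 1020]
print(seasonal_anomaly(past, 1000))  # False: normal seasonal load
print(seasonal_anomaly(past, 120))   # True: sudden drop
```

A value that would trip any static threshold at night can be perfectly normal at peak hours; comparing against the seasonal slot sidesteps that.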
Our Solution
• Narrowing down the number of
metrics to a defined set of
KPI metrics
• We combined push/pull
model
• Local push
• Central pull
• And we created an ML-based
system, which learns your
metrics’ behavior
Predictive
Alerting System
Anomalies
combined with
rules
Based on dynamic
rules
Overwhelming
results
• Red area – Customer Detection
• Blue area – Own Observation (toil)
• Orange line – Central Grafana Introduced
• Green line – ML-based solution in prod
Customer Detection has dropped to
low percentage points
General view
• Finding anomalies on metrics
• Finding regularities on a higher
level
• Combining events from
organization internals
(changes/deployments)
• Stream processing architectures
Why do we need time-series storage?
• We have unpredictable network delays
• Operating worldwide is a problem
• CAP theorem
• You can receive signals from the past
• But you should look into the future too
• How long should this window be in the future?
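The trade-off between accepting signals from the past and bounding the window can be sketched as event-time windows with an allowed-lateness bound. This is a hypothetical illustration; class and constant names are not from the actual system:

```python
from collections import defaultdict

ALLOWED_LATENESS = 30  # seconds a signal may still arrive "from the past"

class EventTimeWindows:
    """Buckets (event_time, value) signals into fixed-width windows and
    drops signals older than the watermark allows."""

    def __init__(self, width):
        self.width = width
        self.watermark = 0          # highest event time seen so far
        self.windows = defaultdict(list)

    def offer(self, event_time, value):
        self.watermark = max(self.watermark, event_time)
        if event_time < self.watermark - ALLOWED_LATENESS:
            return False            # too late: ok to lose it for performance
        self.windows[event_time // self.width].append(value)
        return True

w = EventTimeWindows(width=10)
w.offer(100, "a")
w.offer(105, "b")
print(w.offer(60, "c"))  # False: 45s behind the watermark, dropped
```

Widening `ALLOWED_LATENESS` accepts more late signals at the cost of holding windows open longer; how long is an empirical trade-off.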
Why not Kafka and all those classical
streaming frameworks?
• Frameworks like Storm or Flink are oriented toward tuples, not time-ordered
events
• We do not want to process everything
• A lot of events are needed on-demand
• It is ok to lose some signals in favor of performance
• And we still have signals from the past
Why Influx v2.0
• Flux
• Better isolation
• Central storage for metrics, events, traces
• Same streaming paradigm
• There is no mismatch between metaquerying and querying
Taking a higher-level picture
• Finding anomalies on a lower level
• Tracing
• Event logs
• Finding regularities between them
• Building a topology
• We can call it AIOps as well
Open Tracing
• Tracing is a higher-resolution view of your
system’s dynamics
• Distributed tracing can show you
unknown-unknowns
• It reduces the Investigation part of the
Incident Timeline
• There is a good OSS Jaeger implementation
• Influx v2.0 is a supported backend storage
Jaeger with Influx v2.0 as a backend storage
• Real prod case
• Every minute approx. 8000
traces
• Performance issue with
limitation on I/O ops
connections
• Bursts of context switches
on the kernel level
Impact on the particular
execution flow
• DB query time is quite constant
• Processing time in normal case - 1-3 ms
• After a process context switch - more than 40 ms
Flux
• Multi-source joining
• Same functional composition paradigm
• Easy to test hypotheses
• You can combine metrics, event logs, and traces
• Data transformation based on conditions
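The same pipeline shape that Flux expresses natively (joining sources, then transforming on a condition) can be illustrated in plain Python; all names, timestamps, and thresholds here are hypothetical:

```python
# Join a metric stream with a deployment-event stream by time proximity,
# then flag points where elevated latency follows a recent deployment.

metrics = {1000: 120, 1060: 450, 1120: 130}   # ts -> latency in ms
deploys = {1050: "release-42"}                # ts -> deployment id

def annotate(metrics, deploys, window=60):
    """Tag each metric point with a deployment that landed in the
    preceding `window` seconds, and flag it when latency is elevated."""
    out = []
    for ts, latency in sorted(metrics.items()):
        recent = [d for dts, d in deploys.items() if 0 <= ts - dts <= window]
        out.append({
            "ts": ts,
            "latency": latency,
            "deploy": recent[0] if recent else None,
            "suspect": latency > 300 and bool(recent),
        })
    return out

for row in annotate(metrics, deploys):
    print(row)
```

The point at ts=1060 gets flagged: it is both slow and shortly after the deployment, which is exactly the kind of conditional multi-source correlation described above.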
Real incident
We need some statistical
models to operate on raw
data
Let’s check the logins part
• Let’s check the relations between them
• Looks more like a stationary time series
• Easier to model
Random Walk
• Processes have a lot of random
factors
• Random Walk modelling
• X(t) = X(t-1) + Er(t)
• Er(t) = X(t) - X(t-1)
• A stationary time series is very
easy to model
• No need for statistical models
• Just a reservoir with variance
Er(t) = X(t) - X(t-1)
Er(t) = discrete derivative of (X)
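Differencing the series as above is a one-liner; a minimal sketch:

```python
def discrete_derivative(series):
    """Er(t) = X(t) - X(t-1): turns a random-walk-like series into
    an (approximately) stationary one."""
    return [b - a for a, b in zip(series, series[1:])]

x = [100, 103, 101, 106, 104]   # random-walk-like raw KPI values
print(discrete_derivative(x))   # [3, -2, 5, -2]
```

The differenced values fluctuate around zero, which is what makes the simple variance-based treatment on the next slides possible.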
On a larger scale
• Simple to model
• Cheap memory reservoirs models
• Very fast
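A cheap, constant-memory "reservoir with variance" can be maintained with Welford's online algorithm. This is an illustrative sketch, not the production implementation:

```python
class VarianceReservoir:
    """Constant-memory running mean/variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def is_anomalous(self, x, z=3.0):
        spread = self.variance() ** 0.5
        return spread > 0 and abs(x - self.mean) > z * spread

r = VarianceReservoir()
for v in [3, -2, 5, -2, 1, 0, 2, -1]:   # differenced KPI signal
    r.add(v)
print(r.is_anomalous(40))  # True: a 40-unit jump stands out
```

Only three numbers per metric are kept in memory regardless of series length, which is what makes this both cheap and very fast at scale.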
Security case
• The failed-login ratio is related to overall
statistical activity
• People make typos
• Simple thresholds don’t work
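Since typo-driven failures scale with traffic, the failed-to-total login ratio is a more stable signal than any raw count. A hypothetical sketch; the baseline and factor are made-up tunables:

```python
def failed_login_alert(failed, total, baseline_ratio=0.05, factor=3.0):
    """Alert when the failed-login ratio significantly exceeds the usual
    typo rate, regardless of absolute traffic volume."""
    if total == 0:
        return False
    return failed / total > baseline_ratio * factor

print(failed_login_alert(failed=60, total=1000))  # False: 6% at high traffic
print(failed_login_alert(failed=30, total=100))   # True: 30% looks like an attack
```

A static threshold of, say, 50 failures per minute would fire constantly at peak traffic and miss a brute-force attempt during a quiet night; the ratio does neither.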
One Flux transformation pipeline
Real Alerts related to attacks on Login Service
Combining it all
together
Adding Traces and
Events can reduce the
Investigation part
Can pinpoint the Root
Cause
• It is all about semantics
• Datacenters, sites, services
• Graph topology based on time-series data
Timetrix
• As a lot of people from different
companies are involved in it
• We decided to open-source the core
engine
• Domain-specific integrations can
easily be added
• We plan to launch in Q3/Q4 2019
• Core engine is written in Java
• Great kudos to the bonitoo.io team for the
great drivers