Changing the Game: How Game Theory can break down silos
Kevin Crawley – Developer Relations // Instana
Principal SRE Architect & Co-Owner // Single
About Me
▫ Docker Captain
▫ Gitlab Hero
▫ DevOpsDays Nashville Organizer
▫ 20 years in software development
▫ 5+ years DevOps/SRE experience
Discussion Points
▪ How does Game Theory tear down Silos
▪ Characteristics of High Performance
Organizations
▪ DevOps and Site Reliability Engineers
▪ What SREs need to be effective
Let’s talk about
Game Theory
(I’m really bad at math)
source: Nirmal Mehta (Docker Captain)
What is Bad Equilibrium?
It’s a strategy that all players in the game can adopt and converge on, but it
won’t produce a desirable outcome for anyone.
https://pdfs.semanticscholar.org/30d1/a03db196384a17fed3247407fb5859f7c76b.pdf
Transformation: Focusing on Automation
https://devops-research.com/
Where do silos come from?
Silos can be defined as the contention which exists
between functional units within an organization.
This contention usually manifests between teams
where change management policy requirements
and risks are high.
Nash Equilibrium
(Prisoner's Dilemma)
A concept of game theory where the optimal
outcome of a game is one where no player has an
incentive to deviate from her chosen strategy after
considering an opponent's choice.
https://en.wikipedia.org/wiki/Nash_equilibrium
Video removed due to file size. Search YouTube for “Golden Balls”.
Nash Equilibrium outcomes
Pareto Efficiency
Is a state of allocation of resources in which it is
impossible to make any one individual better off
without making at least one individual worse off.
… loosely, ZERO SUM at the margin: any further gain for one player is another player's loss
https://en.wikipedia.org/wiki/Pareto_efficiency
DevOps Game Theory / Observability Deck
Pareto Inefficiency
A situation is inefficient if someone can be made better off even after
compensating those made worse off.
Pareto Inefficient Nash Equilibrium
… is a Bad Equilibrium
Video removed due to file size. Search YouTube for “Golden Balls”.
Don’t like the game?
Change the Game
New Nash Equilibrium
Pareto Inefficient Nash Equilibrium
Gives you permission and proof to change the
game
Change the Game
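The ideas above can be sketched in a few lines of Python. This is an illustration only: the two "teams", the strategy names, the payoff numbers, and the `shared_penalty` rule are all assumptions made up for the example, not figures from the talk. The sketch finds the pure-strategy Nash equilibrium of a Prisoner's Dilemma style game, checks whether it is Pareto efficient, then changes the payoffs and looks again.

```python
from itertools import product

# Hypothetical payoffs (higher is better) for two teams, e.g. Dev and Ops.
# Each team either "cooperate"s (shares risk and information) or "defend"s
# its own silo. The numbers follow the classic Prisoner's Dilemma ordering.
STRATEGIES = ("cooperate", "defend")

SILOED_GAME = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defend"): (0, 5),
    ("defend", "cooperate"): (5, 0),
    ("defend", "defend"): (1, 1),
}


def nash_equilibria(game):
    """Return pure-strategy profiles where neither player gains by deviating alone."""
    found = []
    for a, b in product(STRATEGIES, repeat=2):
        pay_a, pay_b = game[(a, b)]
        a_is_best = all(game[(alt, b)][0] <= pay_a for alt in STRATEGIES)
        b_is_best = all(game[(a, alt)][1] <= pay_b for alt in STRATEGIES)
        if a_is_best and b_is_best:
            found.append((a, b))
    return found


def is_pareto_efficient(game, profile):
    """True if no other outcome helps one player without hurting the other."""
    base = game[profile]
    for other in game.values():
        no_one_worse = all(o >= b for o, b in zip(other, base))
        someone_better = any(o > b for o, b in zip(other, base))
        if no_one_worse and someone_better:
            return False
    return True


def change_the_game(game, shared_penalty):
    """Hypothetical rule change: a team that defends its silo pays `shared_penalty`."""
    changed = {}
    for (a, b), (pay_a, pay_b) in game.items():
        changed[(a, b)] = (
            pay_a - (shared_penalty if a == "defend" else 0),
            pay_b - (shared_penalty if b == "defend" else 0),
        )
    return changed


bad = nash_equilibria(SILOED_GAME)
new_game = change_the_game(SILOED_GAME, shared_penalty=3)
good = nash_equilibria(new_game)

print("siloed:", bad, "Pareto efficient:", is_pareto_efficient(SILOED_GAME, bad[0]))
print("changed:", good, "Pareto efficient:", is_pareto_efficient(new_game, good[0]))
```

With these made-up numbers the siloed game settles on both teams defending, which is a Pareto inefficient (bad) equilibrium; once defending carries a shared cost, mutual cooperation becomes the equilibrium instead.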
Percentage of Work Done Manually
                           ELITE        HIGH         LOW
                           PERFORMERS   PERFORMERS   PERFORMERS
Configuration Management   5%           10%          30%
Testing                    10%          20%          30%
Deployments                5%           10%          30%
Change approval process    10%          30%          40%
https://devops-research.com/
High Performance vs Low Performance Organizations
High Performers
▪ Deployments: between once per hour and once per day
▪ Lead Time for Changes: between 1 day and 1 week
▪ MTTR: < 1 day
▪ Change Failure Rate: 0-15%
Low Performers
▪ Deployments: between once per week and once per month
▪ Lead Time for Changes: between 1 month and 6 months
▪ MTTR: between 1 week and 1 month
▪ Change Failure Rate:
https://devops-research.com/
What happens when we tear down the silos
and become a DevOps organization?
▪ We ship more software more often,
complexity increases and reliability starts to
decline
▪ We naturally shift our focus to solve the
scalability and reliability issues (alternatively
we give up and readopt the monolith)
▪ Rise of the Site Reliability Engineers
Transformation: Focusing on Information
https://devops-research.com/
What are some tools / processes that
organizations can put in place to change
our equilibrium and communicate?
▪ Communication & Collaboration Tools
▫ Slack, Git, PagerDuty, OpsGenie
▪ Observability (SRE) Tooling
▫ Custom Dashboards / Metrics /
Alerting
▫ Log Analytics
▫ Distributed Tracing
What do SREs care about?
▪ Reliability (this one is obvious)
▪ Performance (is the customer happy?)
▪ Costs (is the business happy?)
SREs are in the business of measurement: they set objectives as SLOs
(service level objectives) and track them by measuring SLIs (service level indicators).
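As a rough sketch of how an SLI, an SLO, and the resulting error budget relate, assuming a simple availability SLI (good events divided by total events). The function names, the event counts, and the 99.9% target are hypothetical values chosen for the example.

```python
def availability_sli(good_events, total_events):
    """SLI: fraction of events that were 'good' (for example, non-5xx responses)."""
    if total_events == 0:
        return 1.0  # no traffic, so nothing violated the objective
    return good_events / total_events


def error_budget_remaining(sli, slo_target):
    """Fraction of the error budget still unspent; negative means overspent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


# Hypothetical month: 100,000 requests, 50 of them failed, SLO of 99.9%.
sli = availability_sli(good_events=99_950, total_events=100_000)
budget = error_budget_remaining(sli, slo_target=0.999)
print(f"SLI={sli:.4%} error budget remaining={budget:.0%}")
```

Here half of the allowed failures have been used, so roughly half of the budget remains for risky changes.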
What do SREs typically measure
▪ Error Rates
▪ Latency
▪ Throughput
▪ Saturation
“The Four Golden Signals” - https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/
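A minimal sketch of deriving these four signals from one window of raw request records. The `Request` shape, the sample data, and the `in_flight` / `capacity` inputs used for saturation are assumptions for illustration; real systems usually take these from a metrics or tracing backend.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarise one time window as error rate, latency, throughput, saturation."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_ms for r in requests]
    return {
        "error_rate": errors / total if total else 0.0,
        "latency_p50_ms": percentile(durations, 50) if total else None,
        "latency_p99_ms": percentile(durations, 99) if total else None,
        "throughput_rps": total / window_seconds,
        # Saturation here is a stand-in: busy workers over available workers.
        "saturation": in_flight / capacity,
    }


# Hypothetical 10-second window: eight fast requests, one slower, one failure.
SAMPLE = [Request(20.0, 200)] * 8 + [Request(40.0, 200), Request(900.0, 503)]
signals = golden_signals(SAMPLE, window_seconds=10, in_flight=30, capacity=100)
print(signals)
```

Note how the median hides the 900 ms failure while the P99 exposes it, which is why latency is usually reported at more than one percentile.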
What is Observability?
Kalman, 1961 paper
On the general theory of control systems
▪ A system is observable if the behavior of the entire system
can be determined by only looking at its inputs and outputs.
▪ Lesson: control theory is a well-documented approach which
people can learn from rather than trying to reinvent it
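For reference, the standard textbook form of this definition for a linear time-invariant system is the rank test below. It is included as background on the control-theory idea, not as something taken from the slides; software systems are not linear, so read it as an analogy. The symbols A, B, C, D, x, u, y and n follow the usual state-space notation.

```latex
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad
y(t) = C\,x(t) + D\,u(t), \qquad
x(t) \in \mathbb{R}^{n}

\mathcal{O} =
\begin{bmatrix} C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1} \end{bmatrix},
\qquad
\text{the system is observable} \iff \operatorname{rank}(\mathcal{O}) = n
```

In words: if the outputs, taken together over time, carry enough independent information, the full internal state x can be reconstructed from them.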
Can we get some pillars?
The 4 pillars of Observability were originally described in a blog
article from Twitter:
▪ Monitoring
▪ Log Aggregation / Analytics
▪ Distributed systems tracing infrastructure
▪ Alerting / Visualization
https://blog.twitter.com/engineering/en_us/a/2016/observability-at-twitter-technical-overview-part-i.html
More than just
pillars…
“While plainly having access to logs, metrics, and traces
doesn’t necessarily make systems more observable, these
are powerful tools that, if understood well, can unlock the
ability to build better systems.”
- Cindy Sridharan
https://www.oreilly.com/library/view/distributed-systems-observability/9781492033431/
Observability gives us the means to
understand all of the behavior in our
systems
▪ Not just tooling, it’s how
we model and analyze
data
▪ Similar to how DevOps is
a mindset / culture
▪ No longer treating
services like Schrödinger's
cat
▪ (A lot) more context
around events and
transactions
https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
Why does my
organization need any
of this?
This sounds like a lot of work …
How many of you are running
staging environments?
How many of you actually trust
your staging environments?
In order to observe a system, we must emit
signals and analyze the aggregates.
Those aggregates can answer the
following questions (and more):
▪ Number of Reqs / Retries / Backoffs
(throughput)
▪ Request parameters / Query Statements
(details)
▪ Latency / Outliers (performance)
▪ Top-Level Exceptions / Log Messages (error
analysis)
How can we collect this data?
Distributed Tracing
▪ Also known as Distributed Structured
Logging
▪ Larger Payloads
▪ Rich Contextual Data
https://w3c.github.io/trace-context
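The W3C Trace Context specification linked above carries trace identity between services in a `traceparent` HTTP header of the form `version-traceid-parentid-flags`. The sketch below creates, parses, and propagates that header for version `00` only; the helper names are hypothetical, and a real service would normally rely on its tracing library rather than hand-rolled code.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


def new_traceparent(sampled=True):
    """Start a new trace: random trace id and span id, version 00."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the header fields as a dict, or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    fields = match.groupdict()
    if fields["version"] == "ff":
        return None  # version ff is forbidden by the specification
    if set(fields["trace_id"]) == {"0"} or set(fields["parent_id"]) == {"0"}:
        return None  # all-zero ids are invalid
    fields["sampled"] = bool(int(fields["flags"], 16) & 0x01)
    return fields


def child_traceparent(header):
    """Header for an outgoing call: same trace id, fresh parent (span) id."""
    fields = parse_traceparent(header)
    if fields is None:
        return new_traceparent()
    return f"00-{fields['trace_id']}-{secrets.token_hex(8)}-{fields['flags']}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(parse_traceparent(incoming)["trace_id"], outgoing)
```

Because every hop keeps the same trace id, the backend can stitch the spans from all services into one trace.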
Sampling vs. No Sampling
▪ Sampling traces may cause important
outliers (P95/P99) to be missed
▪ Extremely high volume systems must
sample due to massive overhead
▪ Start without sampling, adopt as needed,
incorporate solutions which sample
adaptively
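One way to sample without losing the outliers is to decide after a trace has finished (often called tail-based sampling): always keep errors and slow traces, and keep only a fraction of the rest. The sketch below is an assumption-laden illustration of that idea, not the approach of any particular vendor; the thresholds, rates, and function names are hypothetical.

```python
import zlib


def keep_trace(trace_id, duration_ms, has_error, *,
               slow_threshold_ms=500.0, baseline_rate=0.1):
    """Keep every error and slow trace; keep a fixed share of the rest."""
    if has_error or duration_ms >= slow_threshold_ms:
        return True
    # Hash the trace id so every service reaches the same decision for a trace.
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
    return bucket < baseline_rate * 10_000


def adaptive_rate(incoming_per_second, target_kept_per_second, *, floor=0.001):
    """Pick a baseline rate so kept volume stays near the target."""
    if incoming_per_second <= 0:
        return 1.0  # nothing to shed, keep everything
    rate = target_kept_per_second / incoming_per_second
    return max(floor, min(1.0, rate))


# Low traffic: keep everything. High traffic: shed routine traces only.
print(adaptive_rate(50, 100), adaptive_rate(10_000, 100))
print(keep_trace("4bf92f3577b34da6a3ce929d0e0e4736", 1200.0, has_error=False))
```

This mirrors the advice above: at low volume the rate stays at 1.0 (no sampling), and it only drops as traffic grows.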
How has Observability helped enable a
DevOps culture?
Let’s take a look at a production
microservice application which has been
instrumented by a distributed tracing
solution
▪ Operated by 3 engineers (1 FE / 1 BE / 1 SRE)
▪ Over 20k transactions / hour, 20+ integrations, 150k LOC, with less
than 15% test coverage
▪ Launched in 2018 with 15 microservices on Docker Swarm – has since
expanded to over 35 microservices with zero additional engineering
personnel
▪ One-touch deployment and provisioning for new and existing services
Visualizing
Large and
Complex
Systems
Analyzing
Distributed Trace
Aggregates
What happens if we aggregate timing, error rate, and number of
requests for each endpoint on a service?
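A small sketch of that aggregation, assuming each finished span has been reduced to an endpoint name, a duration, and an error flag. The record shape and the sample values are hypothetical; a tracing backend would do this continuously over far more data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical span records collected from one service.
SPANS = [
    {"endpoint": "GET /orders", "duration_ms": 40.0, "error": False},
    {"endpoint": "GET /orders", "duration_ms": 60.0, "error": False},
    {"endpoint": "POST /checkout", "duration_ms": 120.0, "error": False},
    {"endpoint": "POST /checkout", "duration_ms": 900.0, "error": True},
    {"endpoint": "POST /checkout", "duration_ms": 180.0, "error": False},
]


def aggregate_by_endpoint(spans):
    """Group spans by endpoint and summarise count, error rate, and timing."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span["endpoint"]].append(span)

    summary = {}
    for endpoint, items in grouped.items():
        durations = [s["duration_ms"] for s in items]
        summary[endpoint] = {
            "requests": len(items),
            "error_rate": sum(s["error"] for s in items) / len(items),
            "mean_ms": mean(durations),
            "max_ms": max(durations),
        }
    return summary


summary = aggregate_by_endpoint(SPANS)
for endpoint, stats in summary.items():
    print(endpoint, stats)
```

One table like this per service replaces a dashboard per endpoint, and a single bad endpoint stands out immediately.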
What problems has Distributed Tracing helped solve?
Database Optimizations, Caching, and Concurrency
@notsureifkevin
Exponential
Backoff
Slow Death
of a Service
Rise in Latency + Processing Time
▪ DBO (Hibernate Query) causing O(n log n) rise in latency and
processing time
▪ Application Dashboard indicated an issue with overall latency
increasing
▪ Fix deployed and improvement was observed immediately
Issue Resolved
Caching Solved one problem
… but caused another
▪ We implemented Redis for caching, and processing time went
down
▪ However, we didn’t account for token policies changing and
they suddenly began to expire after 30 seconds
▪ Alerting on error rates for this endpoint made us
aware of the issue
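One common guard against this class of problem is to tie the cache lifetime to the token's own expiry rather than to a fixed value. The sketch below assumes the failure was cached tokens outliving their validity, which the slide does not state outright. It uses an in-memory stand-in rather than Redis, and `TokenCache`, `get_token`, and the `fetch_token` callback are hypothetical names.

```python
import time


class TokenCache:
    """In-memory stand-in for a cache such as Redis (illustration only)."""

    def __init__(self, clock=time.monotonic, safety_margin=5.0):
        self._clock = clock
        self._margin = safety_margin
        self._entries = {}

    def put(self, key, token, expires_in):
        """Cache the token for slightly less than its reported lifetime."""
        ttl = max(0.0, expires_in - self._margin)
        self._entries[key] = (token, self._clock() + ttl)

    def get(self, key):
        """Return the cached token, or None if it is missing or too old."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        token, deadline = entry
        if self._clock() >= deadline:
            del self._entries[key]
            return None
        return token


def get_token(cache, key, fetch_token):
    """Use the cached token if still valid, otherwise fetch and cache a new one."""
    token = cache.get(key)
    if token is None:
        token, expires_in = fetch_token()  # returns (token, lifetime in seconds)
        cache.put(key, token, expires_in)
    return token
```

If the provider shortens token lifetimes, as happened here with the 30-second expiry, the cache follows automatically because the TTL comes from `expires_in` on each fetch.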
Context is critical
Metrics are not standalone; they have relationships
Custom
Dashboards
We utilize a mix of Instana, Logz.io and Grafana to manage
our systems
Focusing on Observability
▪ Enables your organization to understand the behavior
of your system
▪ Empowers your engineers to find and fix problems
▪ Enables you to build more reliable systems and ship
software faster
▪ Promotes empathy through understanding,
transparency, and communication.
Want to learn more about monitoring
production microservice apps?
▪ Follow me on Twitter for upcoming workshops
@notsureifkevin & @InstanaHQ
▪ Sign up for our newsletter at the iPad and enter to win an e-gift
card (delivered via email)
▪ Get a free trial of Instana @ https://instana.com

Editor's Notes

  • #3: My name is Kevin. I’ve been using Docker and maintaining distributed application systems in production since 2014. I help organize events in my local area and speak on topics such as devops, automation, culture, and observability.
  • #7: This is what happens when orgs try to: Speed up delivery Reduce MTTR Reduce lead times
  • #10: Video removed
  • #15: We all understand the game, but we don’t know how to change the rules to gain an advantage
  • #25: This is what happens when orgs try to: Speed up delivery Reduce MTTR Reduce lead times
  • #29: Time-sharing computers Computer guided missiles Air Defense Network goes online
  • #39: 2. computational complexity and bandwidth requirements of distributed tracing (Lyft, Netflix, Google, etc) 3. These solutions work around inefficient consumers and processing systems (they’re typically not stream based) 4. Unless of course you’re trying to do this yourself, in which case the complexity of running these systems is extremely high, the other condition is you truly are a behemoth, in which case you probably already know most of this stuff already
  • #43: Over 150 containers Spread across multiple hosts / azs Two separate environments
  • #46: High level overview of all the services in production
  • #47: Single music has over 30 services in production, we can’t possibly monitor 30 dashboards at a time … or can we?