Changing the Game: How Game Theory can break down silos
Kevin Crawley – Developer Relations // Instana
Principal SRE Architect & Co-Owner // Single
Twitter: @notsureifkevin
About Me
▫ Docker Captain
▫ GitLab Hero
▫ DevOpsDays Nashville Organizer
▫ 20 years in software development
▫ 5+ years DevOps/SRE experience
Discussion Points
▪ How does Game Theory tear down Silos?
▪ Characteristics of High Performance
Organizations
▪ DevOps and Site Reliability Engineers
▪ What SREs need to be effective
Let’s talk about
Game Theory
(Disclaimer: I’m bad at math)
source: Nirmal Mehta (Docker Captain)
What is Bad Equilibrium?
It’s a strategy that all players in the game can adopt and converge on, but it
won’t produce a desirable outcome for anyone.
https://pdfs.semanticscholar.org/30d1/a03db196384a17fed3247407fb5859f7c76b.pdf
Transformation: Focusing on Automation
https://devops-research.com/
Where do silos come from?
Silos can be defined as the contention which exists
between functional units within an organization.
This contention usually manifests between teams
where change management policy requirements
and risks are high.
Nash Equilibrium
(Prisoner’s Dilemma)
A concept of game theory where the optimal
outcome of a game is one where no player has an
incentive to deviate from her chosen strategy after
considering an opponent’s choice.
https://en.wikipedia.org/wiki/Nash_equilibrium
Split / Steal – Example 1
▪ Video has been removed to save bandwidth, you may view it
here on YouTube
https://www.youtube.com/watch?v=p3Uos2fzIJ0
Nash Equilibrium outcomes
Pareto Efficiency
Is a state of allocation of resources in which it is
impossible to make any one individual better off
without making at least one individual worse off.
… from that point on, any gain is ZERO SUM: one player’s win is another’s loss
https://en.wikipedia.org/wiki/Pareto_efficiency
Changing the Game: Breaking Down IT Silos
Pareto Inefficiency
A situation is inefficient if someone can be made better off even after
compensating those made worse off.
Pareto Inefficient Nash Equilibrium
… is a Bad Equilibrium
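The definitions above can be checked mechanically on a small game. The following is a minimal sketch, not something from the talk: the payoff numbers are hypothetical values in the classic Prisoner's Dilemma shape, and the function names are made up for illustration. It lists the pure-strategy Nash equilibria and flags the ones that are Pareto inefficient.

```python
from itertools import product

# Hypothetical payoffs (higher is better), Prisoner's Dilemma shaped.
# Key: (row strategy, column strategy) -> (row payoff, column payoff)
STRATEGIES = ("cooperate", "defect")
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}


def nash_equilibria(payoffs, strategies=STRATEGIES):
    """Pure-strategy profiles where neither player gains by deviating alone."""
    found = []
    for row, col in product(strategies, repeat=2):
        row_pay, col_pay = payoffs[(row, col)]
        row_ok = all(payoffs[(alt, col)][0] <= row_pay for alt in strategies)
        col_ok = all(payoffs[(row, alt)][1] <= col_pay for alt in strategies)
        if row_ok and col_ok:
            found.append((row, col))
    return found


def pareto_dominated(profile, payoffs):
    """True if some other outcome is no worse for both players and better for one."""
    mine = payoffs[profile]
    return any(
        all(o >= m for o, m in zip(other, mine))
        and any(o > m for o, m in zip(other, mine))
        for key, other in payoffs.items()
        if key != profile
    )


def bad_equilibria(payoffs):
    """Nash equilibria that are also Pareto inefficient."""
    return [p for p in nash_equilibria(payoffs) if pareto_dominated(p, payoffs)]
```

With these assumed payoffs, mutual defection is the only equilibrium, and it is dominated by mutual cooperation, which is what makes it a Bad Equilibrium.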
Split / Steal Example 2
▪ Video has been removed to save bandwidth, you may view it
here on YouTube
https://www.youtube.com/watch?v=S0qjK3TWZE8
Don’t like the game?
Change the Game
New Nash Equilibrium
Pareto Inefficient Nash Equilibrium
Gives you permission and proof to change the
game
Change the Game
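One way to picture "changing the game" is to change the payoffs and see where the equilibrium moves. This is a rough sketch under assumed numbers: the base payoffs and the `penalty` parameter are hypothetical, standing in for any rule change (shared goals, shared on-call, and so on) that makes going it alone less attractive.

```python
from itertools import product

STRATEGIES = ("cooperate", "defect")


def payoffs_with_penalty(penalty):
    """Prisoner's Dilemma shaped payoffs where each defector pays `penalty`."""
    base = {
        ("cooperate", "cooperate"): (3, 3),
        ("cooperate", "defect"): (0, 5),
        ("defect", "cooperate"): (5, 0),
        ("defect", "defect"): (1, 1),
    }
    adjusted = {}
    for (row, col), (row_pay, col_pay) in base.items():
        adjusted[(row, col)] = (
            row_pay - (penalty if row == "defect" else 0),
            col_pay - (penalty if col == "defect" else 0),
        )
    return adjusted


def nash_equilibria(payoffs):
    """Pure-strategy profiles where neither player gains by deviating alone."""
    found = []
    for row, col in product(STRATEGIES, repeat=2):
        row_pay, col_pay = payoffs[(row, col)]
        row_ok = all(payoffs[(alt, col)][0] <= row_pay for alt in STRATEGIES)
        col_ok = all(payoffs[(row, alt)][1] <= col_pay for alt in STRATEGIES)
        if row_ok and col_ok:
            found.append((row, col))
    return found


old_game = nash_equilibria(payoffs_with_penalty(0))  # original rules
new_game = nash_equilibria(payoffs_with_penalty(3))  # rules changed
```

Under the original rules the players settle on mutual defection; once defecting costs enough, mutual cooperation becomes the new equilibrium.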
Percentage of Work Done Manually
                          ELITE        HIGH         LOW
                          PERFORMERS   PERFORMERS   PERFORMERS
Configuration Management  5%           10%          30%
Testing                   10%          20%          30%
Deployments               5%           10%          30%
Change approval process   10%          30%          40%
https://devops-research.com/
High Performance vs Low Performance
Organizations
High Performers
▪ Deployments:
> 1 hour and < 1 day
▪ Lead Time for
Changes:
> 1 day and < 1 week
▪ MTTR:
< 1 day
▪ Change Failure Rate:
0-15%
Low Performers
▪ Deployments:
Once per week/month
▪ Lead Time for Changes:
> 1 month and < 6 months
▪ MTTR:
> 1 week and < 1 month
▪ Change Failure Rate:
https://devops-research.com/
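The four measures compared above can be derived from a plain log of deployments. The sketch below is illustrative only: the `Deployment` record, its field names, and the sample data are assumptions, and time to restore is measured from the deployment that caused the failure, which is a simplification.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def hours(delta):
    return delta.total_seconds() / 3600


def delivery_metrics(deployments, window_days):
    """Deployment frequency, lead time, change failure rate, and time to restore."""
    lead_times = [hours(d.deployed_at - d.committed_at) for d in deployments]
    failures = [d for d in deployments if d.failed]
    restores = [hours(d.restored_at - d.deployed_at) for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore_hours": sum(restores) / len(restores) if restores else None,
    }


# Made-up sample: four deployments over two days, one of which failed.
start = datetime(2020, 1, 6, 9, 0)
sample = [
    Deployment(start, start + timedelta(hours=2)),
    Deployment(start + timedelta(hours=1), start + timedelta(hours=5),
               failed=True, restored_at=start + timedelta(hours=6)),
    Deployment(start + timedelta(hours=24), start + timedelta(hours=30)),
    Deployment(start + timedelta(hours=26), start + timedelta(hours=34)),
]
metrics = delivery_metrics(sample, window_days=2)
```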
What happens when we tear down the silos
and become a DevOps organization?
▪ We ship more software more often,
complexity increases and reliability starts to
decline
▪ We naturally shift our focus to solve the
scalability and reliability issues (alternatively
we give up and readopt the monolith)
▪ Rise of the Site Reliability Engineers
Transformation: Focusing on Information
https://devops-research.com/
What are some tools / processes that
organizations can put in place to change
our equilibrium and communicate?
▪ Communication & Collaboration Tools
▫ Slack, Git, PagerDuty, OpsGenie
▪ Observability (SRE) Tooling
▫ Custom Dashboards / Metrics /
Alerting
▫ Log Analytics
▫ Distributed Tracing
What do SREs care about?
▪ Reliability (this one is obvious)
▪ Performance (is the customer happy?)
▪ Costs (is the business happy?)
SREs are in the business of measurement and
define objectives through SLOs by measuring SLIs.
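As a small illustration of how an SLI relates to an SLO, the sketch below computes an availability SLI from event counts and the share of error budget that is left. The function names and the example numbers are assumptions chosen for illustration, not figures from the talk.

```python
def availability_sli(good_events, total_events):
    """SLI: the fraction of events that were good (for example, non-error responses)."""
    return good_events / total_events


def error_budget_remaining(sli, slo):
    """Share of the error budget left.

    1.0 means untouched, 0.0 means fully spent, negative means the SLO is breached.
    """
    allowed_failure = 1.0 - slo  # how much unreliability the SLO tolerates
    actual_failure = 1.0 - sli   # how much unreliability was measured
    return 1.0 - actual_failure / allowed_failure


# Assumed example: 999,500 good requests out of 1,000,000 against a 99.9% SLO.
sli = availability_sli(999_500, 1_000_000)
remaining = error_budget_remaining(sli, slo=0.999)  # roughly half the budget left
```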
What do SREs typically measure?
▪ Error Rates
▪ Latency
▪ Throughput
▪ Saturation
“The Four Golden Signals” - https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/
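To make the four signals concrete, here is a minimal sketch that derives them from a window of request records. The `Request` record, the choice of HTTP 5xx as "error", and the use of in-flight requests over capacity as the saturation measure are all assumptions; real systems pick whichever resource is closest to its limit.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Error rate, latency, throughput, and saturation for one time window."""
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "error_rate": errors / len(requests),
        "latency_p50_ms": percentile(durations, 50),
        "latency_p99_ms": percentile(durations, 99),
        "throughput_rps": len(requests) / window_seconds,
        "saturation": in_flight / capacity,
    }


# Made-up window: four requests in two seconds, one server error.
window = [Request(10.0, 200), Request(20.0, 200), Request(30.0, 500), Request(40.0, 200)]
signals = golden_signals(window, window_seconds=2, in_flight=5, capacity=10)
```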
What is Observability?
Kalman, 1961 paper
On the general theory of control systems
▪ A system is observable if its internal state, and so the behavior of
the entire system, can be determined by looking only at its inputs and outputs.
▪ Lesson: control theory is a well-documented approach that people can
learn from rather than trying to reinvent it
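In control theory this idea has a concrete test for linear systems: with n states, the system x[k+1] = A x[k], y[k] = C x[k] is observable when the matrix stacking C, CA, ..., CA^(n-1) has rank n. The sketch below applies that test to a two-state example (position and velocity); the example system and function names are chosen purely for illustration.

```python
def observability_matrix(a, c):
    """For a 2-state system, stack the rows C and C*A."""
    ca = [
        c[0] * a[0][0] + c[1] * a[1][0],
        c[0] * a[0][1] + c[1] * a[1][1],
    ]
    return [list(c), ca]


def is_observable(a, c):
    """A 2x2 observability matrix has full rank when its determinant is non-zero."""
    (p, q), (r, s) = observability_matrix(a, c)
    return p * s - q * r != 0


# State = [position, velocity]; each step, position advances by velocity.
A = [[1, 1],
     [0, 1]]

measure_position = [1, 0]  # output exposes position only
measure_velocity = [0, 1]  # output exposes velocity only

position_case = is_observable(A, measure_position)
velocity_case = is_observable(A, measure_velocity)
```

Watching position over time reveals velocity as well, so the first case is observable. Watching only velocity never reveals where the system started, so the second is not. The analogy for software is that which signals a service emits determines what can be inferred about its internals.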
Can we get some pillars?
The 4 pillars of Observability were originally described in a blog
article from Twitter:
▪ Monitoring
▪ Log Aggregation / Analytics
▪ Distributed systems tracing infrastructure
▪ Alerting / Visualization
https://blog.twitter.com/engineering/en_us/a/2016/observability-at-twitter-technical-overview-part-i.html
More than just
pillars…
“While plainly having access to logs, metrics, and traces
doesn’t necessarily make systems more observable, these
are powerful tools that, if understood well, can unlock the
ability to build better systems.”
- Cindy Sridharan
https://www.oreilly.com/library/view/distributed-systems-observability/9781492033431/
Observability gives us the means to
understand all of the behavior in our
systems
▪ Not just tooling, it’s how
we model and analyze
data
▪ Similar to how DevOps is
a mindset / culture
▪ No longer treating
services like Schrödinger's
cat
▪ (A lot) more context
around events and
transactions
https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
Why does my
organization need any
of this?
This sounds like a lot of work …
How many of you are running
staging environments?
How many of you actually trust
your staging environments?
In order to observe a system, we must emit
signals and analyze the aggregates.
Those aggregates can answer the
following questions (and more):
▪ Number of Reqs / Retries / Backoffs
(throughput)
▪ Request parameters / Query Statements
(details)
▪ Latency / Outliers (performance)
▪ Top-Level Exceptions / Log Messages (error
analysis)
How can we collect this data?
Distributed Tracing
▪ Also known as Distributed Structured
Logging
▪ Larger Payloads
▪ Rich Contextual Data
https://w3c.github.io/trace-context
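The Trace Context specification linked above defines a `traceparent` HTTP header that carries the trace across service boundaries: a version, a 16-byte trace id, an 8-byte parent (span) id, and flags, all lowercase hex and separated by dashes. The sketch below builds and parses that header for version 00 only; the helper names are hypothetical and it skips the companion `tracestate` header.

```python
import re
import secrets

# version 00 - 32 hex trace id - 16 hex parent id - 2 hex flags
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_traceparent(sampled=True):
    """Start a new trace with random ids."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the header's fields, or None if it is not a valid version 00 header."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero ids are invalid per the spec
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(header):
    """Header for an outgoing call: same trace id, fresh span id."""
    parsed = parse_traceparent(header)
    if parsed is None:
        return new_traceparent()
    flags = "01" if parsed["sampled"] else "00"
    return f"00-{parsed['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

Because every hop reuses the trace id, the spans emitted by separate services can later be stitched back into one request.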
Sampling vs. No Sample
▪ Sampling traces may cause important
outliers (P95/P99) to be missed
▪ Extremely high volume systems must
sample due to massive overhead
▪ Start without sampling, adopt as needed,
incorporate solutions which sample
adaptively
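One way to sample without losing the outliers is to decide after a trace has finished: always keep errors and slow traces, and keep only a fraction of the rest. The sketch below is a simplified stand-in for the adaptive sampling mentioned above, not any vendor's algorithm; the threshold, rate, and function name are assumptions.

```python
import hashlib


def keep_trace(trace_id, duration_ms, is_error, slow_ms=1000.0, base_rate=0.1):
    """Decide whether to keep a finished trace.

    Errors and slow traces are always kept. Everything else is kept with
    probability `base_rate`, chosen deterministically from the trace id so
    every service makes the same decision for the same trace.
    """
    if is_error or duration_ms >= slow_ms:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < base_rate
```

The trade-off is that the decision needs the whole trace, so spans must be buffered until the request completes.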
How has Observability helped enable a
DevOps culture?
Let’s take a look at a production
microservice application which has been
instrumented by a distributed tracing
solution
▪ Operated by 3 engineers (1 FE/1 BE/1 SRE)
▪ Over 20k transactions / hour, 20+ integrations, 150k LOC, with less
than 15% test coverage
▪ Launched in 2018 with 15 microservices on Docker Swarm – has since
expanded to over 35 microservices with zero additional engineering
personnel
▪ One-touch deployment and provisioning for new and existing services
Visualizing
Large and
Complex
Systems
Analyzing
Distributed Trace
Aggregates
What happens if we aggregate timing, error rate, and # of
reqs for each endpoint on a service?
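A rough sketch of that aggregation, assuming each trace yields simple span records: the `Span` fields and the endpoint names are hypothetical, and a tracing product would compute this continuously rather than over an in-memory list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    endpoint: str
    duration_ms: float
    error: bool


def aggregate_by_endpoint(spans):
    """Request count, error rate, and timing per endpoint."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span.endpoint].append(span)

    summary = {}
    for endpoint, items in grouped.items():
        durations = [s.duration_ms for s in items]
        summary[endpoint] = {
            "requests": len(items),
            "error_rate": sum(1 for s in items if s.error) / len(items),
            "mean_ms": sum(durations) / len(durations),
            "max_ms": max(durations),
        }
    return summary


# Made-up spans for two endpoints.
spans = [
    Span("GET /orders", 100.0, False),
    Span("GET /orders", 300.0, True),
    Span("POST /login", 50.0, False),
]
summary = aggregate_by_endpoint(spans)
```

One table like this per service replaces a dashboard per endpoint, and sorting it by error rate or latency points at where to look first.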
What problems
have Distributed
Tracing helped
solve?
Database Optimizations, Caching, and Concurrency
Exponential
Backoff
Slow Death
of a Service
Rise in Latency + Processing Time
▪ DBO (Hibernate Query) causing O(n log n) rise in latency and
processing time
▪ Application Dashboard indicated an issue with overall latency
increasing
▪ Fix deployed and improvement was observed immediately
Issue Resolved
Caching Solved one problem
… but caused another
▪ We implemented Redis for caching, and processing time went
down
▪ However, we didn’t account for token policies changing, and
tokens suddenly began to expire after 30 seconds
▪ Alerting around error rates for this endpoint raised our
awareness around this issue
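A common guard against this class of bug is to never cache a token for longer than the token itself says it is valid. The sketch below shows that idea with an in-memory dictionary standing in for Redis; it is not the fix described in the talk, and the class, parameter names, and default values are assumptions.

```python
import time


class TokenCache:
    """Cache whose entries never outlive the token they hold."""

    def __init__(self, default_ttl=300.0, safety_margin=5.0, clock=time.monotonic):
        self.default_ttl = default_ttl      # longest we ever cache, in seconds
        self.safety_margin = safety_margin  # refresh slightly before expiry
        self.clock = clock                  # injectable so tests can control time
        self._entries = {}

    def put(self, key, token, expires_in):
        """Store a token; `expires_in` is the lifetime reported by the issuer."""
        ttl = min(self.default_ttl, max(0.0, expires_in - self.safety_margin))
        self._entries[key] = (token, self.clock() + ttl)

    def get(self, key):
        """Return the cached token, or None if it is missing or too old."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        token, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[key]
            return None
        return token
```

With a 30 second token and the assumed 5 second margin, the entry is dropped after 25 seconds, so a policy change on the issuer's side shortens the cache lifetime automatically.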
Context is critical
Metrics are not standalone, they have relationships
Custom
Dashboards
We utilize a mix of Instana, Logz.io and Grafana to manage
our systems
Focusing on Observability
▪ Enables your organization to understand the behavior
of your system
▪ Empowers your engineers to find and fix problems
▪ Enables you to build more reliable systems and ship
software faster
▪ Promotes empathy through understanding,
transparency, and communication.
Want to learn more about monitoring
production microservice apps?
▪ Follow me on twitter for upcoming workshops
@notsureifkevin & @InstanaHQ
▪ Get a free trial of Instana @ https://instana.com



Editor's Notes

  • #3: My name is Kevin. I’ve been using Docker and maintaining distributed application systems in production since 2014. I help organize events in my local area and speak on topics such as devops, automation, culture, and observability.
  • #7: This is what happens when orgs try to: speed up delivery, reduce MTTR, reduce lead times
  • #15: We all understand the game, but we don’t know how to change the rules to gain an advantage
  • #25: This is what happens when orgs try to: speed up delivery, reduce MTTR, reduce lead times
  • #29: Time-sharing computers; computer-guided missiles; Air Defense Network goes online
  • #39: 2. computational complexity and bandwidth requirements of distributed tracing (Lyft, Netflix, Google, etc) 3. These solutions work around inefficient consumers and processing systems (they’re typically not stream based) 4. Unless of course you’re trying to do this yourself, in which case the complexity of running these systems is extremely high, the other condition is you truly are a behemoth, in which case you probably already know most of this stuff already
  • #43: Over 150 containers; spread across multiple hosts / AZs; two separate environments
  • #46: High level overview of all the services in production
  • #47: Single Music has over 30 services in production; we can’t possibly monitor 30 dashboards at a time … or can we?