The MTTR Chronicles
Evolution of SRE Self Service Operations Platform
SRE Director: Jason Wik
Senior SRE: Jayan Kuttagupthan
SRE: Shubham Patil
SRECon Asia 14 June 2019
Agenda
SRE @ VMC on AWS
Square One
βeta
Connecting the Dots
Virtual War Room
Around the Corner
Takeaways
SRE @ VMC on AWS
Hybrid Cloud Challenge
SRE is focused on ensuring services are homogeneous as they scale
Complexity and variability increase risk to the SLAs
SaaS vs Individual Multi Environment Infrastructure at Scale
Storage
SaaS Scales Horizontally
Compute
Networking
SaaS Service continues to scale horizontally with the number of customers
Each customer environment scales vertically
SaaS vs Individual Multi Environment Infrastructure at Scale
Each customer environment is provisioned as a homogeneous environment.
Diagram labels: Multi VPN; DX; CPU Bound; Storage Bound; HCX; DraaS; AWS DX; HCX; Horizons; Multi-Az; Scaled up Mgt. VMs; Customer Environment
Each environment quickly deviates as utilization, features, network access, and resource bounds vary significantly.
MTTR directly correlates to meeting your SLAs and SLOs
Information and automation are key
The right information at the right time
The MTTR Challenge
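To make the link between MTTR and SLA/SLO targets concrete, here is a minimal sketch. It assumes MTTR means mean time to resolve (detection to resolution) and that incidents do not overlap. The incident timestamps and the 30-day window are hypothetical and do not come from the talk.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, resolved_at). Illustration only.
incidents = [
    (datetime(2019, 6, 1, 10, 0), datetime(2019, 6, 1, 10, 45)),
    (datetime(2019, 6, 8, 2, 30), datetime(2019, 6, 8, 3, 0)),
    (datetime(2019, 6, 20, 16, 15), datetime(2019, 6, 20, 17, 0)),
]


def total_downtime(records):
    """Sum of (resolved - detected) over all incidents."""
    return sum((end - start for start, end in records), timedelta())


def mean_time_to_resolve(records):
    """Average incident duration; assumes at least one incident."""
    return total_downtime(records) / len(records)


def availability(records, window):
    """Fraction of the window not spent in an incident (no overlaps assumed)."""
    return 1 - total_downtime(records) / window


mttr = mean_time_to_resolve(incidents)
avail = availability(incidents, timedelta(days=30))
print("MTTR:", mttr)
print("Availability:", round(avail, 5))
```

With the number of incidents held constant, every minute removed from MTTR is a minute added back to availability, which is why resolution time maps so directly onto the SLA.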
North Star
“An Extensible, Dynamic, and Collaborative platform to reduce MTTR and improve operational efficiency for unique and constantly changing environments”
Square One
Problems at Hand
Problems:
Information silos
High Context Switch
Cross team Collaboration
Time consuming Postmortem & RCA
Effects:
Longer MTTR == More Impact to the Customer
High Operational Toil
Cumbersome impact assessment
Past learnings not used effectively
Automated Remediation
Challenges
What data should be displayed?
How to leverage this for users beyond SRE?
What are the information sources?
How should the data be displayed?
βeta: Services under a roof
How to correlate data & make it work @ scale?
Connecting the Dots: Single Pane of Glass
Connecting the Dots ...
Changes:
Cause, Symptom & Action at a single place
Display appropriate data
Intuitive UI: Realtime & Responsive
Integration with BI* & ticketing service
Results:
Increase in user adoption. Easier onboarding
Easier correlation & minimal context switch
Data Driven Decisions & Trend Analysis
~20% improvement in MTTR (on initial release!!)
*BI = Business Intelligence
How to collaborate better?
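The idea of putting cause, symptom, and action in a single place can be sketched as a small correlation step keyed on the customer environment. This is an illustration only, not the platform described in the talk: the record shapes, field names, environment IDs, and runbook entries are all hypothetical.

```python
from collections import defaultdict

# All records, field names, and runbook entries below are hypothetical.
alerts = [
    {"env": "env-a", "symptom": "storage latency high"},
    {"env": "env-b", "symptom": "vpn tunnel down"},
]
changes = [
    {"env": "env-a", "change": "storage policy updated"},
]
runbook = {
    "storage latency high": "review recent storage changes",
    "vpn tunnel down": "restart tunnel and notify network owner",
}


def build_view(alerts, changes, runbook):
    """Group symptoms, recent changes, and suggested actions per environment."""
    view = defaultdict(
        lambda: {"symptoms": [], "recent_changes": [], "actions": []}
    )
    for alert in alerts:
        entry = view[alert["env"]]
        entry["symptoms"].append(alert["symptom"])
        # Fall back to escalation when no runbook entry matches the symptom.
        entry["actions"].append(
            runbook.get(alert["symptom"], "escalate to service owner")
        )
    for change in changes:
        view[change["env"]]["recent_changes"].append(change["change"])
    return dict(view)


view = build_view(alerts, changes, runbook)
for env, entry in sorted(view.items()):
    print(env, entry)
```

Keying everything on the environment means an on-call engineer reads one record per environment instead of switching between separate alerting, change, and runbook tools.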
Virtual War Room
Results
Central Place for Incident Remediation
More focus on development
One integrated platform
Effective Collaboration with Service Owners
Agility, Consistency and Control @ SCALE
Takeaways to Empower your SRE
Avoid information silos
Do not reinvent tools
Display what’s necessary
Automate! Integrate! Collaborate!
Learn from past incidents
Cap operational work
More Code, Less Toil!
"Build the platform for your Organization!"
Thank You
jasonwik@outlook.com
jayank.87@gmail.com
theshubhamp@gmail.com

