Helping operations top-heavy
teams the smart way
Jeff Weiner
Chief Executive Officer
Michael Kehoe
Staff Site Reliability Engineer
Todd Palino
Sr Staff Site Reliability Engineer
This Is The Only Slide You May Need a Picture Of
slideshare.net/ToddPalino slideshare.net/MichaelKehoe3
Michael Kehoe
$ WHOAMI
• Staff Site Reliability Engineer @ LinkedIn
• Production-SRE Team
• Former Network Engineer at the
University of Queensland
Todd Palino
$ WHOAMI
• Senior Staff SRE @ LinkedIn
• Capacity Engineering Team
• Co-Author of Kafka: The Definitive Guide
• Late of VeriSign Infrastructure
Engineering
When Operations Isn’t Perfect
Code Yellow
https://devops.com/code-yellow-when-operations-isnt-perfect/
This talk is not
• How to quickly erase all your technical debt
• How to change your engineering culture
This talk is
• How to identify team anti-patterns
• How to work through high toil
• How to create sustainable workloads
Today’s agenda
1 Background
2 Scenario 1: Traffic-SRE
3 Scenario 2: Kafka-SRE
4 Building A Formula For Success
5 Key Learnings
6 Q&A
Background
Personal Experience in the past two years
ASSISTANCE RENDERED
• Traffic-SRE: Technical Debt/ Resource
Allocation
• Voyager-SRE: Technical Debt
• Capacity War-room
• Espresso-SRE: Reliability
• Kafka-SRE: Capacity and Alert Fatigue
Scenario 1: Traffic-SRE
Problem Statement
Technical Debt
• Written documentation needed
improvement
• Deployment infrastructure needed
investment
• Alert Fatigue
Traffic-SRE
Problem Statement
Resource Allocations
• Backlog of work for clients
• Staff shortage
Scenario 2: Kafka
Code Yellow: Helping operations top-heavy teams the smart way
Problem Statement
Capacity Planning
• Multi-tenant Infrastructure
• No resource controls
• Unclear resource ownership
• Ad-hoc capacity planning
• Sudden 100% increase in traffic
Problem Statement
Alert Fatigue
• Multiple applications overutilized
• No time for proactive work
• Most alerts non-actionable
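One way to put a number on "most alerts non-actionable" is to record, for each page, whether the on-call engineer actually had to do anything, and then compute the actionable fraction per alert. The sketch below is only illustrative: the `Alert` record, its field names, and the 0.5 threshold are assumptions for this example, not something taken from the talk or from LinkedIn's tooling.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical record of one page; the field names are assumptions."""
    name: str
    actionable: bool  # did the on-call engineer need to take action?


def actionable_ratio(alerts):
    """Return {alert name: fraction of firings that required action}."""
    fired = defaultdict(int)
    acted = defaultdict(int)
    for alert in alerts:
        fired[alert.name] += 1
        if alert.actionable:
            acted[alert.name] += 1
    return {name: acted[name] / fired[name] for name in fired}


def noisy_alerts(alerts, threshold=0.5):
    """Sorted names of alerts whose actionable fraction is below threshold."""
    ratios = actionable_ratio(list(alerts))
    return sorted(name for name, ratio in ratios.items() if ratio < threshold)
```

Alerts returned by `noisy_alerts` would be candidates for tuning or removal; the right threshold depends on the team and is a judgment call.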
Building a formula for
success
Code Yellow
Building a formula for success
• Problem Statement: Define the areas that need attacking
• Communication & Partnerships: Communicate expectations with clients & partners
• Exit Criteria: Define success criteria
• Resource Acquisition: Get the help that you require
• Planning: Plan for short-term & long-term
Define the areas that need attacking
Problem Statement
• Admit there is a problem
• Measure the problem
• Understand the problem
• Determine the underlying causes that need to be fixed
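"Measure the problem" can start with something as simple as the share of logged hours that went to toil. The following is a minimal sketch under stated assumptions: the `(category, hours)` input format and the category name `"toil"` are hypothetical and would need to match however a team actually tracks its time.

```python
def toil_fraction(entries):
    """Fraction of logged hours spent on toil.

    `entries` is an iterable of (category, hours) pairs. The category
    name "toil" is an assumption made for this sketch.
    """
    entries = list(entries)  # allow generators; we iterate twice
    total = sum(hours for _, hours in entries)
    if total == 0:
        return 0.0  # nothing logged, so nothing to report
    toil = sum(hours for category, hours in entries if category == "toil")
    return toil / total
```

Tracking this figure week over week gives a baseline to compare against once remediation work begins.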
Define success criteria
Exit Criteria
• Define concrete goals
• Define concrete success criteria
• Measure via an operational metric
• Measure via a project being
completed
• Define timelines for completion
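The bullets above can be captured as data so that progress is checked mechanically rather than by feel. This is a sketch, not the speakers' method: the `ExitCriterion` class, its fields, and the rule that every criterion must be met before exiting are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ExitCriterion:
    """One hypothetical exit criterion: a metric target with a deadline."""
    description: str
    target: float           # value the metric must reach
    current: float          # latest measured value
    due: date               # timeline for completion
    lower_is_better: bool = True

    def met(self) -> bool:
        """True once the current value satisfies the target."""
        if self.lower_is_better:
            return self.current <= self.target
        return self.current >= self.target

    def overdue(self, today: date) -> bool:
        """True if the criterion is still unmet after its due date."""
        return not self.met() and today > self.due


def can_exit(criteria) -> bool:
    """Assumed rule: exit only when at least one criterion exists and all are met."""
    criteria = list(criteria)
    return bool(criteria) and all(c.met() for c in criteria)
```

A completed project can be modelled the same way by using a target of 1.0 with `lower_is_better=False` and setting `current` to the fraction of the project that is done.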
Get the help you require
Resource Acquisition
• Ask other teams for help
• Get dedicated engineers, project managers, or other roles as required
• Set exit-date for resources
Plan for the short-term & long-term
Planning
• Plan out short-term work
• Plan out longer-term projects
• Do they need to be rescheduled?
• Prioritize work that will reduce toil &
burnout (Automation +
Measurement)
Communicate expectations with
clients & partners
Communication & Partnerships
• Communicate problem statement &
exit criteria
• Send regular progress updates
• Ensure that stakeholders
understand delays & expected
outcomes
Key Learnings
• Measure: Measure toil/overhead
• Prioritize: Prioritize efforts to remove overhead/toil
• Communicate: Communicate with partners & teams
Q&A