Lessons Learned from a Parallel Universe
David N. Blank-Edelman
@otterbook
photo: Will Langenberg
SRE for DevOps
David N. Blank-Edelman
@otterbook
DOES16 San Francisco - David Blank-Edelman - Lessons Learned from a Parallel Universe
What Makes SRE, SRE (dramatic recreation)
• Hire only coders
• Have an SLA for your service
• Measure and report performance against the SLA
• Use Error Budgets and gate launches on them
• Common staffing pool for SRE and DEV
• Excess Ops work overflows to DEV team
• Cap SRE operational load at 50%
• Share 5% of ops work with DEV team
• Oncall teams at least 8 people, or 6x2
• Maximum of 2 events per oncall shift
• Post mortem for every event
• Post mortems are blameless and focus on process and technology, not people
SLO monitor decide
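The error-budget bullet can be made concrete with a little arithmetic. What follows is a minimal Python sketch, not something taken from the talk: the class name `SloWindow`, its fields, and the 99.9% target are illustrative assumptions. It treats the budget as the share of requests the objective allows to fail, and gates a launch on whether any of that budget remains.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed over one measurement window (hypothetical type)."""
    slo_target: float      # assumed example: 0.999 means 99.9% of requests should succeed
    total_requests: int
    failed_requests: int

    def allowed_failures(self) -> float:
        # The error budget: the number of requests the objective permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    def budget_remaining(self) -> float:
        # Positive means budget is left to spend; negative means it is overspent.
        return self.allowed_failures() - self.failed_requests

    def launch_allowed(self) -> bool:
        # Gate: launches proceed only while some budget remains.
        return self.budget_remaining() > 0


healthy = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
overspent = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=1_500)

print(healthy.launch_allowed())    # True: about 600 failures of budget remain
print(overspent.launch_allowed())  # False: the budget is overspent by about 500
```

The point of the gate is that the decision follows from the measurement rather than from a negotiation between teams. Real window lengths and targets vary by service.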
Observation #1:
Create virtuous and reinforcing
feedback loops
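One loop on the list is the 50% cap on operational load, with the excess overflowing to the DEV team, alongside the limit of 2 events per on-call shift. A rough Python sketch of those two checks is below. The function names and the hours-based accounting are assumptions made for illustration; the slides give only the percentages and counts.

```python
OPS_LOAD_CAP = 0.50        # from the slide: cap SRE operational load at 50%
MAX_EVENTS_PER_SHIFT = 2   # from the slide: maximum of 2 events per on-call shift


def overflow_hours(ops_hours: float, total_hours: float) -> float:
    """Hours of ops work above the cap, i.e. work that would overflow to DEV.

    Measuring load in hours is an assumption; any consistent unit works.
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return max(0.0, ops_hours - OPS_LOAD_CAP * total_hours)


def shift_over_limit(events_in_shift: int) -> bool:
    """True when a shift saw more events than the stated maximum."""
    return events_in_shift > MAX_EVENTS_PER_SHIFT


# A 40-hour week with 26 hours of ops work exceeds the cap by 6 hours.
print(overflow_hours(26, 40))   # 6.0
print(overflow_hours(15, 40))   # 0.0
print(shift_over_limit(3))      # True
```

Handing the overflow back to the team that wrote the software is what makes the loop reinforcing: the cost of unreliable releases lands where it can be fixed.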
Observation #2:
You can’t fire your way to reliable.
• Airbnb
• Amazon
• Apple
• Baidu
• Dropbox
• Etsy
• Facebook
• GitHub
• LinkedIn
• Microsoft
• Netflix
• Pinterest
• Spotify
• Stack Exchange
• Twitter
• Uber
• Yahoo!
• Yelp
Facebook has Production Engineers
“Production Engineers at Facebook are hybrid
software/systems engineers who ensure that
Facebook's services run smoothly and have the
capacity for future growth. They are embedded in
every one of Facebook's product and infrastructure
teams, and are core participants in every significant
engineering effort underway in the company.”
Observation #3:
There is no one right way to
build an SRE team. But
there are wrong ways.
Facebook has Production Engineers
• ~1-10 ratio
• 18, 24, 36 months with a service
(not a S.W.A.T. team or ops monkeys)
• Also do Bootcamp
• Lead reports to Facebook head of engineering
(same person responsible for Ops and Eng)
Observation #4:
SRE requires management support
at the highest levels.
Production Engineering Maturity Model
• maturity of SW/PE team’s work
• maturity of level of service
• relationship status of PE/SWE teams
• stages:
• bootstrap/build out
• scale/initial deployment
• awesomize
Get in Touch!
www.apcera.com/dnb
dnb@apcera.com
@otterbook

Editor's Notes

  • #21: toil
  • #38: deployment (work in production, instrumentation, canary, new features), awesomize (outliers last 5%, productionized)