Building Disaster Recovery via
Resilience Engineering
Michael Kehoe
Staff SRE - LinkedIn
Tonight’s agenda
1 Introductions
2 What is Resilience Engineering
3 The Problem Statement
4 Project Overview
5 Testing Process
6 Project Outcomes
7 Key Takeaways
8 Q&A
Introduction
Michael Kehoe
/USR/BIN/WHOAMI
• Staff Site Reliability Engineer @ LinkedIn
• Production-SRE Team
• Funny accent = Australian + 4 years
American
• Former Network Engineer at the
University of Queensland
Who are we?
PRODUCTION-SRE TEAM AT LINKEDIN
• Disaster Recovery Planning and
Automation
• Incident Response and Automation
• Visibility Engineering
• Reliability Principles
LinkedIn
EVOLUTION OF THE INFRASTRUCTURE
Timeline: 2003, 2010, 2011, 2013, 2014, 2015
Active & Passive → Active & Active → Multi-colo 3-way Active & Active → Multi-colo n-way Active & Active
LinkedIn
2018
• 4 Data Centers
• 21 PoPs
• 1000+ services
What is Resilience
Engineering?
What is Resilience Engineering?
• Projects that directly demand increased
resilience from our applications and
infrastructure.
• Application Injection Failure
• Infrastructure Injection Failure
• Full Disaster-Recovery Tests
Problem Statement
How often have you heard stories where someone thought they had a
disaster strategy, never tested it, and saw it fail when they needed
it the most?
Problem Statement
• How do we ensure that we always have
disaster recovery ability without incident?
• How do we consistently test for disaster
recovery ability without disrupting the
company?
Project Overview
Project Overview
• Build a process (with Automation) to facilitate disaster recovery
• Operate the process on regular cadence
• Provide reporting on test outcomes to engineering executives
Testing Process
What is Load Testing?
• 5x a week
• Peak hour traffic
• Fixed SLA
LinkedIn Traffic-Tier
[Diagram: Border Router → IPVS → ATS → ATS → Frontend, spanning Edge and Fabric, with Stickyrouting]
LinkedIn Traffic-Tier
[Diagram: Fabric buckets, numbered 1 through 100]
LinkedIn Traffic-Tier
[Diagram: Edge and Fabric with DC1 and DC2. DC1 in cookie; Stickyrouting gets secondary fabric for user; got DC2 as secondary fabric]
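The bucket and secondary-fabric slides can be illustrated with a minimal sketch. It assumes members are hashed into one of the 100 buckets and that each bucket has a primary and a secondary fabric; the hashing scheme and the names `bucket_for`, `route`, `primary`, `secondary`, and `offline_buckets` are hypothetical and not taken from the Stickyrouting service itself.

```python
import hashlib

NUM_BUCKETS = 100  # the slide shows buckets numbered 1 through 100


def bucket_for(member_id: str) -> int:
    """Map a member to a stable bucket in 1..NUM_BUCKETS.

    Hashing the member ID is an assumption made for illustration.
    """
    digest = hashlib.sha256(member_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS + 1


def route(member_id: str,
          primary: dict[int, str],
          secondary: dict[int, str],
          offline_buckets: set[int]) -> str:
    """Return the fabric that should serve this member.

    If the member's bucket is offline in its primary fabric,
    fall back to the bucket's secondary fabric.
    """
    bucket = bucket_for(member_id)
    if bucket in offline_buckets:
        return secondary[bucket]
    return primary[bucket]


if __name__ == "__main__":
    # Hypothetical layout: every bucket is primary in DC1, secondary in DC2.
    primary = {b: "DC1" for b in range(1, NUM_BUCKETS + 1)}
    secondary = {b: "DC2" for b in range(1, NUM_BUCKETS + 1)}
    print(route("member-42", primary, secondary, offline_buckets=set()))
    print(route("member-42", primary, secondary,
                offline_buckets={bucket_for("member-42")}))
```

Taking a bucket offline moves only the members in that bucket, which is what makes shifting traffic in small, controlled increments possible.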
TrafficShift Architecture
[Diagram: Web application, Couchbase, Backend Worker Processes, Salt master, Stickyrouting Service, Fabric Buckets]
Load Testing
[Diagram: Fabric with DC1, DC2, and DC3; traffic percentage: 60%]
Load Testing
Project Outcomes
Benefits of Load-testing
• Capacity Planning
• Identify Bugs
• Confidence
Benefits of Load-testing
CAPACITY PLANNING
• Through this process, we continuously validate our infrastructure
capacity
• This is the best signal we can possibly get since we’re simulating a
real disaster
Benefits of Load-testing
IDENTIFY BUGS
• Some bugs are only found at high load (under duress)
• Helps find inefficiencies that otherwise may not be found until it’s too late
• Gives us clues on how to make our code more resilient to potential failure
Benefits of Load-testing
CONFIDENCE
• Through load-testing, we’ve built confidence in our disaster recovery
strategy
• We understand exactly:
• What process to follow
• How long it takes to avert disaster
• What risks are associated with a disaster incident
Key Takeaways
Key Takeaways
• Resilience Engineering is a must for
LinkedIn
• Design infrastructure to facilitate disaster
recovery
• Disaster-test regularly to avoid surprises
• Automate your testing process to reduce
engagement time
Q&A
SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engineering

Editor's Notes

  • #21: Anil: TrafficShift is a two-part application. A web application provides an easy way for engineers to create planned and emergency offline plans. We leverage Couchbase as our key/value persistence store. Python backend worker processes talk to the Salt master via the Salt API and instruct the stickyrouting service to turn buckets online and offline. We leverage this toolset to run load tests, or stress tests, of our datacenters. Uff, that’s a lot of talk about how to mitigate issues by doing a trafficshift. But if you observe keenly, we are migrating live traffic across datacenters, so why not leverage the same mechanism to stress test a datacenter? How awesome is that? We don’t stress test a single service; we stress the whole system. I am gonna talk about load testing next.
  • #22: Anil: As you can see, by turning a precise number of buckets offline in US-West and US-East, we can reroute that extra traffic to the target datacenter. We do this in a controlled manner, in steps, until the threshold level of 50% is reached. If an alert fires for any reason during the stress test, our TrafficShift tool acknowledges it, automatically rebalances the site traffic, and sends out the stress test report to SREs.
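The stepped ramp described in note #22 can be sketched as a small loop. This is an illustration under stated assumptions, not the TrafficShift implementation: `run_load_test`, `shift_to`, and `alert_firing` are hypothetical names, percentages are whole numbers, and the step size and starting share are made up for the example.

```python
from typing import Callable


def run_load_test(start_pct: int,
                  target_pct: int,
                  step_pct: int,
                  shift_to: Callable[[int], None],
                  alert_firing: Callable[[], bool]) -> dict:
    """Ramp the target datacenter's traffic share up in steps.

    shift_to(pct) is a hypothetical hook that moves traffic so the target
    datacenter serves pct percent (for example, by turning buckets offline
    elsewhere). alert_firing() is a hypothetical hook that reports whether
    any alert is active. On an alert, traffic is rebalanced to start_pct.
    """
    steps = []
    current = start_pct
    while current < target_pct:
        current = min(current + step_pct, target_pct)
        shift_to(current)
        steps.append(current)
        if alert_firing():
            shift_to(start_pct)  # rebalance site traffic
            return {"status": "aborted", "reached_pct": current, "steps": steps}
    return {"status": "completed", "reached_pct": current, "steps": steps}


if __name__ == "__main__":
    shifts: list[int] = []
    report = run_load_test(start_pct=33, target_pct=50, step_pct=5,
                           shift_to=shifts.append,
                           alert_firing=lambda: False)
    print(report)  # this report is what would be sent to SREs
```

Passing the shift and alert checks in as callables keeps the ramp logic testable without touching real traffic; in a completed run the caller decides how long to hold the target share before restoring it.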