What Can Be Learned
About Application Resiliency
When Your Datacenter Burns Down?
Lessons from a real-world disaster
Peter Corless
Technical Marketing Manager, ScyllaDB
+ Listen to customer stories
+ Write blogs & case studies
+ Play (and design) strategy & roleplaying games
10 March 2021, 0:47 AM — Disaster Strikes
SBG2 datacenter of OVHcloud
was entirely destroyed by fire
What would you do if your datacenter was on fire?
Massive, Systemic Disaster
...as if millions of alarms triggered at once and were suddenly silenced
OVHcloud Fire
OVHcloud’s Strasbourg SBG2 Datacenter engulfed
in flames.
(Image: SDIS du Bas Rhin)
About Kiwi.com
➔ Virtual global supercarrier
➔ Seamless travel experience
➔ Connecting “A” to “B”
➔ Virtual interlining
A Database Built for High Availability
10 of 30 servers are suddenly unavailable
Latencies briefly rise until unavailable servers are taken out of the cluster
Requests per second per server; note how some drop towards zero, then blip out of existence
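The arithmetic behind losing 10 of 30 servers can be sketched as follows. This is a minimal illustration, assuming traffic stays constant and is spread evenly across the surviving servers; the function names and the `ceiling` parameter are hypothetical and do not come from the slides.

```python
def load_multiplier(total_servers: int, lost_servers: int) -> float:
    """Factor by which per-server load grows if total traffic stays constant."""
    surviving = total_servers - lost_servers
    if surviving <= 0:
        raise ValueError("no servers left to carry the load")
    return total_servers / surviving


def max_safe_utilization(total_servers: int, lost_servers: int,
                         ceiling: float = 1.0) -> float:
    """Highest pre-failure utilization that keeps survivors at or below `ceiling`."""
    return ceiling / load_multiplier(total_servers, lost_servers)


if __name__ == "__main__":
    # Figures from the slide: 30 servers, 10 suddenly unavailable.
    print(load_multiplier(30, 10))
    print(round(max_safe_utilization(30, 10), 3))
```

Under these assumptions each surviving server carries 1.5 times its previous load, so a cluster that wants to absorb the loss of a third of its servers needs to run at no more than about two thirds of its usable capacity beforehand.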
“This is a non-stop flight…”
Geographically Distributed Datacenters
>500km
>250km
>550km
Main database locations
Datacenter Damage Assessment
OVHcloud’s Strasbourg SBG2 Datacenter the next morning. (Image via Twitter)
➔ Strasbourg datacenter impact
■ SBG2 totally consumed
■ SBG1 4 of 12 rooms gutted
■ SBG3 & SBG4 proactively taken offline
➔ Internet impact (as per Netcraft)
■ 3.6 million websites
■ 464,000 domains
■ 1 in 50 sites in all of .fr TLD
Kiwi.com’s Timeline of Fire
00:47 CET Fire breaks out in OVHcloud Strasbourg SBG2
01:12 CET Kiwi.com nodes in Strasbourg start falling off the cluster
01:15 CET All 10 Strasbourg nodes offline; traffic diverted to 2 other Kiwi.com datacenters (20 servers remaining)
02:23 CET Production operational; some services around the main database still need manual tweaks
08:54 CET Tweaks deployed, we are fully operational
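To make the gaps between these events easier to read, here is a small sketch that converts the timeline into minutes elapsed since the fire broke out. The timestamps are copied from the slide; the shortened labels and the function name are illustrative only, and the code assumes all events fall on the same calendar day.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all CET, 10 March 2021).
TIMELINE = [
    ("00:47", "Fire breaks out in SBG2"),
    ("01:12", "Strasbourg nodes start falling off the cluster"),
    ("01:15", "All 10 Strasbourg nodes offline; traffic diverted"),
    ("02:23", "Production operational"),
    ("08:54", "Tweaks deployed; fully operational"),
]


def minutes_since_start(timeline):
    """Return (minutes elapsed since the first event, label) for each event."""
    start = datetime.strptime(timeline[0][0], "%H:%M")
    result = []
    for clock, label in timeline:
        elapsed = datetime.strptime(clock, "%H:%M") - start
        result.append((int(elapsed.total_seconds() // 60), label))
    return result


if __name__ == "__main__":
    for minutes, label in minutes_since_start(TIMELINE):
        print(f"T+{minutes:>3} min  {label}")
```

Read this way, the Strasbourg nodes were all gone 28 minutes after the fire started, production was operational again at about T+96 minutes, and the remaining manual tweaks were deployed at about T+487 minutes.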
Basically...
ScyllaDB
+ Reimagined the distributed NoSQL database
+ Close-to-the-hardware design, written in C++
+ Open source, enterprise & DBaaS
+ From the creators of the KVM hypervisor
Winner: InfoWorld Technology of the Year
What we Learned About Application Resiliency When the Data Center Burned Down
Used across industries
AdTech/MarTech, Multimedia, Finance/FinTech, Security, Ride-hailing/Food Delivery, Social, Retail, Travel, IoT, Logistics/Transportation
– Consistency Options –
Eventual consistency to linearizability
– Presences –
1 to 10+ datacenter replications
– Volume –
Multi-petabyte
– Throughput –
1 billion OPS
– Vertical Scalability –
1 to 416 vCPUs
– Horizontal Scalability –
1,000-node cluster
– Availability –
1 to 10+ replicas within a datacenter
– Unlimited –
Cell sizes and partition width
Grows with your business & your data
Learn more ▸ scylladb.com
Lessons Learned
+ Architect for disaster from the start
+ Eliminate any and all single points of failure (SPOFs = bad)
+ Datacenter-aware topology & distribution; multi-datacenter replication
+ Consider sufficient excess capacity in case a whole datacenter goes down
+ Consider contingency plans for multicloud architectures or alternate standby sites [and standby hardware]
+ Do regular chaos monkey work
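One way to reason about why multi-datacenter replication matters is to check whether a majority quorum of replicas survives the loss of a whole site. The sketch below uses the majority rule common to Cassandra-style databases (quorum = floor(replicas / 2) + 1). The replication factors and datacenter names are assumptions made for illustration; the slides do not state Kiwi.com's actual settings.

```python
def quorum(replicas: int) -> int:
    """Majority quorum: more than half of the replicas."""
    return replicas // 2 + 1


def survives_dc_loss(rf_per_dc: dict, lost_dc: str) -> bool:
    """True if a cluster-wide quorum is still reachable after losing `lost_dc`."""
    total = sum(rf_per_dc.values())
    remaining = total - rf_per_dc[lost_dc]
    return remaining >= quorum(total)


if __name__ == "__main__":
    # Hypothetical layouts: replication factor 3 in each datacenter.
    three_sites = {"dc_a": 3, "dc_b": 3, "dc_c": 3}
    two_sites = {"dc_a": 3, "dc_b": 3}
    print(survives_dc_loss(three_sites, "dc_a"))  # 6 replicas left, quorum is 5
    print(survives_dc_loss(two_sites, "dc_a"))    # 3 replicas left, quorum is 4
```

With three equally weighted sites, losing any one still leaves a majority of replicas; with only two, losing either one does not, which is one argument for spreading a cluster over at least three locations.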
Consider “Torpedo Testing”
+ Consider running a “torpedo test” during proofs of concept & scalability testing
+ Take down a node, then another, then another…
+ How long does it take for a cluster to rebalance after losing a node?
+ How much of a latency/throughput hit do you take after losing a node?
+ After how many “torpedoes” does your system just “sink”? [Go non-linear]
+ How long does it take for a single node to be restored?
+ After taking n torpedoes, how long does a system take to restore back to full operations?
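Before running a torpedo test against real hardware, it can help to estimate on paper where the cluster is likely to sink. The sketch below is a toy model only: it assumes constant total load spread evenly over the remaining nodes and a fixed per-node capacity, and it ignores rebalancing and replication traffic. The function name and all figures are hypothetical.

```python
def torpedoes_to_sink(nodes: int, total_load: float, node_capacity: float) -> int:
    """Number of node losses at which per-node load first exceeds capacity.

    Assumes the cluster is healthy with all nodes up. Losing every node
    always counts as sinking.
    """
    for torpedoes in range(1, nodes + 1):
        remaining = nodes - torpedoes
        if remaining == 0 or total_load / remaining > node_capacity:
            return torpedoes
    return nodes


if __name__ == "__main__":
    # Hypothetical figures: 30 nodes, 1.5M ops/s in total, 100k ops/s per node.
    hits = torpedoes_to_sink(nodes=30, total_load=1_500_000, node_capacity=100_000)
    print(f"Cluster sinks on torpedo number {hits}")
```

Per-node load grows as 1/remaining, so each additional loss hurts more than the previous one; that is the non-linear behaviour the slide hints at. A real torpedo test would replace this model with measured latency and throughput after each node is taken down.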
A New Hope [For DevOps]
United States
2445 Faber St, Suite #200
Palo Alto, CA USA 94303
Israel
Maskit 4
Herzliya, Israel 4673304
www.scylladb.com
@scylladb
Learn NoSQL for free!
university.scylladb.com
@petercorless
