Surviving Region Failures with CockroachDB

Cockroach Labs

Cockroach Labs is the creator of CockroachDB, the most highly evolved cloud-native, distributed SQL database.

Published Jun 17, 2025

Greg Turnquist here from Cockroach Labs. I recently sat down with my colleague Felipe Gutierrez on our SELECT STAR podcast to talk about something that keeps every database administrator up at night: surviving regional failures.

If you've dealt with outages or just want to understand how CockroachDB handles high availability and data consistency when entire zones or regions go down, this conversation covered the essentials. As always, I had a great time getting super geeky on database design with Felipe! Here's what we discussed.

Building Blocks: Ranges, Replicas, and Leaseholders

CockroachDB's architecture starts with ranges — contiguous chunks of your table data. These ranges get replicated, typically three times, for redundancy. Each set of replicas has a leaseholder that coordinates reads and writes for that range.

The key insight here is that an UPDATE is committed as soon as the leaseholder hears back from a quorum of replicas, in this case two out of three. Your application doesn't wait for the slowest node.
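
To see why the slowest replica does not set the pace, here is a minimal Python sketch of the quorum arithmetic. It is an illustration only, not CockroachDB's replication code; the function names and the latency figures are hypothetical.

```python
def quorum_size(replicas: int) -> int:
    """Smallest majority of a replica set, e.g. 2 of 3 or 3 of 5."""
    return replicas // 2 + 1


def commit_latency_ms(ack_latencies_ms: list[float]) -> float:
    """Time until a quorum has acknowledged a write.

    ack_latencies_ms holds one (made-up) acknowledgment time per replica,
    including the leaseholder's own local write.
    """
    needed = quorum_size(len(ack_latencies_ms))
    # The write commits when the `needed`-th fastest acknowledgment arrives.
    return sorted(ack_latencies_ms)[needed - 1]


# Leaseholder acks locally in 1 ms, one follower in 4 ms, a slow one in 60 ms.
latency = commit_latency_ms([1.0, 4.0, 60.0])
print(quorum_size(3), latency)  # 2 4.0
```

The 60 ms replica still receives the write eventually; it just isn't on the critical path for the commit.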

But there's a critical placement rule: Don't put two replicas in the same rack. Try to imagine what would happen if the power failed on that rack! With two replicas out of the picture, that range would no longer be able to complete UPDATEs.

That’s why you need to spread replicas across zones or even regions.
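
The rack rule can be checked mechanically. The sketch below is a simplified model with made-up rack names, not how CockroachDB actually places replicas: it asks whether a majority survives the loss of whichever location holds the most replicas.

```python
from collections import Counter


def survives_any_single_failure(replica_locations: list[str]) -> bool:
    """True if losing any one location still leaves a majority of replicas."""
    total = len(replica_locations)
    majority = total // 2 + 1
    per_location = Counter(replica_locations)
    # The worst case is losing the location that holds the most replicas.
    return total - max(per_location.values()) >= majority


same_rack = ["rack-a", "rack-a", "rack-b"]   # two replicas share a rack
spread_out = ["rack-a", "rack-b", "rack-c"]  # one replica per rack

print(survives_any_single_failure(same_rack))   # False
print(survives_any_single_failure(spread_out))  # True
```

The same check works if the labels are zones or regions instead of racks.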

Multi-Region Setup: More Than Just Redundancy

We walked through a practical example using a fictional ride-sharing database called "Mover Rides," distributed across U.S. East, Central, and West regions. Each region had three availability zones, each running a node.

Turning this into a multi-region database involves first designating the primary region by typing

ALTER DATABASE ... SET PRIMARY REGION

Then you add additional regions by typing

ALTER DATABASE ... ADD REGION

When you do that, all voting replicas move to the primary region — which is good for performance, but a vulnerability if that region fails.

Zone-level Survivability: Understanding the Impact

Zone-level survivability is your baseline defense. It spreads voting replicas across all three availability zones in your primary region. You can survive one zone outage, but you can’t survive two or more.

Sound good?

Zone-level survivability is great for zone-level outages. But what happens if an entire region goes down?

That’s the bigger challenge.

Region-Level Survivability: The Five-Replica Solution

For true region survivability, you run

ALTER DATABASE ... SURVIVE REGION FAILURE

CockroachDB automatically increases your replica count to five. With five replicas placed so that no region holds more than two, losing an entire region still leaves three of five, which is enough to maintain quorum.
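
The jump from three replicas to five follows from majority arithmetic. As a rough sketch (the function name is hypothetical), the smallest replica count that keeps a majority after losing a given number of replicas is:

```python
def replicas_needed(tolerated_losses: int) -> int:
    """Smallest replica count that keeps a majority after losing that many."""
    return 2 * tolerated_losses + 1


for lost in (1, 2):
    total = replicas_needed(lost)
    quorum = total // 2 + 1
    print(f"lose {lost}: need {total}, {total - lost} remain, quorum is {quorum}")
# lose 1: need 3, 2 remain, quorum is 2
# lose 2: need 5, 3 remain, quorum is 3
```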

In an example with 3 regions, us-east (primary), us-central, and us-west, the distribution looks like this:

2 replicas in the primary region (each in a separate AZ)
2 in the central region
1 in the western region

If any one region goes down, there are still enough replicas in the other regions to make quorum and process UPDATEs. If the failed region held a leaseholder, CockroachDB elects a new leaseholder from the remaining replicas and continues operating.

Basically, CockroachDB sees that you want a region-level configuration, and it handles the replica placement for you!
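
To compare the two survival goals side by side, here is a small Python model of the placements described above. The zone letters are invented for illustration, and the check only counts surviving voters; it says nothing about how CockroachDB rebalances or elects leaseholders.

```python
# Hypothetical voter placements as (region, zone) pairs. The zone names are
# made up; the region counts follow the examples in the text.
ZONE_SURVIVAL = [
    ("us-east", "a"), ("us-east", "b"), ("us-east", "c"),
]
REGION_SURVIVAL = [
    ("us-east", "a"), ("us-east", "b"),
    ("us-central", "a"), ("us-central", "b"),
    ("us-west", "a"),
]


def has_quorum_after(placement, failed_region, failed_zone=None):
    """Check whether a majority of voters survive a zone or region outage.

    With failed_zone=None the whole region is treated as down.
    """
    survivors = [
        (region, zone)
        for region, zone in placement
        if region != failed_region
        or (failed_zone is not None and zone != failed_zone)
    ]
    return len(survivors) >= len(placement) // 2 + 1


print(has_quorum_after(ZONE_SURVIVAL, "us-east", "a"))  # True: 2 of 3 remain
print(has_quorum_after(ZONE_SURVIVAL, "us-east"))       # False: 0 of 3 remain
for region in ("us-east", "us-central", "us-west"):
    print(region, has_quorum_after(REGION_SURVIVAL, region))  # True each time
```

Under this model the three-replica placement rides out a single zone outage but not the loss of its region, while the 2-2-1 placement keeps at least three of five voters no matter which region fails.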

Now, one thing we didn’t get into was the performance consequence of all the leaseholders moving to the primary region for zone-level survivability. And what happens if you upgrade to region-level survivability?

That’s a very real concern. What happens if you’re in the western region attempting to coordinate an UPDATE with a leaseholder in the eastern region? CockroachDB has a solution…which we’ll cover in a future article!

Training and Resources

If this seems complex, Cockroach University offers the Multi-Region Resilience hands-on course.

We are going to start having public training very soon. Starting mid-June, we're launching daily live sessions covering everything from basics to advanced topics like point-in-time recovery, snapshots, and incremental backups. Stay tuned!

The Bottom Line

CockroachDB isn't just built to be resilient to single nodes going down on occasion — it's designed to expect failure and keep running when things go wrong. Whether you're building a ride-sharing app or a financial platform, you have options for surviving the inevitable outages.

The database you just can't kill? That's not marketing speak. That's reality.

Check out Cockroach University and our upcoming training sessions if you want to dig deeper into distributed SQL and high availability patterns.

— Greg Turnquist, Sr. Staff Technical Content Engineer, Cockroach Labs
