Surviving Region Failures with CockroachDB
Greg Turnquist here from Cockroach Labs. I recently sat down with my colleague Felipe Gutierrez on our SELECT STAR podcast to talk about something that keeps every database administrator up at night: surviving regional failures.
If you've dealt with outages or just want to understand how CockroachDB handles high availability and data consistency when entire zones or regions go down, this conversation covered the essentials. As always, I had a great time getting super geeky on database design with Felipe! Here's what we discussed.
Building Blocks: Ranges, Replicas, and Leaseholders
CockroachDB's architecture starts with ranges — contiguous chunks of your table data. These ranges get replicated, typically three times, for redundancy. Each set of replicas has a leaseholder that coordinates reads and writes for that range.
The key insight here is that a write, such as an UPDATE, is committed as soon as the leaseholder gets responses from a quorum of replicas, in this case two out of three. Your application doesn't wait for the slowest node.
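To make the quorum arithmetic concrete, here is the usual majority rule written out. The symbols are my own shorthand, not something from the podcast: n is the number of voting replicas, q is the quorum size, and f is how many replicas a range can lose and still commit writes.

```latex
\[
q = \left\lfloor \frac{n}{2} \right\rfloor + 1,
\qquad
f = n - q
\]
\[
n = 3 \;\Rightarrow\; q = 2, \quad f = 1
\]
```

With three replicas, a range tolerates the loss of exactly one of them.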
But there's a critical placement rule: Don't put two replicas in the same rack. Try to imagine what would happen if the power failed on that rack! With two replicas out of the picture, that range would no longer be able to complete UPDATEs.
That’s why you need to spread replicas across zones or even regions.
Multi-Region Setup: More Than Just Redundancy
We walked through a practical example using a fictional ride-sharing database called "Mover Rides," distributed across U.S. East, Central, and West regions. Each region had three availability zones, each running a node.
Turning this into a multi-regional database involves first designating the primary region by typing
ALTER DATABASE ... SET PRIMARY REGION
Then you add additional regions by typing
ALTER DATABASE ... ADD REGION
When you do that, all voting replicas move to the primary region — which is good for performance, but a vulnerability if that region fails.
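As a rough sketch, here is what those statements could look like with the elided parts filled in. The database name mover_rides is made up for illustration, and the region names are borrowed from the example regions in this article. In a real cluster they have to match the regions your nodes were started with.

```sql
-- Hypothetical database name; region names are assumed to match
-- the localities the cluster's nodes were started with.
ALTER DATABASE mover_rides SET PRIMARY REGION "us-east";
ALTER DATABASE mover_rides ADD REGION "us-central";
ALTER DATABASE mover_rides ADD REGION "us-west";

-- List the regions now attached to the database.
SHOW REGIONS FROM DATABASE mover_rides;
```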
Zone-level Survivability: Understanding the Impact
Zone-level survivability is your baseline defense. It spreads voting replicas across all three availability zones in your primary region. You can survive one zone outage, but you can’t survive two or more.
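Zone-level survivability is the default goal for a multi-region database, so you normally don't need to set it. If you want to state it explicitly or confirm it, something like the following should work (mover_rides is again a hypothetical database name):

```sql
-- Optional: zone failure is already the default survival goal.
ALTER DATABASE mover_rides SURVIVE ZONE FAILURE;

-- Confirm which survival goal the database is using.
SHOW SURVIVAL GOAL FROM DATABASE mover_rides;
```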
Sound good?
Zone-level survivability is great for zone-level outages. But what happens if an entire region goes down?
That’s the bigger challenge.
Region-Level Survivability: The Five-Replica Solution
For true region survivability, you run
ALTER DATABASE ... SURVIVE REGION FAILURE
CockroachDB automatically increases the replica count to five, which is what's needed to lose an entire region (up to two replicas) and still maintain quorum.
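The same majority rule explains why five replicas are enough. This is a sketch of the arithmetic, assuming no single region holds more than two of the five voting replicas; n and q are my own shorthand for the replica count and the quorum size.

```latex
\[
n = 5 \;\Rightarrow\; q = \left\lfloor \frac{5}{2} \right\rfloor + 1 = 3
\]
\[
\underbrace{5}_{\text{voting replicas}} - \underbrace{2}_{\text{lost with one region}} = 3 \;\ge\; q
\]
```

Three replicas survive the regional outage, and three is exactly the quorum, so the range keeps accepting writes.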
In an example with three regions, us-east (primary), us-central, and us-west, the five replicas are spread out so that no single region holds more than two of them.
If any region goes down, there are still enough replicas in the other regions to make quorum and process UPDATEs. If the failed region happened to hold a leaseholder, CockroachDB elects a new leaseholder from the remaining replicas and continues operating.
Basically, CockroachDB sees that you want a region level configuration... and it will handle that!
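Here is a minimal sketch of switching a database to region-level survivability and then checking the result. The database name mover_rides is hypothetical, and I'm deliberately not showing expected output, since it depends on your cluster and version.

```sql
-- Requires at least three regions on the database.
ALTER DATABASE mover_rides SURVIVE REGION FAILURE;

-- Confirm the survival goal, then inspect the replica settings
-- CockroachDB generated for the database.
SHOW SURVIVAL GOAL FROM DATABASE mover_rides;
SHOW ZONE CONFIGURATION FROM DATABASE mover_rides;
```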
One thing we didn't delve into was the performance consequence of all the leaseholders moving to the primary region under zone-level survivability. And what happens to performance if you upgrade to region-level survivability?
That’s a very real concern. What happens if you’re in the western region attempting to coordinate an UPDATE with a leaseholder in the eastern region? CockroachDB has a solution…which we’ll cover in a future article!
Training and Resources
If this seems complex, Cockroach University offers the Multi-Region Resilience hands-on course.
We're also starting public training very soon. Beginning mid-June, we're launching daily live sessions covering everything from the basics to advanced topics like point-in-time recovery, snapshots, and incremental backups. Stay tuned!
The Bottom Line
CockroachDB isn't just built to be resilient to single nodes going down on occasion — it's designed to expect failure and keep running when things go wrong. Whether you're building a ride-sharing app or a financial platform, you have options for surviving the inevitable outages.
The database you just can't kill? That's not marketing speak. That's reality.
Check out Cockroach University and our upcoming training sessions if you want to dig deeper into distributed SQL and high availability patterns.
- Greg Turnquist, Sr. Staff Technical Content Engineer, Cockroach Labs