The Odd Wisdom Behind 100% Availability in Cloud-Native Architecture
<Created by Gemini>


In resilient cloud-native systems, odd numbers hold a surprisingly critical place. Having designed and deployed highly available platforms across Azure and AWS, I keep returning to one fundamental principle: odd numbers aren't just mathematical quirks; they are operational enablers.

Take ZooKeeper, a cornerstone of distributed coordination. ZooKeeper uses a quorum-based mechanism in which more than half of the nodes must agree before the ensemble can proceed. Running an odd number of nodes (3, 5, or 7) is a deliberate choice, because it gives the most fault tolerance per node. A 3-node ensemble can survive 1 failure, and a 5-node ensemble can withstand 2. An even number, say 4, still needs 3 nodes for a majority, so it tolerates only 1 failure. That is no better than 3 nodes, and it adds cost and complexity.
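The arithmetic behind those numbers can be sketched in a few lines of Python. This is only an illustration of strict-majority voting, not ZooKeeper's own code; the function names quorum_size and tolerated_failures are made up for this example.

```python
def quorum_size(nodes: int) -> int:
    """Smallest strict majority of an ensemble with `nodes` voting members."""
    if nodes < 1:
        raise ValueError("an ensemble needs at least one node")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return nodes - quorum_size(nodes)


# Print quorum size and fault tolerance for ensembles of 3 to 7 nodes.
for n in range(3, 8):
    print(f"{n} nodes: quorum={quorum_size(n)}, tolerates={tolerated_failures(n)}")
```

Running the loop shows that going from 3 to 4 nodes, or from 5 to 6, raises the quorum size but not the number of failures the ensemble can absorb.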

Cloud platforms reflect the same thinking. AWS Regions are divided into Availability Zones (AZs), and Azure offers availability zones as well as availability sets, which spread VMs across fault and update domains. A classic design places workloads in 3 AZs, with auto-scaling groups and health probes keeping them balanced and self-healing, so the service stays up even if one zone fails.
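As a rough sketch of the three-zone pattern, the Python below spreads instances round-robin across zones and reports the worst-case capacity left after losing one zone. The zone names and both helper functions are hypothetical. Real placement is handled by the platform's scaling service, not by code like this.

```python
from itertools import cycle


def spread_across_zones(instance_count: int, zones: list[str]) -> dict[str, int]:
    """Assign instances to zones round-robin; return instances per zone."""
    if not zones:
        raise ValueError("at least one zone is required")
    placement = {zone: 0 for zone in zones}
    # zip stops after instance_count assignments even though cycle is endless.
    for _, zone in zip(range(instance_count), cycle(zones)):
        placement[zone] += 1
    return placement


def capacity_after_zone_loss(placement: dict[str, int]) -> int:
    """Worst case: instances remaining when the fullest zone fails."""
    return sum(placement.values()) - max(placement.values())


# Hypothetical zone names, used only for illustration.
placement = spread_across_zones(6, ["az-a", "az-b", "az-c"])
print(placement, "survivors after one zone fails:", capacity_after_zone_loss(placement))
```

With six instances in three zones, four remain after any single zone failure. A three-member quorum service placed one member per zone keeps two of its three votes in the same situation.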

In platform design, this principle of odd-numbered redundancy maps directly to a minimal compute footprint with maximum availability. A 3-node Kubernetes control plane with etcd running on the control plane nodes, for example, keeps its etcd quorum through the loss of one node, which balances cost-efficiency and resilience. The same holds for other consensus-based services, such as Raft-based databases, which work best with an odd number of voting members.

Odd numbers offer a balance: enough to handle failure, minimal to avoid waste. As architects, our job isn’t to overprovision—it’s to design just enough to stay online, self-repairing, and always available. Oddly enough, that number is usually odd.

The Strategic Importance of Odd-Numbered Node Configurations in Cloud-Native Architectures

In cloud-native architecture, ensuring high availability (HA) and resilience is paramount. A fundamental yet often overlooked principle in achieving this is the strategic use of odd-numbered node configurations. This approach is pivotal in maintaining quorum-based consensus, preventing split-brain scenarios, and optimising resource utilisation.


Understanding Quorum in Distributed Systems

Distributed systems rely on quorum mechanisms to maintain consistency and coordination among nodes. A quorum is a strict majority of nodes that must agree on a decision: floor(n/2) + 1, where n is the total number of voting nodes. With an odd n, a network partition into two groups can never end in a tie, so exactly one side holds a majority. Adding one node to an odd-sized cluster raises the quorum size without raising the number of failures the cluster can tolerate.
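Written out, with q(n) for the quorum size and f(n) for the number of tolerated failures (both symbols are introduced here for illustration), the odd and even cases compare as follows:

```latex
\[
q(n) = \left\lfloor \frac{n}{2} \right\rfloor + 1,
\qquad
f(n) = n - q(n) = \left\lceil \frac{n}{2} \right\rceil - 1
\]
\[
n = 2k+1:\quad q = k+1,\; f = k
\qquad\qquad
n = 2k+2:\quad q = k+2,\; f = k
\]
```

Both cluster sizes tolerate k failures, so the extra node in the even case buys no additional fault tolerance.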


AWS: Implementing Odd-Numbered Configurations

Amazon OpenSearch Service recommends deploying three dedicated master nodes to maintain quorum and elect a new master in case of a failure. Adding a fourth node doesn't enhance fault tolerance and may introduce unnecessary complexity.

Similarly, in Amazon RDS for MySQL with Group Replication, a seven-node cluster requires at least four nodes to achieve quorum. If a network partition divides the cluster into groups of four and three, the group of four keeps quorum and can continue operating, while the group of three cannot. Because seven is odd, no split into two groups can end in a tie.
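The partition scenario can be checked with a small Python sketch. It models only strict-majority voting and is not how Group Replication is implemented; partition_with_quorum is a hypothetical helper.

```python
from typing import Optional, Sequence


def partition_with_quorum(cluster_size: int, partition_sizes: Sequence[int]) -> Optional[int]:
    """Return the index of the partition holding a strict majority, or None."""
    if sum(partition_sizes) > cluster_size:
        raise ValueError("partitions cannot contain more nodes than the cluster")
    majority = cluster_size // 2 + 1
    for index, size in enumerate(partition_sizes):
        if size >= majority:
            return index
    return None


# Seven nodes split 4/3: the first group (index 0) holds the majority.
print(partition_with_quorum(7, [4, 3]))
# Four nodes split 2/2: neither side reaches the majority of three.
print(partition_with_quorum(4, [2, 2]))
```

The second call illustrates the tie that an even-sized cluster can hit: both halves lose quorum at once.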


Azure: Leveraging Odd-Numbered Nodes for High Availability

In the Enterprise tier of Azure Cache for Redis, a cluster runs on an odd number of server nodes so that it can form a quorum, with three nodes by default. This configuration ensures that the cache remains available even if one node fails.

In SQL Server on Azure VMs, it's recommended to have an odd number of votes, with a minimum of three quorum votes. In two-node clusters, adding a quorum witness provides an additional vote, maintaining an odd number and ensuring cluster availability.
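The effect of a witness on a two-node cluster can be sketched with a simplified static-majority model. This is an assumption-laden illustration: it treats the witness as always reachable and ignores features such as dynamic quorum in Windows Server Failover Clustering. The function name cluster_stays_online is hypothetical.

```python
def cluster_stays_online(node_votes: int, witness: bool, failed_nodes: int) -> bool:
    """True if surviving votes are a strict majority of all configured votes.

    Assumes each node has one vote and that the witness, if present,
    remains reachable.
    """
    if not 0 <= failed_nodes <= node_votes:
        raise ValueError("failed_nodes must be between 0 and node_votes")
    total_votes = node_votes + (1 if witness else 0)
    surviving_votes = total_votes - failed_nodes
    return surviving_votes * 2 > total_votes


# Two nodes, one fails: without a witness 1 of 2 votes is not a majority.
print(cluster_stays_online(2, witness=False, failed_nodes=1))
# With a witness, 2 of 3 votes survive and the cluster keeps quorum.
print(cluster_stays_online(2, witness=True, failed_nodes=1))
```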

