What is a Disaster in terms of cloud computing?
A disaster refers to any unexpected event or situation that disrupts the normal operations of cloud-based systems, applications, or services. These disruptions can significantly impact data availability, system performance, and overall business continuity.
Types of Disasters:
Natural Disasters: Earthquakes, hurricanes, floods, or wildfires affecting physical data centers.
Hardware Failures: Issues like server crashes, disk failures, or network hardware malfunctions.
Cyberattacks: DDoS attacks, ransomware, or data breaches compromising system integrity and security.
Human Errors: Accidental deletion of critical data, incorrect configurations, or misuse of cloud services.
Power Outages: Loss of electricity supply to data centers or network infrastructure.
Network Failures: Disruptions in internet connectivity or issues with network providers.
What is Disaster Recovery?
Cloud providers and organizations adopt Disaster Recovery (DR) strategies to minimize the impact of such events. Key components include:
Backup Solutions: Regular snapshots and offsite backups to ensure data availability.
Redundancy: Using multiple availability zones, regions, or hybrid setups for failover.
Replication: Real-time replication of data and applications to secondary systems.
Automated Failover: Switching traffic to a healthy system or region when the primary system fails.
Testing and Drills: Routine simulation of disaster scenarios to evaluate DR readiness.
Disaster Recovery Plans: Comprehensive documentation of recovery procedures and RTO/RPO goals.
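To make the Backup Solutions component above concrete, here is a minimal boto3 sketch of taking an on-demand RDS snapshot as part of a backup routine; the instance name, snapshot naming scheme, and region are illustrative assumptions rather than a prescribed setup.

```python
# Minimal sketch: take an on-demand RDS snapshot as part of a backup routine.
# The instance identifier, snapshot name, and region are illustrative.
import boto3
from datetime import datetime, timezone

rds = boto3.client("rds", region_name="us-east-1")

snapshot_id = "orders-db-" + datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")

# Create the snapshot for a hypothetical instance called "orders-db".
rds.create_db_snapshot(
    DBInstanceIdentifier="orders-db",
    DBSnapshotIdentifier=snapshot_id,
)

# Block until the snapshot is available, so it can be relied on for DR.
rds.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier=snapshot_id)
print(f"Snapshot {snapshot_id} is available")
```

In practice, AWS Backup or RDS automated snapshots would usually handle the scheduling rather than a custom script like this.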
What are disaster recovery strategies in AWS?
1. Backup and Restore:
How It Works:
Data and application backups are created regularly and stored in a secure, durable service such as Amazon S3, Amazon EFS, or AWS Backup. During a disaster, the backups are restored onto infrastructure provisioned in the desired region.
Challenges:
Recovery time can be long, because infrastructure must be provisioned and data restored before the application is available again.
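One way to shorten that recovery time is to keep copies of the backups already in the DR region. A hedged sketch of a cross-region snapshot copy with boto3; the snapshot ARN, account ID, and regions are placeholders:

```python
# Minimal sketch: copy an RDS snapshot from the primary region (us-east-1)
# into the DR region (us-west-2) so it can be restored there during a disaster.
# The snapshot ARN and account ID are placeholders.
import boto3

dr_rds = boto3.client("rds", region_name="us-west-2")  # client in the DR region

dr_rds.copy_db_snapshot(
    SourceDBSnapshotIdentifier="arn:aws:rds:us-east-1:123456789012:snapshot:orders-db-20250101",
    TargetDBSnapshotIdentifier="orders-db-20250101-dr",
    SourceRegion="us-east-1",  # lets boto3 generate the pre-signed URL for the cross-region copy
    CopyTags=True,
)
```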
2. Pilot Light:
How It Works:
A minimal version of the application is always running in a secondary region (critical services only).
During a disaster, the infrastructure is scaled up to full production in the secondary region.
Challenges:
Configuration drift between primary and secondary environments can lead to inconsistencies.
Requires automation to scale quickly during failover.
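The data layer is typically the part that stays "lit". A minimal sketch of creating a cross-region RDS read replica with boto3, assuming a hypothetical source instance, instance class, and regions:

```python
# Minimal sketch: keep the data layer continuously replicated into the DR region
# while compute stays minimal. Identifiers, class, and regions are illustrative.
import boto3

dr_rds = boto3.client("rds", region_name="us-west-2")

dr_rds.create_db_instance_read_replica(
    DBInstanceIdentifier="orders-db-replica",
    SourceDBInstanceIdentifier="arn:aws:rds:us-east-1:123456789012:db:orders-db",
    DBInstanceClass="db.t3.medium",  # a smaller class is typical while in standby
    SourceRegion="us-east-1",
)
```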
3. Warm Standby:
How It Works:
A scaled-down version of the entire production environment runs in a secondary region.
In case of a disaster, the secondary environment is scaled up to handle full production traffic.
Challenges:
Costs are higher compared to Pilot Light because of the partially running environment.
Scaling up during a disaster requires time and resources.
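The scale-up step can be scripted so the standby environment is brought to the same capacity as production during failover. A minimal sketch, assuming hypothetical Auto Scaling group names in each region:

```python
# Minimal sketch: scale the DR Auto Scaling group up to match the primary's capacity.
# Group names and regions are illustrative.
import boto3

primary_asg = boto3.client("autoscaling", region_name="us-east-1")
dr_asg = boto3.client("autoscaling", region_name="us-west-2")

# Read the current capacity of the production web tier.
primary = primary_asg.describe_auto_scaling_groups(
    AutoScalingGroupNames=["web-tier-primary"]
)["AutoScalingGroups"][0]

# Apply the same limits and desired capacity to the warm standby group.
dr_asg.update_auto_scaling_group(
    AutoScalingGroupName="web-tier-dr",
    MinSize=primary["MinSize"],
    MaxSize=primary["MaxSize"],
    DesiredCapacity=primary["DesiredCapacity"],
)
```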
4. Multi-Site Active-Active:
How It Works:
Applications run simultaneously in two or more regions, sharing the load.
Traffic is distributed across regions using Route 53 latency-based routing or AWS Global Accelerator.
Challenges:
Complex configuration and synchronization between regions.
High operational costs due to fully active infrastructure in multiple regions.
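As one possible traffic-distribution setup, the sketch below upserts Route 53 latency-based alias records that point the same hostname at an ALB in each region. The hosted zone ID, domain name, ALB DNS names, and ALB zone IDs are placeholders:

```python
# Minimal sketch: latency-based routing records pointing one hostname at ALBs in
# two regions. All IDs and DNS names below are placeholders.
import boto3

route53 = boto3.client("route53")

def latency_record(region, alb_dns, alb_zone_id):
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "app.example.com",
            "Type": "A",
            "SetIdentifier": f"app-{region}",
            "Region": region,
            "AliasTarget": {
                "HostedZoneId": alb_zone_id,    # the ALB's region-specific hosted zone ID
                "DNSName": alb_dns,
                "EvaluateTargetHealth": True,   # let Route 53 skip unhealthy endpoints
            },
        },
    }

route53.change_resource_record_sets(
    HostedZoneId="Z_EXAMPLE_ZONE",              # your public hosted zone
    ChangeBatch={"Changes": [
        latency_record("us-east-1", "alb-use1.us-east-1.elb.amazonaws.com", "Z_ALB_ZONE_USE1"),
        latency_record("us-west-2", "alb-usw2.us-west-2.elb.amazonaws.com", "Z_ALB_ZONE_USW2"),
    ]},
)
```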
What decides which strategy you should follow for your application?
Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are critical metrics in disaster recovery planning. They define how quickly you need to recover and how much data loss is acceptable during a disaster.
Recovery Time Objective (RTO): The maximum amount of time an application or system can be down after a failure.
Recovery Point Objective (RPO): The maximum amount of data loss that is acceptable, measured in time.
Trade-offs:
Lower RTO & RPO: Require more complex and expensive setups (e.g., Active-Active). Suitable for mission-critical applications where downtime or data loss is unacceptable.
Higher RTO & RPO: Cost-efficient, simpler to manage (e.g., Backup & Restore). Best for non-critical workloads where occasional downtime and data loss are tolerable.
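The mapping from RTO/RPO targets to a strategy can be expressed as a simple decision rule. The thresholds below are purely illustrative; real choices also weigh cost, complexity, and compliance requirements:

```python
# Illustrative helper: pick a DR strategy from RTO/RPO targets in minutes.
# The thresholds are assumptions for the sake of the example, not AWS guidance.
def choose_dr_strategy(rto_minutes: float, rpo_minutes: float) -> str:
    if rto_minutes < 1 and rpo_minutes < 1:
        return "Multi-Site Active-Active"
    if rto_minutes <= 30 and rpo_minutes <= 15:
        return "Warm Standby"
    if rto_minutes <= 240 and rpo_minutes <= 60:
        return "Pilot Light"
    return "Backup and Restore"

print(choose_dr_strategy(rto_minutes=20, rpo_minutes=5))  # -> Warm Standby
```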
Example 3-tier architectures for each DR strategy:
1. Backup & Restore Strategy:
Architecture:
User -> Route 53 -> ALB -> EC2 with Auto Scaling (web layer) in private subnets -> Internal ALB -> EC2 with Auto Scaling (application layer) in private subnets -> RDS MySQL database in private subnets
Backup Process:
RDS MySQL data is periodically backed up using automated snapshots and stored in Amazon S3.
EC2 instance configurations, application code, and dependencies are backed up as AMIs or stored in version-controlled repositories such as Git.
Restore Process:
In a disaster scenario:
Restore RDS MySQL from the stored snapshots.
Deploy EC2 instances using the saved AMIs or launch configurations.
Recreate the ALBs and update Route 53 to point to the restored environment.
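The database step of that restore might look like the following boto3 sketch, assuming a snapshot already copied into the DR region and hypothetical identifiers:

```python
# Minimal sketch: restore the database tier in the DR region from a copied snapshot.
# Identifiers, instance class, and subnet group are illustrative.
import boto3

dr_rds = boto3.client("rds", region_name="us-west-2")

dr_rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="orders-db-restored",
    DBSnapshotIdentifier="orders-db-20250101-dr",   # snapshot previously copied to this region
    DBInstanceClass="db.m5.large",
    DBSubnetGroupName="dr-private-db-subnets",
    MultiAZ=True,
)

# Wait until the restored instance is ready before cutting traffic over.
dr_rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="orders-db-restored")
```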
2. Pilot Light Strategy
Architecture:
User -> Route 53 -> ALB -> EC2 (minimal instances, web layer) -> Internal ALB -> EC2 (minimal instances, application layer) -> RDS read replica (DR region)
Key Details:
Maintain a minimal environment in the DR region: EC2 instances are pre-provisioned but not scaled up, and RDS is configured with a cross-region read replica.
Failover Process:
Promote the RDS read replica to the primary database.
Scale up EC2 Auto Scaling groups for both layers.
Update Route 53 records to direct traffic to the DR region's ALB.
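The first of those steps, promoting the read replica, is a single API call. A minimal sketch with an illustrative replica name:

```python
# Minimal sketch: promote the cross-region read replica so the DR region has a
# writable primary. The replica identifier is illustrative.
import boto3

dr_rds = boto3.client("rds", region_name="us-west-2")

dr_rds.promote_read_replica(
    DBInstanceIdentifier="orders-db-replica",
    BackupRetentionPeriod=7,   # turn on automated backups for the new primary
)

dr_rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="orders-db-replica")
print("Replica promoted; the application tier can now write to the DR database")
```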
3. Warm Standby Strategy
Architecture:
User -> Route 53 -> ALB -> EC2 (scaled-down Auto Scaling, web layer) -> Internal ALB -> EC2 (scaled-down Auto Scaling, application layer) -> RDS read replica (DR region)
Key Details:
A scaled-down version of the production environment is always running in the DR region: Both ALBs and EC2 instances are deployed with lower capacity. RDS uses a read replica for database synchronization.
Failover Process:
Promote the RDS read replica to the primary database.
Scale up EC2 Auto Scaling groups to match production capacity.
Update Route 53 records to point traffic to the DR region's ALB.
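The DNS cutover step might look like this hedged sketch, which upserts an alias record for a hypothetical hostname so it resolves to the DR region's ALB:

```python
# Minimal sketch: repoint the public hostname at the DR region's ALB after scale-up.
# The hosted zone IDs, hostname, and ALB DNS name are placeholders.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z_EXAMPLE_ZONE",
    ChangeBatch={"Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "app.example.com",
            "Type": "A",
            "AliasTarget": {
                "HostedZoneId": "Z_ALB_ZONE_USW2",   # the DR ALB's hosted zone ID
                "DNSName": "alb-dr.us-west-2.elb.amazonaws.com",
                "EvaluateTargetHealth": False,
            },
        },
    }]},
)
```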
4. Active-Active Strategy
Architecture:
User -> Route 53 (latency-based routing) -> ALB (in both regions) -> EC2 with Auto Scaling (web layer, in both regions) -> Internal ALB (in both regions) -> EC2 with Auto Scaling (application layer, in both regions) -> RDS with Aurora Global Database or active-active replication
Key Details:
Fully operational setups in both regions: Traffic is distributed using Route 53 latency-based routing. RDS uses Aurora Global Database or Active-Active replication for real-time synchronization between regions.
Failover Process:
If one region fails, Route 53 automatically redirects traffic to the healthy region.
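That automatic redirection relies on health checks attached to each region's routing record. A minimal sketch of creating one such health check with boto3; the endpoint and path are illustrative, and the returned ID would then be set as HealthCheckId on that region's latency record:

```python
# Minimal sketch: a Route 53 health check against one region's ALB endpoint.
# The domain name and health path are placeholders.
import boto3
import uuid

route53 = boto3.client("route53")

check = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),   # idempotency token
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "alb-use1.us-east-1.elb.amazonaws.com",
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

# Attach this ID as HealthCheckId on the us-east-1 latency record so Route 53
# stops routing there when the checks fail.
print("Health check ID:", check["HealthCheck"]["Id"])
```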
Conclusion:
Implementing a robust Disaster Recovery (DR) strategy is no longer optional in today’s cloud-centric world—it's a necessity. By leveraging the power of cloud-native tools and services, such as multi-region deployments, real-time data replication, and automated failover mechanisms, businesses can build resilient systems that quickly recover from disasters with minimal impact.
However, DR is not a one-time setup but an ongoing process of testing, refining, and adapting strategies to evolving risks. A well-implemented DR plan not only protects your operations but also fosters customer trust and confidence, giving your organization a competitive edge. As the saying goes, “Failing to plan is planning to fail,” and with cloud-based DR solutions, planning for the unexpected has never been more achievable.