AWS DRS: What Is It and How Does It Work?
Nowadays every organization, from small business to enterprise, wants to be protected against disasters. There are several ways to achieve this, such as backups, snapshots, or replication. However, one of the most effective ways to protect data from loss is to deploy a disaster recovery (DR) solution.
Like other services, DR can be deployed and implemented in the cloud. In AWS, this is provided by AWS Elastic Disaster Recovery (AWS DRS). Now, let's examine what it is and how we can deploy it in an AWS environment.
AWS Elastic Disaster Recovery minimizes downtime and data loss with the fast and reliable recovery of on-premise or cloud-based applications using affordable storage, minimal computing, and point-in-time recovery.
Using AWS DRS to replicate on-premises or cloud-based applications running on supported operating systems can increase IT resilience. Configuration is easy, and we can deploy DR in a few steps.
Before deploying DRS step by step, I want to talk briefly about DR strategies and their benefits. According to the AWS documentation, four strategies can be deployed in cloud environments.
As I mentioned, each of these strategies has advantages and disadvantages. We can see all of them below:
Actually, it's simple: if you want a lower RPO/RTO, you have to pay more 😊. Let's dive deeper: what are RPO and RTO?
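RPO stands for Recovery Point Objective and refers to the maximum acceptable data loss, measured as the time since the last recovery point. RTO stands for Recovery Time Objective and refers to the maximum acceptable downtime.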
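To make the two terms concrete, here is a minimal sketch that measures the data-loss window (RPO) and the downtime (RTO) for one incident. The timeline values are hypothetical and only serve as an illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recovery_point: datetime, disaster_time: datetime) -> timedelta:
    """Data-loss window: time between the last usable recovery point and the disaster."""
    return disaster_time - last_recovery_point


def achieved_rto(disaster_time: datetime, service_restored: datetime) -> timedelta:
    """Downtime: time between the disaster and the service being back online."""
    return service_restored - disaster_time


# Hypothetical incident timeline (not taken from a real environment)
last_snapshot = datetime(2024, 1, 1, 11, 50)
disaster = datetime(2024, 1, 1, 12, 0)
restored = datetime(2024, 1, 1, 12, 45)

print("RPO (data loss):", achieved_rpo(last_snapshot, disaster))  # 0:10:00
print("RTO (downtime):", achieved_rto(disaster, restored))        # 0:45:00
```

In this example, up to 10 minutes of data is lost and the service is down for 45 minutes.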
As I mentioned, there are four main DR strategies on AWS. Let's dive deeper into each of them:
Backup and Restore:
Backup and restore is a suitable approach for mitigating data loss or corruption. This approach can also be used to mitigate against a regional disaster by replicating data to other AWS Regions or to mitigate the lack of redundancy for workloads deployed to a single Availability Zone.
Pilot Light:
With the pilot light approach, you replicate your data from one Region to another and provision a copy of your core workload infrastructure. Resources required to support data replication and backup, such as databases and object storage, are always on. Other elements, such as application servers, are loaded with application code and configurations, but are "switched off" and are only used during testing or when disaster recovery failover is invoked.
Warm Standby:
The warm standby approach involves ensuring that there is a scaled-down, but fully functional, copy of your production environment in another Region. This approach extends the pilot light concept and decreases the time to recovery because your workload is always on in another Region. This approach also allows you to more easily perform testing or implement continuous testing to increase confidence in your ability to recover from a disaster.
Multi-site Active/Active:
You can run your workload simultaneously in multiple Regions as part of a multi-site active/active or hot standby active/passive strategy. Multi-site active/active serves traffic from all regions to which it is deployed, whereas hot standby serves traffic only from a single region, and the other Region(s) are only used for disaster recovery. With a multi-site active/active approach, users can access your workload in any of the Regions in which it is deployed. This approach is the most complex and costly approach to disaster recovery, but it can reduce your recovery time to near zero for most disasters with the correct technology choices and implementation (however data corruption may need to rely on backups, which usually results in a non-zero recovery point).
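The trade-off between the four strategies can be sketched as "pick the cheapest strategy that still meets the RTO target". The RTO figures and cost ranks below are my own rough, illustrative assumptions (orders of magnitude only), not numbers from AWS, so replace them with values measured for your workload.

```python
# (strategy name, assumed typical RTO in minutes, relative cost rank)
# All numbers are illustrative assumptions, not AWS-published values.
STRATEGIES = [
    ("Backup and Restore", 240, 1),
    ("Pilot Light", 30, 2),
    ("Warm Standby", 5, 3),
    ("Multi-site Active/Active", 0, 4),
]


def cheapest_strategy(target_rto_minutes: float) -> str:
    """Return the lowest-cost strategy whose assumed RTO meets the target."""
    if target_rto_minutes < 0:
        raise ValueError("target RTO must not be negative")
    candidates = [s for s in STRATEGIES if s[1] <= target_rto_minutes]
    # Multi-site has an assumed RTO of 0, so candidates is never empty here.
    return min(candidates, key=lambda s: s[2])[0]


print(cheapest_strategy(60))  # Pilot Light
```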
OK, it's time to deploy DRS on AWS. As I mentioned before, DRS is normally used to replicate an on-premises workload to the cloud or, if the workload already runs in the cloud, to replicate it to another Region so that it stays safe during a disaster. To deploy Elastic Disaster Recovery, type “drs” in the console search box. After that, you can start the configuration by clicking Configure and initialize.
From a networking perspective, we have two VPCs, a source VPC and a destination VPC, each with two private subnets and two public subnets.
As you can see, initializing AWS DRS consists of six steps. In the first step, we have to define the staging area subnet and the replication server instance type. So what are they? The staging area subnet is the subnet in which replication servers and conversion servers are launched.
By default, AWS Elastic Disaster Recovery uses the default subnet in your AWS account (the subnet created for the default VPC when you first create your account). The best practice is to create and choose a dedicated subnet as the staging area subnet. We will proceed with one of the private subnets of the source VPC.
The replication server instance type is the default EC2 instance type used for replication servers. The recommended best practice is not to change it unless there is a business need to do so. We will continue with the t3 instance type.
The second step is about specifying volumes and security groups. For each disk on an added source server, an identically sized EBS volume is attached to a replication server, and each replication server can handle the replication of disks from multiple source servers. You can select the EBS volume type based on your workload. For example, you can select the gp3 SSD volume type for SQL or CRM servers that need fast responses.
For better security, it is also recommended to enable EBS encryption, for example with the default KMS key.
As for the security group: for ongoing data replication to work, inbound TCP port 1500 must be allowed (to receive the data sent from your source servers to the replication servers), as well as outbound TCP port 443 so that the replication servers can communicate with the AWS DRS service. When the “Always use the default AWS Elastic Disaster Recovery security group” option is checked, AWS Elastic Disaster Recovery constantly monitors that this group exists and is correctly configured on the replication servers.
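If you manage your own security group instead of the default one, a quick sanity check like the following can catch a missing rule. This is only a sketch: the rule tuples are a simplified representation I made up for the example (not the EC2 API format), and I am assuming that port 443 is an outbound rule from the replication servers to the service.

```python
# Required rules as (direction, protocol, port).
# Assumption: 1500 is inbound to the replication servers, 443 is outbound.
REQUIRED_RULES = {("inbound", "tcp", 1500), ("outbound", "tcp", 443)}


def missing_rules(rules):
    """Return the required rules that are absent from the given rule list."""
    return sorted(REQUIRED_RULES - set(rules))


# Hypothetical security group that only allows the replication port
custom_group = [("inbound", "tcp", 1500)]
print(missing_rules(custom_group))  # [('outbound', 'tcp', 443)]
```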
In the third step, we configure data routing and throttling for better data transfer and replication. If you are using a VPN or Direct Connect to transfer data to the cloud, it's better to use private IP addresses for replication. You can also throttle the network bandwidth used for replication (in Mbps). I don't need any limit on the network side, so I will continue with the default settings.
The point-in-time (PIT) policy is a disaster recovery feature that allows launching an instance from a snapshot captured at a specific point in time. As source servers are replicated, snapshots are taken over time. This section lets you configure a retention policy that determines which snapshots are no longer required after a defined duration, so you can change it based on your retention requirements.
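The idea behind a retention policy can be sketched in a few lines. This is a simplified model with a single retention window, assumed only for illustration; the real PIT policy in DRS is configured in the console and is not implemented like this.

```python
from datetime import datetime, timedelta


def expired_snapshots(snapshot_times, now, retention):
    """Return the snapshot timestamps that fall outside the retention window."""
    cutoff = now - retention
    return [t for t in snapshot_times if t < cutoff]


# Hypothetical snapshot history with a 3-day retention window
snapshots = [datetime(2024, 1, 1), datetime(2024, 1, 5), datetime(2024, 1, 7)]
now = datetime(2024, 1, 8)

for ts in expired_snapshots(snapshots, now, timedelta(days=3)):
    print("no longer required:", ts.date())
```

Here only the snapshot from 1 January is older than the cutoff, so it is the only one reported as no longer required.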
In the fourth step, we configure the default EC2 launch settings. Here we decide whether to enable instance type right-sizing, depending on the situation and the instance configuration.
It can be Active (basic): AWS DRS will select the instance type.
It can be Active (in-aws): AWS DRS will periodically update the EC2 launch template based on the hardware configuration of the EC2 source server.
And finally, it can be Inactive: if you want to select the instance type yourself based on your plan or architecture, choose this option so that DRS does not change the instance type. We will continue with this option.
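To illustrate what "right-sizing" means in general, here is a small sketch that picks the smallest instance type covering a source server's vCPU and memory. This is not the algorithm AWS DRS uses; the short catalog and the selection rule are my own assumptions for the example.

```python
# Small catalog of (instance type, vCPUs, memory in GiB), ordered from small to large.
CATALOG = [
    ("t3.small", 2, 2),
    ("t3.medium", 2, 4),
    ("t3.large", 2, 8),
    ("t3.xlarge", 4, 16),
    ("t3.2xlarge", 8, 32),
]


def right_size(vcpus: int, memory_gib: float):
    """Return the first (smallest) type that fits, or None if nothing in the catalog fits."""
    for name, cpu, mem in CATALOG:
        if cpu >= vcpus and mem >= memory_gib:
            return name
    return None


print(right_size(2, 3))  # t3.medium
```

With right-sizing set to Inactive, as in this walkthrough, you make this choice yourself in the launch template.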
An important note about licensing: the license setting applies only to Windows operating systems. If you want to use the BYOL option, you need a Dedicated Host for the placement of your workloads. If you choose an AWS-provided license instead, AWS will supply the license for the launched instance.
In the fifth step, you have to provide all of the necessary options for the destination VPC.
Finally, review all of the options you selected and click Configure and initialize.
OK, I hope you enjoyed reading this part. Next, we will deploy agents on the source servers. But first, let's review the DRS architecture again.
We will run the AWS DRS agent on supported operating systems.
All of the networking and security prerequisites will be met if the necessary ports, TCP 443 and 1500, are open. After that, the data on the server's disks will start to replicate to the AWS environment over Direct Connect (DX) or a site-to-site (S2S) VPN.
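Before installing the agent, it can help to confirm from the source server that these ports are actually reachable. The following is a minimal sketch using only the Python standard library; the host names in the usage comments are assumptions and placeholders that you should replace with your own endpoint and replication server address.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example usage (host names are assumptions, adjust to your Region and setup):
# tcp_port_open("drs.us-east-1.amazonaws.com", 443)   # service endpoint
# tcp_port_open("<replication-server-ip>", 1500)      # replication traffic
```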
All of the data will remain in the Staging area until you need to recover.
My simulated scenario uses two VPCs in the same account. Each VPC has two private and two public subnets, and connects to the internet through a NAT gateway.
I launched the source server from the Amazon Linux 2023 AMI and connected to it through Session Manager to run the commands.
For complete instructions on installing the DRS agent, you can follow this link. In my case, I will follow the Linux instructions.
To install the agent on Linux, you need an IAM user with the “AWSElasticDisasterRecoveryAgentInstallationPolicy” policy attached and an access key and secret access key generated for that user.
Run these commands:
1. wget -O ./aws-replication-installer-init https://aws-elastic-disaster-recovery-us-east-1.s3.us-east-1.amazonaws.com/latest/linux/aws-replication-installer-init
(You can replace us-east-1 in this URL with your workload's Region)
2. chmod +x aws-replication-installer-init; sudo ./aws-replication-installer-init
After running this command, you will be prompted for the destination Region, the user's access key, and the user's secret access key. After that, the agent will ask you to select which disks to replicate to the destination VPC. You can select all of the disks to replicate the whole server.
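Since the Region appears twice in the download URL, a tiny helper can build the URL for any Region and avoid copy-and-paste mistakes. This is just a sketch based on the us-east-1 URL pattern used above; I am assuming the same pattern applies to other Regions.

```python
def installer_url(region: str) -> str:
    """Build the Linux agent installer URL for a Region, following the us-east-1 pattern."""
    return (
        f"https://aws-elastic-disaster-recovery-{region}"
        f".s3.{region}.amazonaws.com/latest/linux/aws-replication-installer-init"
    )


print(installer_url("eu-west-1"))
```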
OK, after about 5 minutes, the source server is ready and synced with AWS DRS.
How long it takes to sync all of the data on a server to AWS DRS depends on disk size, disk type, and network bandwidth.
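For a rough feeling of how disk size and bandwidth interact, here is a back-of-the-envelope estimate. It is a lower bound under simplifying assumptions: it ignores compression, disk read speed, protocol overhead, and ongoing writes on the source server.

```python
def estimated_sync_hours(disk_gib: float, bandwidth_mbps: float) -> float:
    """Estimate the initial sync time in hours for a disk of the given size."""
    bits_to_send = disk_gib * 1024**3 * 8           # GiB -> bits
    seconds = bits_to_send / (bandwidth_mbps * 1_000_000)
    return seconds / 3600


# Hypothetical example: a 100 GiB disk over a 100 Mbps link
print(f"{estimated_sync_hours(100, 100):.2f} hours")  # about 2.39 hours
```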
We can review all of the information about the source server by clicking on it, as shown below.
After the initial sync of the source server completes, it is time to initiate a recovery job. We first run a recovery drill to test the replicated server, and then we can finalize by initiating an actual recovery.
To initiate the recovery drill, we need a point-in-time snapshot of the server to recover to; these snapshots are created according to the PIT policy. As you can see below, the server is ready for recovery and we can initiate the drill.
Select which snapshot you want to deploy and click on initiate drill.
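The same drill can also be started programmatically. The sketch below builds the request for the boto3 `start_recovery` call of the `drs` client; the server and snapshot IDs are placeholders, and the `start_drill` helper is my own wrapper that needs boto3 and valid AWS credentials to actually run.

```python
def build_drill_request(source_server_id: str, snapshot_id=None) -> dict:
    """Build keyword arguments for a DRS drill on one source server."""
    server = {"sourceServerID": source_server_id}
    if snapshot_id is not None:
        # Recover to a specific point-in-time snapshot instead of the latest data.
        server["recoverySnapshotID"] = snapshot_id
    return {"isDrill": True, "sourceServers": [server]}


def start_drill(request: dict, region: str = "us-east-1"):
    """Send the request to AWS DRS (requires boto3 and AWS credentials)."""
    import boto3  # third-party; imported here so the builder works without it
    return boto3.client("drs", region_name=region).start_recovery(**request)


# Placeholder IDs, for illustration only
print(build_drill_request("s-EXAMPLE", "pit-EXAMPLE"))
```

Keeping the request builder separate from the API call makes it easy to review what will be launched before anything is sent to AWS.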
As you can see below, the drill completed successfully, and we can now test the drill instance.
Key Takeaways:
As I mentioned above, every organization today, whether small or enterprise, needs a safe place to replicate its data. AWS DRS can be a suitable solution for this purpose. It offers a simple way to protect your live data against disasters and gives you confidence in the safety of your workloads.