October 10, 2018
Disaster Recovery for AWS
Amazon Web Services (AWS) is among the most popular cloud platforms available, to the extent that it has set the corporate standard for clouds. Disaster recovery in AWS resembles the disaster recovery of virtual machines running on physical servers in some respects and differs from it in others. Today’s blog post explains the features of disaster recovery in AWS, the role of a disaster recovery plan, and how to protect your virtual environment in the Amazon cloud.
What Factors Can Cause Failure in AWS?
Amazon provides cloud services with reliability and availability of more than 99%. At first glance, that level of reliability seems high enough to leave running instances “as is”, but this approach is not recommended, for the reasons explained below. Human error is one factor: one of your employees may configure software improperly or accidentally delete files, which can bring down an application or an entire EC2 instance. Another factor is a virus attack: viruses can disrupt the proper functioning of applications and corrupt important files. For these reasons, you should prepare for AWS disaster recovery.
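To put the availability figure in perspective, a quick back-of-the-envelope calculation shows how much downtime a given percentage still permits. The 99% value comes from the paragraph above; the other two percentages are illustrative assumptions, not AWS commitments.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime (in hours) that an availability level still permits."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# 99.0 is the figure from the text; 99.9 and 99.99 are added only for comparison.
for pct in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(pct)
    print(f"{pct}% availability -> up to {hours:.2f} hours of downtime per year")
```

Even at 99%, more than three and a half days of downtime per year fit within the stated availability, and none of this covers data lost to human error or malware.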
DR Plan as a Prerequisite for Fast Disaster Recovery
A disaster recovery plan (DR plan) is a documented set of measures that must be performed in the case of a disaster. A DR plan for AWS includes structured instructions on how EC2 instances, operating systems, applications, and files can be recovered in different scenarios. The backup policy, the software used, the recovery destination, and other components of a DR plan are prerequisites for fast recovery when a disaster occurs. Rather than hoping that a disaster never strikes your virtual environment in AWS, prepare for it beforehand.
Differences Between a Snapshot and a Backup
First, let’s go over some key terminology.
EC2 is Amazon Elastic Compute Cloud. EC2 provides a scalable virtual infrastructure in the cloud. You can increase or decrease the amount of RAM, CPU, and disk space available to your EC2 instances as your needs change.
S3 is Amazon Simple Storage Service, designed for storing files, while the purpose of EBS is to provide storage for EC2 instances.
Amazon Machine Image (AMI) is an image that includes all the information needed to launch an Amazon EC2 instance. You can use existing AMIs as templates when launching instances, or create your own customized AMIs and use them as templates.
An Elastic Block Store (EBS) volume is the analog of a virtual hard disk in the Amazon cloud environment. EBS volumes are highly scalable: you can increase the volume size when you need more space.
An EBS snapshot is a point-in-time copy of the data on an EBS volume, stored in Amazon S3. The first snapshot contains all data; all subsequent snapshots are incremental and contain only the changes made since the previous snapshot. Deleting one or more old snapshots does not affect the other snapshots, and data can still be recovered from them to the corresponding points in time. You can use data stored in snapshots for AWS disaster recovery, but be aware that this method has disadvantages:
- EBS snapshots are not application-aware. To get an application-consistent snapshot, the EC2 instance has to be shut down (or the application quiesced) before the snapshot is taken.
- EBS snapshots are stored only in Amazon S3 and cannot be copied to a storage target outside the Amazon cloud.
- Snapshots can consume a large amount of space in the cloud, which results in a higher price for this service.
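The incremental behavior described above (a full first snapshot, later snapshots that store only changes, and safe deletion of older snapshots) can be illustrated with a toy model. This is a conceptual sketch only: the `SnapshotStore` class and its methods are hypothetical names, and the code does not reflect how EBS is actually implemented.

```python
import hashlib


class SnapshotStore:
    """Toy model of incremental snapshots; not the actual EBS implementation."""

    def __init__(self):
        self.blocks = {}     # content hash -> block data, each block stored once
        self.snapshots = {}  # snapshot name -> {block index: content hash}

    def take(self, name, volume):
        """Record a point-in-time view; return the number of newly stored blocks."""
        new_blocks = 0
        index = {}
        for i, data in enumerate(volume):
            digest = hashlib.sha256(data).hexdigest()
            if digest not in self.blocks:
                self.blocks[digest] = data
                new_blocks += 1
            index[i] = digest
        self.snapshots[name] = index
        return new_blocks

    def delete(self, name):
        """Drop a snapshot and any blocks that no remaining snapshot references."""
        del self.snapshots[name]
        still_used = {d for idx in self.snapshots.values() for d in idx.values()}
        for digest in list(self.blocks):
            if digest not in still_used:
                del self.blocks[digest]

    def restore(self, name):
        """Rebuild the full volume as it was when the snapshot was taken."""
        index = self.snapshots[name]
        return [self.blocks[index[i]] for i in range(len(index))]


store = SnapshotStore()
volume = [b"boot", b"app-v1", b"data-v1"]
first = store.take("snap-1", volume)    # first snapshot stores every block
volume[2] = b"data-v2"
second = store.take("snap-2", volume)   # second snapshot stores only the changed block
store.delete("snap-1")                  # removing the older snapshot...
restored = store.restore("snap-2")      # ...leaves the newer one fully restorable
```

In this model, deleting `snap-1` discards only the block that `snap-2` no longer needs, which is why the newer snapshot stays complete.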
An alternative to the built-in snapshot feature is a specialized backup solution that supports disaster recovery in AWS. The advantages of using such solutions are:
- Different targets to store backups
- Application-aware backup
- Deduplication and compression
- Flexible recovery options
- Convenient scheduling and retention settings
Read the white paper “AWS Snapshot vs Backup” to learn more about the differences.
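Deduplication and compression, listed among the advantages above, can be sketched in a few lines. The example assumes fixed-size blocks, SHA-256 for block identity, and zlib for compression; real backup products may use different block sizes and algorithms, and the function names here are hypothetical.

```python
import hashlib
import zlib


def dedup_and_compress(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and store each unique block once, compressed."""
    store = {}   # block hash -> compressed block
    recipe = []  # ordered hashes needed to rebuild the original data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = zlib.compress(block)
        recipe.append(digest)
    return store, recipe


def rebuild(store, recipe) -> bytes:
    """Rebuild the original bytes from the block store and the recipe."""
    return b"".join(zlib.decompress(store[d]) for d in recipe)


# Sample data with many repeated blocks, a pattern that is common in disk images.
original = (b"A" * 4096) * 50 + (b"B" * 4096) * 50
store, recipe = dedup_and_compress(original)
stored_bytes = sum(len(chunk) for chunk in store.values())
print(f"original: {len(original)} bytes, stored: {stored_bytes} bytes, "
      f"unique blocks: {len(store)}")
```

The recipe preserves the order of blocks, so the backup can be restored byte for byte even though each repeated block is stored only once.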
Backup vs. Having a DR Plan
If you regularly create backups of your Amazon EC2 instances, you stand a chance of recovering your data and restoring your workloads in the case of a disaster. Without a disaster recovery plan, however, you have to make decisions on the fly while recovering the instances. Under pressure, some of these decisions may not be rational, and mistakes may occur as a result. These mistakes can cause improper data recovery and may result in excessive time spent on recovery. When you have a tested disaster recovery plan in place for your AWS virtual environment, you know which actions to perform and in which order. As a result, the probability of successful recovery increases, and the recovery process takes less time.
How to Develop a DR Plan for AWS?
As discussed above, it is important to have a DR plan for your AWS environment. Now let’s consider the main points of creating a DR plan.
Determining RPO and RTO
Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are the key metrics used for disaster recovery and business continuity planning. These metrics are calculated by conducting a business impact analysis, which defines the influence of different factors on business processes.
RPO defines how much data your company can afford to lose. It is expressed as the maximum time that may pass between the last backup or replication and a disaster without your company suffering a significant loss. This means that the interval between backups or replications must not exceed the RPO value.
RTO defines how much time your company can spend on recovery without suffering a significant loss. The RTO dictates which protection method to choose: backup or replication. Having a replica allows you to restore an EC2 instance faster than when you only have a backup.
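The RPO rule above (the interval between backups must not exceed the RPO) is easy to check against a backup schedule. The following is a minimal sketch; the function names, the 4-hour RPO, and the timestamps are all illustrative assumptions.

```python
from datetime import datetime, timedelta


def rpo_violations(backup_times, rpo):
    """Return (earlier, later) pairs of consecutive backups spaced further apart than the RPO."""
    ordered = sorted(backup_times)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > rpo]


def data_loss_window(backup_times, disaster_time):
    """Return the time between the last backup before the disaster and the disaster itself."""
    earlier = [t for t in backup_times if t <= disaster_time]
    if not earlier:
        raise ValueError("no backup exists before the disaster")
    return disaster_time - max(earlier)


# Hypothetical schedule: backups at 00:00, 04:00, 08:00, 14:00, and 18:00.
rpo = timedelta(hours=4)
backups = [datetime(2018, 10, 10, hour) for hour in (0, 4, 8, 14, 18)]

gaps = rpo_violations(backups, rpo)
loss = data_loss_window(backups, datetime(2018, 10, 10, 13, 30))
print(f"intervals longer than the RPO: {len(gaps)}")
print(f"data lost if disaster strikes at 13:30: {loss}")
```

In this schedule the six-hour gap between the 08:00 and 14:00 backups breaks a four-hour RPO, and a disaster at 13:30 would cost five and a half hours of data.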
Choosing Your Backup Strategy
Determine which Amazon EC2 instances are the most critical and which are less critical. Depending on these results, choose the protection method for these instances.
Backup and Restore is a simple and cost-effective method. EC2 instances and critical data are backed up to a safe place. Backed up data can be compressed and deduplicated to consume less storage space. If a disaster occurs, the data can be recovered from these backups. Recovery time may be quite long when this method is used.
Replication and Failover allows you to restore EC2 instances with running services faster than the backup and restore method does. A replica is an identical copy of your EC2 instance that is stored on a DR site and is ready to be powered on in the event of failover, when the source instance is down. Using this method saves time during AWS disaster recovery.
Pilot Light is an idea based on the analogy of a gas heater with a small idle flame that can be turned up whenever needed. In the context of disaster recovery in AWS, this term means keeping a minimal set of core components of your virtual environment constantly running in the cloud. With critical data and applications ready, you can recover the other components of your infrastructure in a shorter time, thereby achieving tighter RTO values.
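One way to connect the three strategies above to an RTO is to pick the least expensive strategy whose expected recovery time still fits. The sketch below assumes made-up recovery times and cost rankings; they are placeholders to be replaced with figures from your own DR tests, not AWS or vendor benchmarks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    est_recovery_minutes: int  # hypothetical estimate; measure it in your own DR tests
    relative_cost: int         # hypothetical ranking, 1 = cheapest


# Illustrative figures only.
STRATEGIES = [
    Strategy("backup and restore", 240, 1),
    Strategy("pilot light", 60, 2),
    Strategy("replication and failover", 15, 3),
]


def cheapest_strategy_meeting_rto(rto_minutes, strategies=STRATEGIES):
    """Pick the lowest-cost strategy whose estimated recovery time fits the RTO."""
    candidates = [s for s in strategies if s.est_recovery_minutes <= rto_minutes]
    if not candidates:
        return None  # nothing listed is fast enough; revisit the RTO or the options
    return min(candidates, key=lambda s: s.relative_cost)


for rto in (480, 90, 30):
    choice = cheapest_strategy_meeting_rto(rto)
    print(f"RTO {rto} min -> {choice.name}")
```

Less critical instances with a generous RTO end up on backup and restore, while instances with a tight RTO justify the cost of a standing replica.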
Selecting a Remote AWS Site for DR
Amazon operates AWS datacenters in multiple geographical regions around the world, and each region is divided into availability zones. Different availability zones in the same region use different datacenters. If you are located in the Netherlands and your primary site is in the eu-west-1 region (Ireland), you can select the eu-central-1 region (Frankfurt, Germany) as a DR site. Note that the closer the AWS datacenter is to the geographical location of your company, the lower the latency of the network connection between you and your AWS virtual infrastructure.
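The selection logic above (a DR region different from the primary one, with latency as low as possible) can be expressed as a small helper. The region names are real AWS region codes, but the latency values are invented for illustration and the function name is hypothetical; measure latency from your own location before deciding.

```python
def pick_dr_region(primary, latency_ms):
    """Return the lowest-latency region that is not the primary region."""
    candidates = {region: ms for region, ms in latency_ms.items() if region != primary}
    if not candidates:
        raise ValueError("no region other than the primary is available")
    return min(candidates, key=candidates.get)


# Hypothetical round-trip latencies measured from an office in the Netherlands.
latency_ms = {"eu-west-1": 22, "eu-central-1": 12, "us-east-1": 85}

dr_region = pick_dr_region("eu-west-1", latency_ms)
print(f"primary: eu-west-1, DR site: {dr_region}")
```

Latency is only one criterion; data residency rules and pricing may also influence the choice of DR region.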
Choosing the Right Backup and DR Solution
When you have determined your RPO, RTO, disaster recovery strategy, and remote DR site, you can choose a software product that fits your DR strategy. Check whether the product supports the following features:
Application-aware backup and replication. When an application is running, some of the data it uses is stored in memory and changes over time. Backing up an EC2 instance while its applications are running could cause problems when the time for recovery comes. The effect can be similar to a computer being powered off unexpectedly after a power outage. The application must be quiesced, and the memory must be flushed to disk, in order to make an application-consistent backup or replica.
Flexible retention policy. Sometimes data loss goes unnoticed for a while, and by the time you notice it, you need an older backup. Keeping all backups is expensive, while keeping only the most recent backups (recovery points) prevents you from restoring an older version of the data. If the software that you use for backup and disaster recovery supports a flexible retention policy, you can set how many recovery points to keep for the last week, the last month, the last year, and so on.
Automated failover and failback. Failover is the process of switching from a failed instance to a replica stored on a DR site during disaster recovery in AWS. The network settings on the source site and the DR site may differ. In this case, having the disaster recovery software change the network settings automatically makes the process easier and saves time. Failback is the process of returning operations to the source EC2 instance once it is back in a normal state. Any new data written to the replica after failover must be transferred to the source EC2 instance during failback.
Granular recovery allows you to restore either the entire EC2 instance or separate files and application objects. This approach helps save time when only specific data is needed for recovery.
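The flexible retention policy described above can be sketched as a function that keeps the newest recovery point per day, per week, and per month, up to a configured count for each. This is a simplified illustration of one common scheme; the function name and the limits of 7, 4, and 3 are assumptions, and real products may group recovery points differently.

```python
from datetime import date, timedelta


def select_recovery_points(points, keep_daily, keep_weekly, keep_monthly):
    """Return the recovery points to keep: the newest N per day, ISO week, and month."""
    keep = set()
    rules = [
        (keep_daily, lambda d: d),                     # one per calendar day
        (keep_weekly, lambda d: d.isocalendar()[:2]),  # one per (ISO year, ISO week)
        (keep_monthly, lambda d: (d.year, d.month)),   # one per calendar month
    ]
    for limit, bucket_of in rules:
        seen = []
        for point in sorted(points, reverse=True):     # newest first
            bucket = bucket_of(point)
            if bucket not in seen:
                seen.append(bucket)
                if len(seen) > limit:
                    break                              # enough buckets for this rule
                keep.add(point)                        # newest point in a new bucket
    return sorted(keep)


# Hypothetical history: one recovery point per day for the last 90 days.
newest = date(2018, 10, 10)
points = [newest - timedelta(days=n) for n in range(90)]

kept = select_recovery_points(points, keep_daily=7, keep_weekly=4, keep_monthly=3)
print(f"{len(points)} recovery points reduced to {len(kept)}")
```

Recent history stays dense while older history thins out, so a late-discovered data loss can still be rolled back without storing every backup.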
When you have all of the input parameters including the software used for AWS disaster recovery, you can compose a detailed disaster recovery plan. Describe how often the data must be backed up or replicated. Specify what the recovery order of instances is, and what the recovery time limits are.
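The elements of the plan listed above (backup or replication frequency, recovery order, and recovery time limits) can be captured in a simple structure, which also makes the plan easy to validate. The sketch below is one possible layout; the `PlanEntry` fields, instance labels, and values are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class PlanEntry:
    instance: str        # hypothetical instance label
    priority: int        # 1 = recover first
    method: str          # "backup" or "replication"
    interval: timedelta  # how often data is backed up or replicated
    rpo: timedelta
    rto: timedelta


def recovery_order(plan):
    """Return instance labels sorted by priority; ties are broken by the tighter RTO."""
    return [e.instance for e in sorted(plan, key=lambda e: (e.priority, e.rto))]


def interval_problems(plan):
    """Return instances whose backup or replication interval is longer than their RPO."""
    return [e.instance for e in plan if e.interval > e.rpo]


plan = [
    PlanEntry("web-server", 2, "backup", timedelta(hours=24), timedelta(hours=24), timedelta(hours=8)),
    PlanEntry("db-server", 1, "replication", timedelta(minutes=30), timedelta(minutes=15), timedelta(minutes=30)),
    PlanEntry("file-server", 2, "backup", timedelta(hours=12), timedelta(hours=24), timedelta(hours=4)),
]

print("recovery order:", recovery_order(plan))
print("interval exceeds RPO for:", interval_problems(plan))
```

Here the check flags `db-server`, whose 30-minute replication interval cannot satisfy a 15-minute RPO, before the plan is ever put to use.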
Testing Your DR Plan in AWS
DR plan testing is an important part of preparing for disaster recovery in AWS. Testing helps you determine whether your AWS disaster recovery plan is workable and whether you can actually recover your data and workloads. Test your DR plan on a regular basis. Infrastructures change with time, applications get updated, and virtual disks grow, so periodic tests are a must. If a test fails, correct the DR plan accordingly.
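The outcome of a DR test can be compared against the plan's recovery time limits in a few lines. This is a minimal sketch with invented instance labels and timings; the function name is hypothetical.

```python
def evaluate_dr_test(measured_minutes, rto_minutes):
    """Return {instance: reason} for every instance that failed the DR test.

    measured_minutes maps an instance label to the recovery time observed in the
    test, or to None when the instance could not be recovered at all.
    """
    failures = {}
    for instance, minutes in measured_minutes.items():
        if minutes is None:
            failures[instance] = "recovery failed"
        elif minutes > rto_minutes[instance]:
            failures[instance] = f"took {minutes} min, RTO is {rto_minutes[instance]} min"
    return failures


# Hypothetical results from one test run.
measured = {"db-server": 25, "web-server": 90, "file-server": None}
rto = {"db-server": 30, "web-server": 60, "file-server": 240}

failures = evaluate_dr_test(measured, rto)
for instance, reason in failures.items():
    print(f"{instance}: {reason}")
```

Every entry in the result points to a part of the DR plan that needs correcting before the next test.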
Disaster recovery is as important for AWS as it is for virtualized environments. Create a DR plan to protect your AWS environment before a disaster occurs, and test your DR plan periodically. Special software for disaster recovery, such as NAKIVO Backup & Replication, offers a number of advantages over built-in snapshot features. NAKIVO Backup & Replication includes advanced Site Recovery functionality that enables you to perform automated, one-click failover and failback for EC2 instances. The product allows you to store backups of EC2 instances on cold EBS volumes, which is more cost-effective than storing EBS snapshots in S3. Moreover, backups are compressed, which saves additional space when storing them in the Amazon cloud.