May 24, 2022
AWS Disaster Recovery Best Practices
Any event that negatively affects a company’s business continuity, whether its workloads run on-premises or in the cloud, can be termed a disaster. A company should invest time and resources in identifying all possible risks and defining plans to prevent them or, at least, to mitigate their negative impact.
Creating a thorough disaster recovery (DR) plan for your on-premises and AWS cloud infrastructure should be a top priority. In this blog post, we cover best practices for disaster recovery planning in both cases, with an emphasis on AWS workloads.
Benefits of Using AWS for Disaster Recovery
The Amazon Web Services (AWS) platform offers a wide range of services, including databases, storage, compute power, and content delivery. In case of a disaster, AWS can also be used to quickly restore business operations running on virtual machines and EC2 instances. AWS allows you to create replicas and configure disaster recovery for both on-premises and cloud environments. Keeping business-critical data in the AWS cloud also removes the need for a secondary physical site with its own storage systems, which generally entails significant costs.
Your backup and replication data can be stored securely and reliably in multiple AWS Regions across the world. AWS also lets you run a third-party DR solution and test it for any deficiencies (AWS disaster recovery testing). You can then use AWS CloudFormation templates to define your DR infrastructure, such as an Amazon Virtual Private Cloud (Amazon VPC) and the resources inside it, as code that can be redeployed whenever it is needed.
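To illustrate what defining DR infrastructure as code can look like, here is a minimal sketch in Python that builds a CloudFormation template with one VPC and one subnet and prints it as JSON. The logical IDs (DrVpc, DrSubnet) and the CIDR ranges are hypothetical placeholders; a real DR template would also declare route tables, security groups, and the instances or databases you need to recover.

```python
import json


def build_dr_network_template(vpc_cidr: str = "10.1.0.0/16",
                              subnet_cidr: str = "10.1.1.0/24") -> dict:
    """Return a minimal CloudFormation template (as a dict) for a DR network.

    The logical IDs and CIDR defaults are illustrative placeholders.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Minimal DR network: one VPC and one subnet (illustrative)",
        "Resources": {
            "DrVpc": {
                "Type": "AWS::EC2::VPC",
                "Properties": {
                    "CidrBlock": vpc_cidr,
                    "EnableDnsSupport": True,
                    "EnableDnsHostnames": True,
                },
            },
            "DrSubnet": {
                "Type": "AWS::EC2::Subnet",
                "Properties": {
                    # Ref resolves to the ID of the VPC created above
                    "VpcId": {"Ref": "DrVpc"},
                    "CidrBlock": subnet_cidr,
                },
            },
        },
        "Outputs": {
            "DrVpcId": {"Value": {"Ref": "DrVpc"}},
        },
    }


template = build_dr_network_template()
print(json.dumps(template, indent=2))
```

The printed JSON could be saved to a file and deployed in the recovery Region with aws cloudformation deploy --template-file dr-network.json --stack-name dr-network (the file and stack names are placeholders). Because the template is plain data, the same definition can be redeployed for every DR test.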
AWS Disaster Recovery Scenarios
AWS describes four disaster recovery strategies. The right choice depends on your organization’s needs and budget, and the strategies can be combined to accommodate the specific needs of a given infrastructure.
- Backup and restore. Critical data is backed up and sent to an offsite location such as Amazon S3, where it is well protected and can be restored as needed. Amazon S3 can be accessed from anywhere through the AWS Management Console, the AWS CLI, or the API. You can copy data directly to Amazon S3 or create backups and store them in the cloud. This is one of the most popular and least expensive disaster recovery scenarios in AWS, although it has the longest recovery time of the four because the environment must be rebuilt from backups.
- Pilot light. This scenario keeps a minimal version of your environment running in the cloud: core components, such as databases, stay live and up to date, while the rest of the infrastructure is provisioned only when needed. During recovery, you can rapidly launch the most critical components of your AWS-based infrastructure from Amazon Machine Images (AMIs) and Amazon EBS snapshots. Pilot light recovers faster than backup and restore because the core of the environment is already running.
- Warm standby. In this disaster recovery scenario, a scaled-down version of your production infrastructure is always running in the cloud. During a DR event, it can be rapidly scaled up to minimize downtime and restore critical operations and workloads.
- Multi-site deployment (“hot standby”). This method replicates business-critical data and the core components of your infrastructure across several on-premises or cloud locations. All of these sites are active and share the traffic and workloads, so if a disaster affects one location, the remaining sites are ready to operate in full production mode. Traffic can be routed between sites with a service such as Amazon Route 53, while Amazon EC2 Auto Scaling adjusts capacity at each site. Hot standby achieves the lowest Recovery Time Objective (RTO) and Recovery Point Objective (RPO) of the four strategies. Keep in mind, however, that running several full production environments at once can be quite costly.
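The trade-off between the four strategies can be expressed as a simple rule: pick the cheapest strategy whose expected RTO and RPO still meet your targets. The sketch below encodes that rule in Python. The RTO/RPO figures in STRATEGIES are hypothetical placeholders chosen only to reflect the relative ordering described above; they are not AWS guarantees and should be replaced with values measured in your own DR tests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto_min: float  # hypothetical planning figure, in minutes
    typical_rpo_min: float  # hypothetical planning figure, in minutes
    relative_cost: int      # 1 = cheapest, 4 = most expensive


# Placeholder figures that only preserve the relative ordering of the strategies.
STRATEGIES = [
    Strategy("backup and restore", 24 * 60, 24 * 60, 1),
    Strategy("pilot light", 60, 15, 2),
    Strategy("warm standby", 15, 5, 3),
    Strategy("multi-site", 1, 0.1, 4),
]


def cheapest_strategy(rto_target_min: float,
                      rpo_target_min: float) -> Optional[Strategy]:
    """Return the cheapest strategy whose RTO and RPO both meet the targets."""
    for strategy in sorted(STRATEGIES, key=lambda s: s.relative_cost):
        if (strategy.typical_rto_min <= rto_target_min
                and strategy.typical_rpo_min <= rpo_target_min):
            return strategy
    return None  # none of the listed strategies meets the targets


print(cheapest_strategy(rto_target_min=120, rpo_target_min=30).name)  # pilot light
```

With a two-hour RTO target and a 30-minute RPO target, the sketch selects pilot light: backup and restore is cheaper but misses both targets, and warm standby or multi-site would meet them at a higher cost.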
The following features should also be mentioned in the context of disaster recovery:
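- Replication. If your primary workloads are in the AWS cloud, you can implement cross-region replication to protect them: critical data and system components are replicated to another AWS Region of your choice. When changes are made in the primary database, the replica can be updated either immediately (synchronous replication) or after a short delay (asynchronous replication). These two types of replication serve different business needs: synchronous replication keeps the replica identical to the primary but adds write latency, so it is usually limited to short distances, while asynchronous replication, which is typical across Regions, tolerates distance but means that the most recent changes may be lost in a disaster.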
- Failback. During a DR event, the workload of the affected instance is moved to the target site and the replica instance is powered on (failover). Once the primary site is restored, you can move the workload back to the original instance. To keep all the data changes made in the DR instance since failover, you need to reverse the direction of replication so that data flows back to the primary site (failback).
- Multiple AWS Regions. Each AWS Region is a separate geographic area that is isolated from the other Regions and contains multiple Availability Zones. For successful disaster recovery, you might choose to store data in two or more AWS Regions to mitigate the impact of extremely large-scale disasters.
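For data kept in Amazon S3, cross-region replication is configured per bucket with a replication configuration. The Python sketch below builds such a configuration as a plain dict in the shape that boto3’s put_bucket_replication accepts; the actual AWS call is left as a comment so the example runs without credentials. The role ARN, bucket names, and rule ID are hypothetical placeholders, and both the source and destination buckets must have versioning enabled before replication can be turned on.

```python
import json


def build_crr_config(role_arn: str, destination_bucket: str,
                     prefix: str = "") -> dict:
    """Build an S3 ReplicationConfiguration that copies objects to a DR bucket.

    Versioning must already be enabled on the source and destination buckets.
    """
    return {
        "Role": role_arn,  # IAM role that Amazon S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "replicate-to-dr-region",  # hypothetical rule name
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {"Prefix": prefix},    # empty prefix = every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


config = build_crr_config(
    role_arn="arn:aws:iam::123456789012:role/example-crr-role",  # placeholder
    destination_bucket="example-dr-bucket",                        # placeholder
)
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, the dict could be applied as:
# boto3.client("s3").put_bucket_replication(
#     Bucket="example-source-bucket", ReplicationConfiguration=config)
```

S3 replication is asynchronous, so objects written just before a Regional outage may not have reached the DR bucket yet; that lag is part of your effective RPO.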
AWS Disaster Recovery Best Practices
Here are the best practices for AWS disaster recovery, which you should remember when creating an AWS disaster recovery plan for your environment.
- AWS disaster recovery testing. After installing a DR solution, you should test it, either on demand or on a schedule. You can practice “game-day testing”, which simulates failures of your applications and instances to check whether your DR plan works as expected and your RTOs can be met. For this purpose, AWS CloudFormation can be used to deploy complete environments on Amazon EC2 from a resource template, which allows you to model and manage the infrastructure components of your cloud environment. Periodic testing verifies that all DR components are properly planned and organized and that your RTOs and RPOs can be met when it counts.
- Monitoring and alerting. To keep a potential disaster from wiping out your infrastructure, you need to identify issues quickly. Regularly monitoring the workflow and integrity of your system allows you to rapidly detect emerging threats such as connectivity issues, server failures, or application crashes. Amazon CloudWatch collects and tracks performance metrics for your AWS resources, and alarms can be set up to notify you when certain metrics reach a critical level.
- Regular backup and replication. Before disaster strikes, it is crucial to run regular backup and replication jobs so that you have an up-to-date target for failover. After switching to your DR environment, you should continue to run regular backup and replication jobs. Storing these backups and replicas in separate remote locations helps you avoid a single point of failure. Regular DR drills on AWS can also verify the state of your DR infrastructure.
- Use of AWS tools and techniques. To put AWS disaster recovery best practices in place, organize interdependent resources into recovery groups or application stacks. This way, you can recover your infrastructure in the correct order; for example, business-critical applications have the highest priority and should be recovered first, together with the components they depend on.
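Recovery ordering within an application stack is essentially a dependency problem: a component can only be started after everything it depends on is running. The minimal Python sketch below uses the standard library’s graphlib to split a stack into recovery waves and, within each wave, orders components by a priority number. The component names and priorities are hypothetical, and a real runbook would also include health checks between waves.

```python
from graphlib import TopologicalSorter
from typing import Optional


def recovery_waves(depends_on: dict[str, set[str]],
                   priority: Optional[dict[str, int]] = None) -> list[list[str]]:
    """Split components into recovery waves.

    Each component depends only on components from earlier waves, so a wave
    can start once the previous one is up. Within a wave, components are
    ordered by priority (lower number = more critical), then by name.
    """
    priority = priority or {}
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready(),
                       key=lambda name: (priority.get(name, 100), name))
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as recovered
    return waves


# Hypothetical application stack: each component maps to what it depends on.
stack = {
    "web": {"app"},
    "app": {"database", "cache"},
    "reporting": {"database"},
    "database": set(),
    "cache": set(),
}
print(recovery_waves(stack))
# [['cache', 'database'], ['app', 'reporting'], ['web']]
```

Priority only reorders components inside a wave; dependencies always win, so a critical web tier still waits for the database it relies on.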
AWS Disaster Recovery Services
AWS provides various services that support disaster recovery:
- AWS Elastic Disaster Recovery is a service for replicating and recovering applications that run on-premises or in the cloud. It continuously replicates your source servers to AWS and lets you launch recovery instances on Amazon EC2 when you need to recover your applications.
- AWS Import/Export (since succeeded by the AWS Snow Family) uses portable storage devices to transfer business-critical data and applications into and out of AWS. Because the data is shipped on physical devices rather than sent over the internet, even large amounts of data can be moved to the target location quickly and securely.
- Amazon Elastic Compute Cloud (Amazon EC2) provides on-demand computing resources that let you build a complete virtual data center in the AWS cloud. EC2 instances can be launched within minutes, and you retain complete control over them for the entire disaster recovery period.
- Amazon Simple Storage Service (Amazon S3) is designed to store and retrieve any amount of data, including your highest-priority data. It stores objects redundantly on multiple devices across multiple facilities, providing high durability and availability. AWS adds further protection through Identity and Access Management (IAM), bucket policies, Multi-Factor Authentication (MFA), and object versioning.
- Amazon Elastic Block Store (Amazon EBS) provides block-level storage volumes for your Amazon EC2 instances. You can back up EBS volumes with point-in-time snapshots, which are stored in Amazon S3 for long-term, reliable storage.
- Amazon Relational Database Service (Amazon RDS) makes it easier to set up, operate, and scale a relational database in the AWS cloud. It is a cost-efficient and flexible solution that automates many database administration tasks, such as backups and patching.
- AWS Direct Connect lets you set up a dedicated network connection between your on-premises network and the AWS cloud. Because this traffic does not travel over the public internet, it can provide more consistent network performance and, in many cases, lower data transfer costs.
- Secure access. When working with private or business-critical data, a high level of security is crucial for organizations of any scale. AWS Identity and Access Management (IAM) controls access to resources in your DR environment: you can create role-based and user-based policies that determine who can access critical data.
- Automation. Disaster recovery automation is an important aspect of AWS DR best practices. During a disaster recovery event, having full control over your AWS-based servers and your on-premises servers is essential. However, it is often impossible in practice to oversee the recovery of every single application and instance manually. Effective management requires orchestration and automation of disaster recovery processes, and a number of AWS management services are available for this purpose:
- AWS CloudFormation lets you provision infrastructure from templates in an automated way.
- AWS OpsWorks helps automate the configuration, deployment, and management of servers in your Amazon EC2 instances, as well as on-premises computing environments.
- Auto Scaling can increase or decrease the number of your instances to meet demand, based on scaling policies tied to Amazon CloudWatch metrics. This is extremely helpful during a disaster recovery event: capacity can grow automatically to handle the increased workload on servers and shrink once your production infrastructure returns to its normal state.
- Licensing. Running correctly licensed applications in your AWS environment is crucial for compliance. AWS offers several licensing models, such as “License Included” and “Bring Your Own License” (BYOL), to fit your specific business needs. Note that your data protection solution should also be properly licensed for seamless integration with AWS.
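As an example of the role-based access control described under Secure access, the Python sketch below builds a least-privilege IAM policy document that lets a DR operator list one backup bucket and read its objects, but not modify or delete them. The bucket name and statement IDs are hypothetical placeholders. Note that s3:ListBucket applies to the bucket ARN while s3:GetObject applies to the object ARNs (bucket/*), which is why the policy needs two statements.

```python
import json


def dr_bucket_read_policy(bucket: str) -> dict:
    """IAM policy allowing read-only access to a single DR backup bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListDrBucket",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",    # bucket-level action
            },
            {
                "Sid": "ReadDrObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",  # object-level action
            },
        ],
    }


policy = dr_bucket_read_policy("example-dr-backups")  # placeholder bucket name
print(json.dumps(policy, indent=2))
```

Saved as a file, the printed document could be registered with aws iam create-policy --policy-name example-dr-read --policy-document file://policy.json and attached to the role your DR operators assume (the policy and file names are placeholders).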
AWS Disaster Recovery Solution from NAKIVO
Amazon EC2 is a highly reliable and secure cloud platform. Nevertheless, a number of threats can disrupt the performance of EC2 instances and undermine business continuity. A dedicated, integrated backup and disaster recovery solution like NAKIVO Backup & Replication can help you achieve high reliability and meet your recovery objectives.
The NAKIVO solution can protect your cloud environment with Amazon EC2 instance backup and Amazon EC2 instance replication, allowing you to follow AWS disaster recovery best practices. The solution’s DR features include automated failover and failback, Site Recovery for orchestrating DR sequences of any complexity, and DR testing.
The product allows you to create and manage replicas of your original EC2 instances and store them in a target location of your choice. Instance replicas remain powered off at the DR site and can be quickly powered on during a DR event when instant recovery is required. As a result, you avoid the compute costs of keeping instance replicas running on standby.