November 19, 2018
AWS Disaster Recovery Best Practices
In the modern business environment, any business can be subject to disruption. Any activity that can negatively affect a company’s business continuity could be termed a disaster. It is crucial, therefore, for a company to invest time and resources into defining all possible risks and being able to prevent them – or at least act accordingly to mitigate their negative impact. Thus, creating a thorough disaster recovery (DR) plan for your infrastructure becomes a matter of the highest priority. In this blog post, we cover the best practices for disaster recovery planning in a cloud environment.
NAKIVO Backup & Replication is a reliable and highly flexible solution designed for AWS instance backup, replication, and AWS disaster recovery. The Site Recovery functionality, exclusive to NAKIVO Backup & Replication, allows you to arrange disaster recovery activities into automated workflows for maximum efficiency.
Benefits of Using AWS for Disaster Recovery
AWS is a dynamic product which offers a wide range of services including database storage, compute power, content delivery, and other distinct features. Moreover, AWS can help quickly restore your business operations running on virtual machines (instances) in case of disaster.
The AWS platform allows you to respond promptly to a disaster and recover your resources. Keeping business-critical data in the AWS cloud also removes the necessity for a secondary physical storage system, which generally entails significant costs. In fact, your data can be stored in multiple AWS regions across the world, securely and reliably. As a part of its disaster recovery functionality, AWS enables you to run and test a DR solution to check for any deficiencies. Then, you can use AWS CloudFormation templates to capture the resulting DR environment as code and reuse it, for example to redeploy the same resources into an Amazon Virtual Private Cloud (VPC).
Disaster Recovery Scenarios with AWS
The choice among AWS DR strategies depends on your business’s priorities. Various combinations are possible to accommodate the specific needs of your virtual infrastructure.
- Backup and restore. Business-critical data can be backed up and sent to an off-site location such as Amazon S3, where it is well protected and can be rapidly restored as needed. Amazon S3’s web user interface makes it accessible from anywhere. You can copy data directly to Amazon S3, or you can create backups and store them in the cloud.
- Pilot light. This DR scenario lets you have a small version of a virtual environment in the cloud, always keeping it running and up to date. You can rapidly recover and launch the most critical components of your AWS-based infrastructure. Amazon Machine Images (AMIs) and Amazon EBS snapshots are used for this purpose. The pilot light method is faster than the backup-and-restore strategy, as it significantly reduces the time spent on recovery.
- Warm standby. In this DR scenario, a scaled-down version of your production infrastructure is always running in the cloud. During a DR event, it can be rapidly scaled up to minimize downtime and restore critical operations and workloads.
- Multi-site deployment (“hot standby”). This method entails replicating business-critical data and the core components of your infrastructure and distributing them across several locations. All of these sites are active; they share the traffic and workloads. If a disaster affects one of the locations, you still have an intact system ready to operate in full production mode. Amazon EC2 Auto Scaling is used to run this process. With hot standby, minimal Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are achieved. However, running several virtual systems at once can be quite costly.
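The trade-off between the four scenarios above can be sketched as a small lookup. This is only an illustration: the ordinal ranks for recovery speed and cost are assumptions that follow the usual ordering of these strategies, and the names `Strategy`, `STRATEGIES`, and `cheapest_meeting` are hypothetical, not part of any AWS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    recovery_speed: int  # ordinal rank, higher = faster recovery (assumed)
    relative_cost: int   # ordinal rank, higher = more expensive (assumed)


# Ordered from slowest/cheapest to fastest/most expensive.
STRATEGIES = [
    Strategy("backup and restore", 1, 1),
    Strategy("pilot light", 2, 2),
    Strategy("warm standby", 3, 3),
    Strategy("multi-site (hot standby)", 4, 4),
]


def cheapest_meeting(min_speed: int) -> Strategy:
    """Return the least expensive strategy whose speed rank is at least min_speed."""
    candidates = [s for s in STRATEGIES if s.recovery_speed >= min_speed]
    if not candidates:
        raise ValueError(f"no strategy reaches speed rank {min_speed}")
    return min(candidates, key=lambda s: s.relative_cost)


if __name__ == "__main__":
    print(cheapest_meeting(3).name)
```

In practice the ranks would be replaced by your own RTO, RPO, and budget figures; the point is only that faster recovery generally costs more.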
The following features should also be mentioned in the context of Disaster Recovery:
- Replication. To ensure high availability, Cross-Region Replication can be implemented. Here, critical data and system components are replicated to any other AWS region that you choose. If any changes are made in the primary database, data can be updated either instantly (synchronous replication) or with a small delay (asynchronous replication). These two types of replication serve different business needs.
- Failback. During the DR process, the workload of the affected instance is moved to the target site and the replica instance is powered on (failover). Once the primary site is restored, you can recover the original instance. To save all the changes in data that were executed in the DR instance since failover, you need to reverse the flow of data replication back to the primary site (failback).
- Multiple AWS regions. Each AWS region is a separate and independent area intended to store either instances or data. For successful DR, you might choose to store data in two or more AWS regions to mitigate the impact of extremely large-scale disasters.
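The difference between synchronous and asynchronous replication mentioned above can be illustrated with a toy in-memory model. This is a sketch, not how any AWS service is implemented: `ReplicatedStore` and its methods are hypothetical names, and replication lag is modeled simply as a queue of writes that have not yet reached the replica.

```python
from collections import deque


class ReplicatedStore:
    """Toy model of a primary with one replica."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []
        self.replica = []
        self._pending = deque()  # writes not yet applied to the replica

    def write(self, record):
        self.primary.append(record)
        if self.synchronous:
            # Synchronous: the replica is updated as part of the write.
            self.replica.append(record)
        else:
            # Asynchronous: the write is queued and shipped later.
            self._pending.append(record)

    def ship_pending(self, max_records=None):
        """Apply queued writes to the replica (all of them by default)."""
        count = len(self._pending)
        if max_records is not None:
            count = min(count, max_records)
        for _ in range(count):
            self.replica.append(self._pending.popleft())

    def records_lost_on_failover(self) -> int:
        """Writes that exist on the primary but never reached the replica."""
        return len(self.primary) - len(self.replica)


if __name__ == "__main__":
    sync_store = ReplicatedStore(synchronous=True)
    async_store = ReplicatedStore(synchronous=False)
    for i in range(5):
        sync_store.write(i)
        async_store.write(i)
    async_store.ship_pending(max_records=3)  # the replica lags by two writes
    print(sync_store.records_lost_on_failover())   # prints 0
    print(async_store.records_lost_on_failover())  # prints 2
```

Synchronous replication loses nothing on failover but makes every write wait for the replica; asynchronous replication keeps writes fast at the price of a small window of potential data loss.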
Best Practices for AWS Disaster Recovery
- Testing. After installing a DR solution, you should test it. Testing can be run on demand or on schedule. You can practice “game-day testing”, which is a way of testing your applications and instances in order to check whether your DR plan works as expected and RTOs can be met. For this purpose, AWS CloudFormation can be used to deploy complete environments on Amazon EC2. You can create a resource template that allows you to model and manage infrastructure components in your cloud environment. Periodic testing verifies that all DR components are properly planned and organized and your RTOs and RPOs can be met when it counts.
- Monitoring and alerting. To prevent any possible disaster from wiping out your infrastructure, you need to identify potential issues quickly. You can regularly monitor the workflow of your system and check its integrity. This allows you to rapidly detect emerging threats such as connectivity issues, server failure, or application shutdown. Amazon CloudWatch evaluates the performance of your AWS resources. Alarms and notifications can be set up to notify you when certain metrics reach a critical level.
- Regular backup and replication. Before disaster strikes, it is crucial to prepare your system and run regular backup and replication jobs so you have a good target for failover. After you have switched to your DR environment, you should continue to run regular backup and replication jobs. Storing these backups and replicas in separate remote locations allows you to avoid the risk of having a single point of failure. AWS can run regular DR tests to verify the state of your DR infrastructure.
- Use of AWS tools and techniques. To ensure that best DR practices are in place, you must adopt recovery groups or application stacks. This way, you can arrange the recovery of your infrastructure properly – e.g., business-critical applications should be recovered first, as they have the highest priority.
To this end, AWS provides various services:
AWS Import/Export lets you use portable storage devices to transfer business-critical data and applications into and out of AWS. Thanks to Amazon’s high-speed internal network, even large amounts of data can be sent rapidly and securely to the target location.
Amazon Elastic Compute Cloud (Amazon EC2) allows you to use computing resources on demand and form a complete virtual data center in the AWS cloud. EC2 instances can be created within minutes, and you retain complete control over them for the entire DR period.
Amazon Simple Storage Service (Amazon S3) is designed for storage and retrieval of data of the highest priority. This service keeps business-critical components on multiple devices across a number of facilities, thus providing the highest level of availability. AWS ensures further protection through AWS Identity and Access Management (IAM), bucket policies, Multi-Factor Authentication (MFA), and object versioning.
Amazon Elastic Block Store (Amazon EBS) provides block-level storage volumes for use with your Amazon EC2 instances in the cloud. You can take point-in-time snapshots of these volumes; the snapshots are stored in Amazon S3, thus providing long-term and reliable storage of your data.
Amazon Relational Database Service (Amazon RDS) helps configure and manage a relational database in the AWS cloud. It is a cost-efficient and flexible solution for performing multiple database administration tasks.
AWS Direct Connect allows you to set up a dedicated connection between an on-premises network and the AWS cloud. This helps you secure and accelerate network connections without incurring high costs.
- RTO and RPO. Your recovery time objective (RTO) is the time period in which failed services and applications should be recovered after a DR event. Your recovery point objective (RPO), on the other hand, is the furthest point in the past to which you can afford to roll the system back, i.e., the maximum acceptable data loss measured in time. RTOs and RPOs should typically correspond with the priorities of your organization during a DR event. After defining your RPOs and RTOs (for both applications and instances), you should determine the frequency with which you create data backups and replicas in your environment. Creating recovery points more frequently means less data loss in case of disaster.
- Secure access. When working with private and/or business-critical data, providing a high level of security is crucial for organizations of any scale. To this end, you can apply AWS Identity and Access Management (IAM) which ensures secure access to resources in your DR environment. With IAM, you can create role-based and user-based security policies that control user access to critical data.
- Automation. During a DR event, having full control over your AWS-based servers and your on-premises servers is essential. However, it is often physically impossible to manually oversee the recovery of every single application and instance. For effective management, orchestration and automation of DR processes are required. There are a number of Amazon management services available for this purpose. A set of features included in AWS CloudFormation lets you provision infrastructure services in an automated way. AWS OpsWorks helps automate configuration, deployment, and management of servers on your Amazon EC2 instances, as well as in on-premises computing environments. Moreover, Amazon EC2 Auto Scaling can scale your instances up or down to meet demand based on the parameters you specify in Amazon CloudWatch. This is extremely helpful during a DR event, as the solution can automatically scale up to deal with the increased workload on servers and scale down once your production infrastructure processes are restored to their normal state.
- Licensing. Installing correctly licensed applications in your AWS environment is crucial for efficient performance. AWS has various types of licensing, such as “License included” and “Bring-Your-Own-License”, to comply with your specific business needs. Note that your data protection solution should also be licensed for seamless integration with AWS.
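The link between backup frequency and RPO described in the RTO and RPO item above can be made concrete with a little arithmetic. The sketch below assumes that a recovery point reflects the state of the data at the moment the backup job starts and only becomes usable once the job finishes; under that assumption, the worst-case data loss is the backup interval plus the job duration. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Oldest the latest usable recovery point can be when disaster strikes.

    Assumption: a failure just before the next job completes forces a restore
    from the previous recovery point, taken one interval plus one duration ago.
    """
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta,
              rpo: timedelta,
              backup_duration: timedelta = timedelta(0)) -> bool:
    """True if the schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


if __name__ == "__main__":
    rpo = timedelta(hours=6)
    print(meets_rpo(timedelta(hours=4), rpo))   # prints True
    print(meets_rpo(timedelta(hours=24), rpo))  # prints False
```

A nightly job therefore cannot satisfy a six-hour RPO, however reliable it is; the schedule has to be tightened or replication added.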
AWS Disaster Recovery in NAKIVO Backup & Replication
Amazon EC2 is a highly reliable and secure cloud service. Nevertheless, there are still a number of threats that could disrupt the performance of EC2 instances and undermine business continuity. NAKIVO Backup & Replication is a perfect solution for overcoming such issues.
NAKIVO Backup & Replication can protect your cloud environment with Amazon EC2 instance replication. The product allows you to create and manage exact copies (replicas) of your original EC2 instances and store them in a target location of your choice. Instance replicas remain in a powered-off state at the DR site, and can be easily powered on during a DR event when instant recovery is required. Thus, no extra costs are incurred for constantly keeping instance replicas on standby.
For each Amazon EC2 instance, up to 30 recovery points can be created and rotated according to the GFS (grandfather-father-son) policy.
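To show what grandfather-father-son rotation means in general terms, here is a minimal sketch of one common reading of the scheme: keep the newest few daily points, the newest point from each of the last few weeks, and the newest point from each of the last few months. It does not reproduce the product's actual rotation logic; the default counts (7 daily, 4 weekly, 12 monthly) and the function names are assumptions, and only the cap of 30 comes from the text above.

```python
from datetime import date, timedelta


def newest_per_bucket(ordered, bucket_key, count):
    """From dates sorted newest first, keep the newest date of each of the
    first `count` distinct buckets (for example ISO weeks or months)."""
    chosen = {}
    for d in ordered:
        key = bucket_key(d)
        if key not in chosen:
            if len(chosen) == count:
                break
            chosen[key] = d
    return set(chosen.values())


def gfs_keep(points, daily=7, weekly=4, monthly=12, limit=30):
    """Return the recovery point dates to retain, newest first."""
    ordered = sorted(set(points), reverse=True)
    keep = set(ordered[:daily])                                               # sons
    keep |= newest_per_bucket(ordered, lambda d: d.isocalendar()[:2], weekly)  # fathers
    keep |= newest_per_bucket(ordered, lambda d: (d.year, d.month), monthly)   # grandfathers
    return sorted(keep, reverse=True)[:limit]


if __name__ == "__main__":
    today = date(2018, 11, 19)
    history = [today - timedelta(days=n) for n in range(90)]  # one point per day
    for kept in gfs_keep(history):
        print(kept.isoformat())
```

Recent history stays dense while older history thins out, which is how a small number of recovery points can still cover a long time span.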
Once you have replicas created, NAKIVO Backup & Replication allows you to fail over to such replicas, i.e., switch from a source instance to its replica in order to move business-critical workloads from the production site (threatened by a disaster) to the DR site. NAKIVO Backup & Replication can automatically change the network settings of an instance during failover. Manually changing the settings for instances can be a time-consuming task, and if you are in a DR scenario, you likely have enough to worry about. To save valuable time, NAKIVO Backup & Replication has Network Mapping and Re-IP features that can be easily set up during configuration of a replication or failover job.
Network Mapping allows you to choose a network for automatic reconnection of your instances. Instances are connected to specific networks, which are likely to differ for the primary site and the secondary site. Network mapping involves mapping a source virtual network to a target virtual network (which you specify in advance) during a DR event. This saves you the trouble of manually configuring the networks for each instance.
Similarly, the Re-IP feature can automatically assign new network parameters to an instance during a DR event.
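Conceptually, both features amount to a lookup performed at failover time. The sketch below is a generic illustration using Python's standard `ipaddress` module, not the product's implementation; the network names, CIDR ranges, and the functions `map_network` and `re_ip` are hypothetical. It assumes a Re-IP rule that keeps the host part of an address and swaps the subnet.

```python
import ipaddress

# Hypothetical mapping of production networks to their DR counterparts.
NETWORK_MAP = {
    "prod-web": "dr-web",
    "prod-db": "dr-db",
}


def map_network(source_network: str, mapping: dict) -> str:
    """Return the target network configured for a source network."""
    try:
        return mapping[source_network]
    except KeyError:
        raise LookupError(f"no target network mapped for {source_network!r}") from None


def re_ip(address: str, source_cidr: str, target_cidr: str) -> str:
    """Translate an address from the source subnet to the same host offset
    in the target subnet (both subnets must be the same size)."""
    src = ipaddress.ip_network(source_cidr)
    dst = ipaddress.ip_network(target_cidr)
    ip = ipaddress.ip_address(address)
    if ip not in src:
        raise ValueError(f"{address} is not in {source_cidr}")
    if src.prefixlen != dst.prefixlen:
        raise ValueError("source and target subnets must have the same prefix length")
    offset = int(ip) - int(src.network_address)
    return str(dst.network_address + offset)


if __name__ == "__main__":
    print(map_network("prod-web", NETWORK_MAP))               # prints dr-web
    print(re_ip("10.0.1.25", "10.0.1.0/24", "172.16.5.0/24"))  # prints 172.16.5.25
```

Defining these rules in advance is what removes the per-instance manual work during a DR event.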
NAKIVO Backup & Replication offers advanced functionality that lets you create and implement custom plans to automate your disaster recovery strategies. This feature is called Site Recovery. You can create site recovery jobs (recovery workflows) of any complexity based on your current needs and priorities. You can modify, supplement, or test your site recovery jobs at any time without affecting the production environment. These workflows can be designed to deal with a variety of issues, ranging from planned migration of datacenters to emergency failover. You can create a special workflow for every type of DR scenario.
There are three different types of failover:
- Planned failover is generally used to protect the system from an impending threat or disaster. In this case, the solution synchronizes data between the source instance and its replica before transferring the workload to the replica. Thus, data loss is completely prevented.
- Test failover lets you check the feasibility and effectiveness of your DR plan. Moreover, NAKIVO Backup & Replication provides an option to test whether your RTO can be met if disaster strikes. You can set a designated time in which you want the site recovery job to finish running (RTO). The test is considered failed if your job exceeds this timeframe. You can also enable the option to send test/run reports when jobs are completed. Thus, you can identify shortcomings in your DR plan and improve its results over time.
- Emergency failover is run when your primary site is exposed to immediate danger. The solution instantly moves the workload from the primary instance to its replica without a final data synchronization. Thus, downtime is kept to a minimum, though some of the most recent data might be lost.
With NAKIVO Backup & Replication, you can combine the actions from the list below in any order to create a workflow that complies with your DR plan:
- Failover. Fail over to an already-created instance replica.
- Failback. Transfer workloads back from an instance replica at a DR site to the source instance at the production site.
- Start. Start one or multiple instances.
- Stop. Stop one or multiple instances.
- Run jobs. Run jobs (backup, replication, etc.) that you have already created for instances.
- Stop jobs. Stop running jobs for instances.
- Run script. Run a custom pre- or post-job script on a Windows or Linux machine.
- Attach repository. Attach a backup repository.
- Detach repository. Detach a backup repository that is attached.
- Send emails. Receive email notifications with results after a specific action is completed.
- Wait. Wait for a defined period of time before starting the next action.
- Check condition. Check whether a resource exists, whether a resource is running, and/or whether IP/hostname is reachable before proceeding to the next action.
Every site recovery job can be run in production or test mode. When you set up your jobs to run automatically on schedule, they are run in test mode. Test mode allows you to check whether your instances can be recovered within the RTOs defined in your disaster recovery plan. If you want to run site recovery jobs in production mode, you should activate the job manually. Once a site recovery job in test mode is completed, some actions (Start/Stop instances, Failover/Failback, and Attach/Detach Repository) are reverted so as to restore the environment back to its original state and ensure that the job can run properly in production mode during a disaster. When a site recovery job is run in production mode (e.g., if disaster strikes), recovery of your virtual environment is initiated and the actions are not reversed upon completion.
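The difference between the two modes can be sketched as a small workflow runner. This is a simplified model of the behavior described above, not the product's code: the `Action` class, the `run_site_recovery` function, and the in-memory `state` dictionary are hypothetical, and the RTO check simply compares elapsed time with a limit.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    name: str
    run: Callable[[], None]
    # Only revertible actions (e.g. Failover, Start/Stop) define this.
    revert: Optional[Callable[[], None]] = None


def run_site_recovery(actions, test_mode, rto_seconds=None, clock=time.monotonic):
    """Run actions in order; in test mode, undo revertible ones afterwards."""
    started = clock()
    completed = []
    for action in actions:
        action.run()
        completed.append(action)
    elapsed = clock() - started

    if test_mode:
        # Restore the environment to its original state, last action first.
        for action in reversed(completed):
            if action.revert is not None:
                action.revert()

    rto_met = rto_seconds is None or elapsed <= rto_seconds
    return {"elapsed": elapsed, "rto_met": rto_met}


if __name__ == "__main__":
    state = {"replica_running": False}
    workflow = [
        Action("Failover",
               run=lambda: state.update(replica_running=True),
               revert=lambda: state.update(replica_running=False)),
        Action("Send emails", run=lambda: None),  # nothing to revert
    ]

    result = run_site_recovery(workflow, test_mode=True, rto_seconds=60)
    print(result["rto_met"], state["replica_running"])

    run_site_recovery(workflow, test_mode=False)
    print(state["replica_running"])
```

After the test run the replica is powered off again, while the production run leaves it running, mirroring the rule that production-mode actions are not reversed.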
Coming up with effective disaster recovery strategies for Amazon EC2 virtual environments is a critical issue for many businesses. By implementing best DR practices, you can ensure business continuity and high availability despite the impact of a disaster.
The DR practices outlined above can be used to easily set up your Amazon EC2 virtual environment for fast recovery and a high level of data protection. AWS can scale up a company’s virtual infrastructure on a pay-as-you-go basis. Moreover, AWS provides several DR methods that serve the needs of businesses of different scales. Small and medium-sized businesses with lighter workloads might benefit most from “pilot light” environments. Large enterprises looking to minimize downtime and avoid significant loss of revenue, on the other hand, might choose the “hot standby” option.