December 5, 2018
Failback in VM Disaster Recovery
In today’s working world, every company needs a disaster recovery (DR) plan: a set of steps to follow when an unexpected and potentially dangerous event occurs in the business environment. Organizations without a DR plan put themselves at risk, because they have no prepared strategies that can be carried out promptly to minimize the impact of a disaster.
Essentially, DR plans guide a company through the steps that should be taken before and during a disaster. It is also critical to know how to act after a disaster has occurred, once the resulting damage has been repaired. Thus, after you have failed over mission-critical workloads to a DR site, you should evaluate the state of your virtual environment and be ready to resume operations at the primary production site. This blog post describes the role of the failback operation in the disaster recovery process, and provides tips on how you can improve your DR plan by including failback strategies.
Failback and failover in the VM disaster recovery process
Disaster recovery generally consists of two main elements: failover and failback. Failover is the process of switching from a source VM to a VM replica in order to move business-critical workloads from a damaged site to a DR site.
The next step is failback. As soon as you have restored your primary site and you are ready to resume operations, you can switch the workloads back to the original VM or to another VM at a new location. The data at the DR site and the primary site is re-synchronized so that changes made during the failover period are not lost.
Failover and failback are both time- and resource-consuming operations. Thus, it is essential to plan and test their implementation in advance. A recovery team (the people who are responsible for the DR plan) should be trained to perform failover and failback operations, and should also ensure that the appropriate technology is present at both the production site and the DR site.
What is failback?
Failback is the process of switching workloads from the VM replica back to the source VM after the disaster damage has been repaired. Failback also involves identifying the changes that were made while the DR site was standing in for the production site, and transferring those changes back to the original VM.
Failback should be performed with great care. A DR site might be inadequately equipped for running workloads and sustaining a virtual environment for a long time. Thus, it is essential to ensure that there is a secondary site that can support your production environment, and function as the primary site should your primary production center encounter serious damage.
Note that testing your failback strategies is an integral part of your DR plan. Failback testing should be closely monitored and documented by a recovery team in order to identify and close any gaps in its implementation. Regular failback testing can help you save precious time during a DR event, and relieves pressure on your staff.
Role of failback in a VM disaster recovery plan
A basic DR plan envisions the full recovery of the virtual environment at the DR site as the final goal. However, the return of the virtual environment to its initial state at the primary location (if it is still possible) is just as important. Failback is often overshadowed by failover, and its role in the DR plan is sometimes disregarded completely. Moreover, DR events may turn out to be false alarms, or their consequences may be successfully avoided. For this reason, an organization should have efficient and well-tested failback strategies to avoid delays in switching the workloads back to the primary production site.
Best tips for using failback
If your organization has decided to include the failback operation in your VM DR plan, consider following the tips below to make the return to the original site easier and more manageable:
- Maintain connectivity with the primary site. After failover to the VM replica, it is important to ensure that the connectivity between the production site and the DR site is intact. This way, the failback operation can be performed without interruption; the source VM and the target VM can be synchronized easily; and the risk of data loss during the transfer is reduced.
- Secure all data at the DR site. There is a chance that the DR site could also be affected in the event of a disaster, which would make it impossible to recover immediately after such a disruption. After you have failed over the workloads to the VM replica, and operations are running successfully at your DR site, you need to make sure that your virtual environment is protected. Running replication jobs and sending the latest VM replicas to another location will provide additional protection to your system.
- Ensure sufficient network bandwidth. Adequate bandwidth is needed for reliable connectivity and data synchronization between the production site and the DR site. It is therefore an integral element of failover and failback operations.
- Keep documentation up to date. In the event of a disaster, some critical documentation may get lost or damaged. Therefore, it is important to have copies available and secure at the remote location. Furthermore, you should ensure that service providers can replace lost or damaged documentation if the need arises.
- Check licensing. Review your software documentation and find out if there are any licensing limits in your application stacks. Additionally, it is important to ensure in advance that all contractual obligations are fulfilled in order to prevent any issues that could arise. This helps keep your systems and applications licensed and operational for the entire DR period.
- Test all systems and networks before failing back. Failing over is only one part of the VM DR process; challenges can also arise when failing back to the original location. Therefore, you should ensure that all systems and networks of the primary production site are ready to resume operations after failback. For this purpose, use the alternate location as your test environment; this can help you to determine the efficiency of your failback strategies. Information security policies should be functional and updated. Furthermore, internet access and sufficient network bandwidth should be available for smooth failback.
- Perform DR assessment. After the failback operation has been completed, it is important to create an after-action report that documents each step of the process, the results achieved, and the errors identified. On the basis of such a report, you can update your DR and business continuity (BC) plans with improvements for future use.
- Prepare a return-to-business plan. Apart from DR and BC plans, an organization should have a return-to-business plan, which provides strategies for the smooth transition of a virtual environment back to its original state. A return-to-business plan covers aspects such as IT assets, documentation, and network services, which are essential to the failback process. Additionally, the plan should clearly identify the duties and responsibilities of each recovery team member so as to avoid confusion.
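To make the bandwidth tip more concrete, here is a minimal Python sketch for estimating how long it might take to copy changed data back to the primary site. The function name, the `efficiency` factor (the share of the nominal link speed that is actually usable), and the example figures are illustrative assumptions, not measurements or product behavior.

```python
def estimate_sync_hours(changed_gb: float, link_mbps: float,
                        efficiency: float = 0.7) -> float:
    """Rough time, in hours, to transfer changed data over a WAN link.

    changed_gb -- data changed at the DR site, in decimal gigabytes
    link_mbps  -- nominal link speed in megabits per second
    efficiency -- assumed usable fraction of the link (0 < efficiency <= 1)
    """
    if changed_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid input")
    changed_megabits = changed_gb * 8 * 1000  # 1 GB = 8,000 megabits
    seconds = changed_megabits / (link_mbps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical example: 500 GB of changes over a 100 Mbps link.
    hours = estimate_sync_hours(changed_gb=500, link_mbps=100)
    print(f"Estimated synchronization time: {hours:.1f} h")
```

An estimate like this is only a starting point; the actual duration also depends on factors such as compression, latency, and competing traffic, which is one more reason to test failback in advance.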
Failback in NAKIVO Backup & Replication
Failback is a resource-intensive and time-consuming task which, if approached carelessly, can put your business at risk. To make the DR process easier, NAKIVO Backup & Replication provides the Site Recovery feature, which enables you to create and run recovery workflows (or jobs) of any complexity. Essentially, site recovery workflows automate your disaster recovery strategies to help you achieve maximum efficiency.
Site Recovery workflows
Site Recovery workflows are sets of specific actions and conditions, such as failover, failback, start/stop VMs, run/stop jobs, and attach/detach repository, which can be arranged in any order to automate and orchestrate the DR process. Furthermore, site recovery jobs can be modified, supplemented, or tested at any time without interrupting running workloads. Site recovery workflows can be designed for any DR scenario, ranging from the planned migration of datacenters to emergency failover. The number of workflows that you can create and integrate into your virtual environment is not limited, so you can prepare a dedicated workflow for each disaster scenario you anticipate.
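The idea of a workflow as an ordered set of actions can be illustrated with a short Python sketch. This is a conceptual model only, not the NAKIVO Backup & Replication API: the `Workflow` class, the action labels, and the choice to stop at the first failed action are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Workflow:
    """Hypothetical model of a recovery workflow: named actions run in order."""
    name: str
    actions: List[Tuple[str, Callable[[], bool]]] = field(default_factory=list)

    def add(self, label: str, action: Callable[[], bool]) -> "Workflow":
        """Append an action and return the workflow so calls can be chained."""
        self.actions.append((label, action))
        return self

    def run(self) -> List[str]:
        """Run the actions in order and return a log of the results."""
        log = []
        for label, action in self.actions:
            ok = action()
            log.append(f"{label}: {'ok' if ok else 'failed'}")
            if not ok:
                break  # assumption: stop at the first failed action
        return log


def build_demo_workflow() -> Workflow:
    # Placeholder actions that always succeed; real actions would call
    # whatever tooling manages your VMs and jobs.
    return (Workflow("emergency-failover")
            .add("stop replication job", lambda: True)
            .add("fail over VMs", lambda: True)
            .add("start VMs at DR site", lambda: True))


if __name__ == "__main__":
    for line in build_demo_workflow().run():
        print(line)
```

Modeling each action as a small function makes it easy to reorder steps or to swap in test doubles, which mirrors the way a workflow can be rearranged and tested before it is needed.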
Failback as a part of Site Recovery workflows
Running a failback job is impossible without prior failover, because failback is essentially the process of returning workloads that are in a failover state to their original state. Thus, to start a failback operation, you need to create a site recovery workflow that includes the failover action. You can fail back to the primary site or a new location from a VM replica that has replaced the original VM.
Failback can be run in both production and test mode.
- Failback in test mode is intended to verify whether a workflow would run successfully in production mode. A workflow in test mode can be run either on demand or on schedule.
Here, the VM replica runs all operations and remains powered on, while the original VM is powered off. A protective snapshot of the original source VM is created. Thereafter, incremental or full replication from the VM replica to the source VM is performed. Replication is only run once, which is sufficient for testing purposes. It is important to check that the IP address and network settings are correct in order to establish the connection between the sites, so that the source VM and the VM replica can be synchronized for smooth data transfer. Finally, the source VM is powered on.
Note that all changes in your VMs that are made during the failback process are discarded after the test has completed, and your virtual environment is reverted to its pre-failback state.
- Failback in production mode is performed when you want to recover your environment after a disaster has struck. A workflow in production mode can only be executed on demand.
Failback in production mode is similar to failback in test mode. However, replication from the VM replica to the source VM is performed twice (instead of just once) in order to ensure zero data loss in the process. In the end, the source VM is powered on and the VM replica at the DR site is powered off.
Note that VM replicas are only powered off in production mode.
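The difference between the two modes can be summarized in a short Python sketch that lists the steps described above. It is a conceptual illustration, not product code: the function name and step labels are hypothetical, and the exact position of the network check and of the replica power-off within the sequence is an assumption.

```python
from typing import List


def failback_steps(mode: str) -> List[str]:
    """Return the ordered failback steps for 'test' or 'production' mode."""
    if mode not in ("test", "production"):
        raise ValueError("mode must be 'test' or 'production'")

    steps = [
        "ensure source VM is powered off",
        "create protective snapshot of source VM",
        "replicate VM replica -> source VM (pass 1)",
    ]
    if mode == "production":
        # Production mode runs replication a second time.
        steps.append("replicate VM replica -> source VM (pass 2)")

    # Assumed position: network settings are verified before power-on.
    steps.append("verify IP address and network settings")
    steps.append("power on source VM")

    if mode == "production":
        # The VM replica is powered off only in production mode.
        steps.append("power off VM replica")
    return steps


if __name__ == "__main__":
    for mode in ("test", "production"):
        print(mode)
        for step in failback_steps(mode):
            print("  -", step)
```

Writing the sequence down in this form can also serve as a checklist when documenting a failback test.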
- Failback cleanup is the process of deleting unnecessary files after the completion of a failback job.
Test failback cleanup includes the following actions:
- If the source VM did not exist before the failback job testing (that is, it was created by the test), the source VM is removed.
- If the source VM was powered off during the job, the source VM is reverted to the protective snapshot and then powered on. The protective snapshot is removed from the source VM.
The execution of failback cleanup in production mode is different from test failback cleanup as it primarily depends on the outcome of the failback job.
If failback has been completed successfully:
- The protective snapshot is removed from the source VM.
- The replication job is set up to use a new source VM rather than the old one.
- The VM replica state is switched from Failover (operational) to Normal.
If failback has failed, the virtual environment is reverted to a pre-failback state:
- The source VM is reverted to the protective snapshot.
- The protective snapshot is removed from the source VM.
- The VM replica is powered on again.
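The two production cleanup outcomes listed above can be expressed as a small state update. The following Python sketch is a simplified model under stated assumptions: the `Environment` fields, the VM names, and the `production_cleanup` function are hypothetical and do not correspond to any real NAKIVO interface.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    """Hypothetical snapshot of the state right after a failback attempt."""
    source_has_protective_snapshot: bool = True
    source_reverted_to_snapshot: bool = False
    replica_state: str = "Failover"
    replica_powered_on: bool = False
    replication_job_source: str = "old-source-vm"


def production_cleanup(env: Environment, succeeded: bool,
                       new_source: str = "new-source-vm") -> Environment:
    """Apply the cleanup actions for a production-mode failback."""
    if succeeded:
        # Point the replication job at the new source VM and
        # return the replica to its normal state.
        env.replication_job_source = new_source
        env.replica_state = "Normal"
    else:
        # Roll back: revert the source VM while the snapshot still exists,
        # then bring the replica back online.
        env.source_reverted_to_snapshot = True
        env.replica_powered_on = True
    # In both cases the protective snapshot is removed last.
    env.source_has_protective_snapshot = False
    return env


if __name__ == "__main__":
    print(production_cleanup(Environment(), succeeded=True))
    print(production_cleanup(Environment(), succeeded=False))
```

Note that in the failure branch the revert must happen before the snapshot is removed, which is why snapshot removal is the final step in both branches.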
Most VM DR plans focus solely on protecting mission-critical data and recovering a virtual environment at a DR site. Generally, a DR plan lists the steps to take during the actual disaster, but not what to do after the disaster has passed. Thus, adopting and testing failback strategies is a crucial part of the VM DR process.
With NAKIVO Backup & Replication, you can create site recovery jobs that define a sequence of steps for automating and orchestrating DR activities. A site recovery job can be configured to deal with a specific issue in any DR scenario. The Site Recovery feature allows you to automate failback strategies as well. This enables you to keep your virtual environment at the DR site up to date, and to securely transfer the data back to the production site with minimal manual input.
NAKIVO Backup & Replication offers a full package of backup and disaster recovery solutions in a single piece of software. Download our full-featured free trial to test out the solution in your own environment.