October 9, 2018
Site Recovery with NAKIVO Backup & Replication Part 6: Failback
Site Recovery is an advanced feature that was introduced in NAKIVO Backup & Replication 8.0. With Site Recovery you can easily create a disaster recovery plan for your environment and restore your virtual machines with their respective workloads. The previous blog post of the Site Recovery series explored VM failover, an action that can be included in a Site Recovery job to switch from a source VM to a VM replica. Today’s blog post covers the failback action (the reverse of failover), the role of failback in site recovery, and how the failback action can be used.
Failover entails migrating the workloads of a source VM to a VM replica, which is an identical copy of the source VM as of a particular point in time. This might be done because the production site (where the source VM is located) has been compromised by a disaster of some sort, or pre-emptively if disaster is anticipated. The VM replica is typically located at a temporary, geographically separate location called a disaster recovery (DR) site. When a source VM goes down and the failover process is used, all changes after failover are written to the VM replica but not to the source VM. Once the production site is back up and the source VM can run again, the changes written to the VM replica since failover must be transferred to the source VM. Hence, the data is re-synchronized with reverse replication.
Failback is the process of restoring virtual machines in their current states from the DR site to the production site and returning the workloads to the production site that originally handled them. Alternatively, you can use failback to migrate the workloads to a new production site. Let’s consider how the data is transferred using an example.
1. There are two sites: a production site and a disaster recovery site. A VM is replicated from the production site to the disaster recovery site. The source VM is located at the production site, while its VM replica is located at a disaster recovery site. The data on the virtual disks of the source VM and the VM replica is identical after replication. When disaster strikes (or threatens to strike), failover to a VM replica is performed.
2. After failover to a VM replica is performed, the workloads run at the disaster recovery site. Any further changes to the VM (e.g., transactions added to a database as customers make online purchases) are written to a virtual disk of the VM replica during operation. Some blocks are written, and some blocks are erased. The virtual disk of the source VM does not include those transactions.
3. The damage caused by the disaster has been resolved (or the threat has passed). The production site is functional again and, accordingly, the workloads must be returned to the production site from the DR site. The updated data of the VM replica must be transferred back to the source VM. The VMs must be re-synchronized with reverse replication through a failback process.
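The three stages above can be pictured with a toy model. The sketch below is only an illustration, assuming a virtual disk can be represented as a dictionary of blocks; the function names (changed_blocks, reverse_replicate) are hypothetical and do not reflect how NAKIVO Backup & Replication stores or transfers data.

```python
# Toy model: a virtual disk is a dict mapping block index -> block contents.
# Illustration only; not the product's actual replication mechanism.

def changed_blocks(replica, source):
    """Return the blocks to write to the source and the blocks to erase from it."""
    to_write = {i: data for i, data in replica.items() if source.get(i) != data}
    to_erase = [i for i in source if i not in replica]
    return to_write, to_erase


def reverse_replicate(replica, source):
    """Apply the delta so that the source disk matches the replica disk."""
    to_write, to_erase = changed_blocks(replica, source)
    source.update(to_write)
    for i in to_erase:
        del source[i]
    return len(to_write), len(to_erase)


# Stage 1: after replication, both disks are identical.
source_disk = {0: "boot", 1: "orders-v1", 2: "temp"}
replica_disk = dict(source_disk)

# Stage 2: after failover, only the replica receives changes.
replica_disk[1] = "orders-v2"   # a block is rewritten
replica_disk[3] = "orders-new"  # a block is added
del replica_disk[2]             # a block is erased

# Stage 3: failback transfers only the difference back to the source.
written, erased = reverse_replicate(replica_disk, source_disk)
print(written, erased, source_disk == replica_disk)
```

Only the blocks that differ are moved in stage 3, which is why reverse replication is cheaper than copying the whole replica back.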
Using the failback functionality built into NAKIVO Backup & Replication provides the following advantages:
- Your virtual machine data remains current after switching back from a DR site to the production site.
- You automate the process of migrating data and workloads back to the production site. You don’t need to delete the old VMs at the production site and copy the data of each VM replica from the DR site to the production site manually.
- The automation minimizes downtime when migrating the workloads from the DR site to the production site.
How Does Failback Work in NAKIVO Backup & Replication?
In order to make failback possible, the following conditions must be met:
- A VM replica exists and is in failover state (i.e., the replica has taken over the workload from the original source VM).
- The original source VM exists or a new location has been specified.
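The two conditions can be expressed as a simple check. This is a minimal sketch under stated assumptions: the FailbackRequest class, its field names, and the state strings "normal" and "failover" are hypothetical and are not part of any NAKIVO API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailbackRequest:
    replica_state: str                   # hypothetical values: "normal" or "failover"
    source_vm_exists: bool
    new_location: Optional[str] = None   # hypothetical target if the source VM is gone


def can_fail_back(req: FailbackRequest) -> bool:
    """Both conditions must hold before failback can start."""
    replica_ready = req.replica_state == "failover"
    target_available = req.source_vm_exists or req.new_location is not None
    return replica_ready and target_available


print(can_fail_back(FailbackRequest("failover", source_vm_exists=True)))
print(can_fail_back(FailbackRequest("normal", source_vm_exists=True)))
```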
Failback can be performed either in production mode or in test mode. Let’s consider how each case works in detail.
Executing production failback entails the following steps:
- Powering off the source VM (if it exists and is powered on).
- Creating a protective snapshot of the source VM (if the source VM is functional). Creating this snapshot allows you to restore a pre-failover state of the source VM in case failback cannot be performed properly.
- Running incremental replication (if the original source VM is online at the production site) or full replication (if the VM is being recovered to a new production site) from the VM replica to the source VM once.
- Powering off the VM replica (optional).
- Running incremental replication from the VM replica to the source VM once more. Only the changes made on the VM replica since the first pass need to be transferred, so the delta (difference in the data) between the VM replica and the source VM should be much smaller at this step.
- Connecting the original source VM to its new network with Network Mapping (optional).
- Modifying the static IP address of the original source VM with Re-IP (optional).
- Powering on the original source VM.
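To see why replication runs twice in the steps above, consider a rough simulation. It assumes that disks can be modeled as dictionaries of blocks and that the VM replica keeps receiving writes while the first pass is running; the sync function and the block counts are made up for illustration and say nothing about real transfer sizes.

```python
def sync(replica, source):
    """Copy differing blocks from replica to source; return how many were copied."""
    delta = {i: data for i, data in replica.items() if source.get(i) != data}
    source.update(delta)
    return len(delta)


source = {i: "old" for i in range(100)}
replica = {i: "old" for i in range(100)}

# Changes accumulated on the replica while it hosted the workloads after failover.
for i in range(40):
    replica[i] = "changed-during-failover"

# First pass: the replica is still running, so it can keep changing.
first_pass = sync(replica, source)

# A few more blocks change on the live replica while the first pass runs.
for i in (5, 6, 7):
    replica[i] = "changed-during-first-pass"

# The replica is powered off here (optional step), so no further writes occur.
second_pass = sync(replica, source)

print(first_pass, second_pass, source == replica)
```

In this toy run the bulk of the data moves while the replica is still serving users, and only a small remainder is transferred in the final pass.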
When failback is complete, cleanup is performed. The algorithm of cleanup differs depending on the outcome of the failback operation.
Cleanup After Successful Failback
There are three steps for cleanup if failback has run successfully:
- The protective snapshot is removed from the original source VM.
- The replication job is reconfigured to use your newly created primary (source) VM rather than the old one (optional; applies if you have failed back to a new VM).
- The VM replica is switched from failover (operational) state to normal state.
After a successful failback operation, both the source VM and the VM replica exist in their normal states.
Cleanup After a Failed Failback Action
If the failback operation is not executed successfully, for whatever reason, then three other steps are performed to roll back the environment to a pre-failback state:
- Reverting the source VM to the protective snapshot that was taken.
- Removing the protective snapshot from the source VM.
- Powering the VM replica back on.
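The two cleanup paths can be pictured with a small simulation. The sketch below is an assumption-laden illustration: ToyVM, fail_back, and the two transfer functions are hypothetical, replication is reduced to a single pass, and a plain copy of the disk stands in for a protective snapshot.

```python
class ToyVM:
    """Hypothetical stand-in for a VM: a disk, one optional snapshot, a power flag."""

    def __init__(self, disk, powered_on):
        self.disk = dict(disk)
        self.powered_on = powered_on
        self.snapshot = None


def fail_back(source, replica, replicate):
    """Run a simplified failback; roll back to the pre-failback state on error."""
    source.powered_on = False
    source.snapshot = dict(source.disk)      # protective snapshot
    try:
        replicate(replica, source)
        replica.powered_on = False
        source.powered_on = True
        succeeded = True
    except RuntimeError:
        source.disk = dict(source.snapshot)  # revert to the protective snapshot
        replica.powered_on = True            # replica keeps hosting the workloads
        succeeded = False
    source.snapshot = None                   # snapshot is removed in both outcomes
    return succeeded


def good_transfer(replica, source):
    source.disk = dict(replica.disk)


def broken_transfer(replica, source):
    source.disk.clear()                      # simulate a half-finished transfer
    raise RuntimeError("link to DR site lost")


src = ToyVM({0: "old"}, powered_on=False)
rep = ToyVM({0: "new"}, powered_on=True)
print(fail_back(src, rep, broken_transfer), src.disk)
print(fail_back(src, rep, good_transfer), src.disk)
```

In both outcomes the protective snapshot is removed at the end; what differs is whether the source VM keeps the transferred data or is reverted first.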
Test failback is performed when you manually run, in test mode, a Site Recovery job that includes a failback action, or when such a job runs on a schedule. The procedure for test failback differs from that of production failback. With test failback, all changes in your virtual environment made by the failback action are reverted to the pre-failback state after the test is run.
The procedure for test failback is as follows:
- Powering off the original source VM (if it is functional and powered on).
- Creating a protective snapshot of the original source VM (if it is functional).
- Running incremental replication to the original source VM (if it exists) or full replication to a new source VM from the VM replica once.
- Connecting the source VM to an isolated network (optional).
- Modifying the static IP address of the source VM (optional).
- Powering on the source VM.
As you can see, in test failback, the VM replica is used to host the workloads and is not powered off, which contrasts with the production failback procedure. Replication from a VM replica to the original source VM (or a new production VM) is performed once, not twice, because this is sufficient for testing purposes. In this case, the source VM can be connected to an isolated network so that there is no disruption whatsoever to the production environment.
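The differences between the two modes can be summarized in a short sketch that builds the list of steps for each. The failback_steps function and its parameter names are hypothetical, and "source exists" is used as a simplified stand-in for "source VM is functional"; this is not how the product plans its work internally.

```python
def failback_steps(mode, source_exists, source_powered_on=False,
                   power_off_replica=False, network=False, re_ip=False):
    """Return the ordered steps for a production or test failback (simplified)."""
    if mode not in ("production", "test"):
        raise ValueError("mode must be 'production' or 'test'")
    steps = []
    if source_exists and source_powered_on:
        steps.append("power off source VM")
    if source_exists:
        steps.append("create protective snapshot")
    steps.append("incremental replication" if source_exists else "full replication")
    if mode == "production":
        # Only production failback may power off the replica and replicate again.
        if power_off_replica:
            steps.append("power off VM replica")
        steps.append("incremental replication")
    if network:
        steps.append("connect to isolated network" if mode == "test"
                     else "apply network mapping")
    if re_ip:
        steps.append("apply Re-IP")
    steps.append("power on source VM")
    return steps


for mode in ("production", "test"):
    print(mode, failback_steps(mode, source_exists=True, source_powered_on=True,
                               power_off_replica=True, network=True, re_ip=True))
```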
Test Failback Cleanup
Test failback cleanup differs slightly from production failback cleanup.
If the source VM didn’t exist before the test failback was run:
- Removing the source VM.
If the source VM already existed before the test failback was run:
- Reverting the source VM to its state when the protective snapshot was taken.
- Powering on the source VM (if it was powered off).
- Removing the protective snapshot from the source VM.
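The two cleanup cases can be sketched as a small decision function. The function name and its parameters are hypothetical. One interpretation is also assumed here: "if it was powered off" is read as "if the test failback itself powered the source VM off", which the text above does not state explicitly.

```python
def cleanup_after_test_failback(source_existed_before, powered_off_by_test=False):
    """Return the cleanup steps that follow a test failback (simplified)."""
    if not source_existed_before:
        # The source VM was created only for the test, so it is discarded.
        return ["remove source VM"]
    steps = ["revert source VM to protective snapshot"]
    if powered_off_by_test:
        # Assumed reading: restore the power state the VM had before the test.
        steps.append("power on source VM")
    steps.append("remove protective snapshot from source VM")
    return steps


print(cleanup_after_test_failback(False))
print(cleanup_after_test_failback(True, powered_off_by_test=True))
```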
Preparing for Failback
First, you should create a Site Recovery job that includes failover actions. This process was described in a detailed walkthrough in the previous blog post of our Site Recovery series. A replication job and a VM replica are required to perform a failover action. A Site Recovery job must include a failover action in order to perform failback. The VM replicas must be in failover state; hence, you can perform failback only after performing failover. When all issues caused by disaster are resolved at the production site, you can prepare for failback to the source VMs.
Let’s walk through an example of how to perform failback with NAKIVO Backup & Replication. First, make sure that failover has been run as a part of a Site Recovery job (this job should have already been created).
Then create a new Site Recovery job; the failback actions can be incorporated into this job. On the home page of the web interface of NAKIVO Backup & Replication, click Create > Site recovery job.
The New Site Recovery Job Wizard is launched.
1. Actions. In the left panel of the Actions interface, click Failback VMware VMs (the VMware platform is considered in this example, but you can set up failback just as easily for other environments by clicking Failback Hyper-V VMs or Failback EC2 Instances).
Select the VM replicas to which the failback operation should be applied. Click Next.
Select a failback location – this could be the original production site or a new location. Click Next.
Select the job options. Tick the Power off replica VMs checkbox if needed. Click Save when you are ready to proceed.
After you have added the failback action, the Site Recovery job looks like this (see the screenshot below). Click Next.
2. Networks. Tick the checkbox if you need to enable network mapping for this job. Click Next.
3. Re-IP. Tick the checkbox if you need to enable Re-IP for this job. Click Next.
4. Test Schedule. Configure your scheduling options, then click Next.
5. Options. Define the Site Recovery job options and enter the job name. Click Finish to finalize the creation of this new Site Recovery job with failback.
Now you can run this Site Recovery job to perform VM failback: simply right-click your Site Recovery job’s name, select Run job, and select Test site recovery job or Run site recovery job according to your needs.
Failback is a critically important action for most site recovery workflows. It is performed to restore the workloads to a production site by transferring the updated data of the VM replicas that were used for disaster recovery back to the original source VMs (or to a new VM at a more permanent location). Failback allows you to keep the virtual machine data current, automate the data transfer process, and minimize downtime when migrating from a DR site to a production site.
This blog post concludes our series on Site Recovery in NAKIVO Backup & Replication, a complex but user-friendly feature with which you can flexibly implement your disaster recovery plan and protect your virtual environment against disasters. Try the latest v8 in your own environment to see how this functionality could save your business time and stress.