September 24, 2018
Site Recovery with NAKIVO Backup & Replication Part 5: Failover
The previous blog posts of our series on Site Recovery explored the planning of site recovery workflows, the creation of Site Recovery jobs with failover actions in NAKIVO Backup & Replication, and the testing of those jobs. Failover actions are an integral part of most Site Recovery jobs. Today’s blog post covers the failover action, including how failover works, types of failover, and requirements for failover. The post then includes a walkthrough on configuring and running failover as a part of a Site Recovery job in NAKIVO Backup & Replication.
What Is Failover?
In the context of virtualized environments, failover is the process of switching from a source (production) VM to a VM replica for the purpose of transferring workloads. A VM replica is an identical copy of a source VM as of a particular point in time. Failover helps you achieve high availability, a characteristic usually expressed as the percentage of time during which your VMs are up. A source VM is located at a production site, while the VM replica is often located at a geographically separate disaster recovery (DR) site. Failover is an operation mode that makes mission-critical systems highly available thanks to the redundancy provided by VM replicas.
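To make the "uptime as a percentage" idea concrete, here is a minimal Python sketch of the arithmetic. The function names and the sample figures are illustrative assumptions and are not part of NAKIVO Backup & Replication.

```python
# Illustrative arithmetic only; names and figures are assumptions.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability_percent(uptime_hours: float, downtime_hours: float) -> float:
    """Return uptime as a percentage of the total observed time."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return 100.0 * uptime_hours / total


def allowed_downtime_hours(target_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime budget for a period at a given availability target."""
    return period_hours * (1.0 - target_percent / 100.0)


# A 99.9% target leaves a budget of roughly 8.76 hours of downtime per year.
budget = allowed_downtime_hours(99.9)
```

The tighter the availability target, the smaller the downtime budget, which is why a quick switch to a VM replica matters.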
Types of Failover
There are three types of failover: planned failover, test failover, and emergency failover. Let’s consider each in more detail.
Planned failover is used for switching workloads to a VM replica with zero data loss. This type of failover can be used proactively when a potential disaster is predicted or suspected – for example, when the electric company has notified you about a planned power outage at your primary office on Monday, or when the weather forecast warns of a typhoon risk. Replication of the source VM is performed immediately prior to planned failover in order to create a fresh recovery point.
Test failover is used to ensure that virtual machines can be failed over and workloads can be migrated successfully. Test failover works similarly to planned failover. By performing test failover, you can train your staff for disaster recovery operations, check whether a site recovery plan is workable (read the article on site recovery plan testing for more information), and check how much time it takes to perform failover.
Emergency failover is used for quickly switching the workloads from a source VM to its VM replica if the source VM goes down. No additional data transfer occurs. Replication is not performed to add a new recovery point when you initiate emergency failover, because data on the source VM may be inconsistent at that moment (or the VM could be completely unreachable).
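The difference between the three types can be summarized in a short Python sketch. This is a conceptual model only: the enum, the function names, and the step descriptions are hypothetical, and treating test failover like planned failover is an assumption based on the statement above that the two work similarly.

```python
from enum import Enum
from typing import List


class FailoverType(Enum):
    PLANNED = "planned"
    TEST = "test"
    EMERGENCY = "emergency"


def replicates_first(kind: FailoverType) -> bool:
    """Emergency failover skips replication; the source VM may be
    inconsistent or unreachable. The other types are assumed to
    create a fresh recovery point first."""
    return kind is not FailoverType.EMERGENCY


def failover_steps(kind: FailoverType) -> List[str]:
    """Return the conceptual sequence of steps for a failover type."""
    steps = []
    if replicates_first(kind):
        steps.append("replicate source VM to create a fresh recovery point")
    steps.append("switch workloads to the VM replica")
    return steps
```

Calling `failover_steps(FailoverType.EMERGENCY)` returns only the switch step, which mirrors the point that no additional data transfer occurs in an emergency.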
Requirements for Failover
The first and most important requirement for VM failover is having a VM replica. Read our earlier blog post for a walkthrough on using NAKIVO Backup & Replication to create and run a replication job in preparation for failover.
Hardware requirements at the DR site must also be considered. The performance of your virtual machines depends on several hardware components, including CPU, RAM, disks, and network. If the hosts at the DR site do not have sufficient CPU performance, memory, storage capacity, or disk speed, the VMs on them may lag. This could threaten their ability to run the business-critical services they are designed to protect. Insufficient disk speed can slow down a VM’s performance dramatically, and insufficient storage space could cause complete VM failure, putting you back at square one. Virtual machines might also interact extensively with each other via the DR site network if the applications running on them depend on each other. Thus, your DR network speed must be high enough to ensure proper function as well.
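A rough capacity check along these lines can be sketched in Python. The resource figures and names below are hypothetical, and the check is deliberately simplistic: it sums the replicas' requirements and compares them with what a single DR host offers, ignoring CPU overcommitment, disk speed, and network throughput.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass
class Resources:
    """Hypothetical resource figures for a host or a VM replica."""
    cpu_cores: int
    ram_gb: int
    storage_gb: int


def shortfalls(host: Resources, replicas: List[Resources]) -> Dict[str, int]:
    """Return how much of each resource the host lacks (empty if none)."""
    gaps = {}
    for field in fields(Resources):
        needed = sum(getattr(r, field.name) for r in replicas)
        gap = needed - getattr(host, field.name)
        if gap > 0:
            gaps[field.name] = gap
    return gaps


dr_host = Resources(cpu_cores=16, ram_gb=64, storage_gb=1000)
replicas = [
    Resources(cpu_cores=4, ram_gb=16, storage_gb=300),
    Resources(cpu_cores=8, ram_gb=32, storage_gb=800),
]
print(shortfalls(dr_host, replicas))  # {'storage_gb': 100}
```

In this made-up example, CPU and memory are sufficient, but the host is 100 GB short on storage, which is the kind of gap that could cause VM failure after failover.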
Let’s consider how to perform failover as a part of a Site Recovery job. To create a failover action, you should first create a Site Recovery job. From the home page of NAKIVO Backup & Replication, click Create > Site recovery job.
1. Actions. In the left panel of the first screen of the New Site Recovery Job Wizard, you can see the list of actions that can be included in your workflow. Click Failover VMware VMs. (The VMware virtualization platform is used in the current example; you could similarly select Failover Hyper-V VMs or Failover EC2 Instances if you use one of these virtual environments.)
A configuration screen appears for the failover action. From the left panel, select the VM replica from the relevant replication job. This replica is to be used for failover. You can select multiple VM replicas at this step. In the right-hand panel, you can select a recovery point. The latest recovery point is used by default. Click Next to continue.
Select the action options. Tick the Power off source VMs checkbox if necessary and click Save.
You can use actions more than once when creating your site recovery workflows. Thus, you can add another failover action in this Site Recovery job in order to perform failover of another VM (or set of VMs) after those defined in the first failover action. Click Failover VMware VMs once again.
Select the VM replica(s) for failover as you did with the first failover action. Click Next.
As with the first action, select your failover options and click Save.
This Site Recovery job now includes two actions. Click Next to proceed.
2. Networks. Tick the Enable network mapping checkbox if you have different VM networks at the production site and disaster recovery (DR) site. Click Next to proceed. Consult our recent instructional blog post for more information on configuring Network Mapping and Re-IP.
3. Re-IP. Tick the Enable Re-IP checkbox if different addresses are used for IP networks at your production and DR sites. Click Next.
4. Test Schedule. Configure the scheduling options if you want to run periodic Site Recovery job tests automatically. Click Next.
5. Options. Set the job options. Enter a name for your new job (Site recovery job-Failover, in this example). Define the recovery time objective for testing purposes. Click Finish to finalize configuration of the Site Recovery job.
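The recovery time objective defined in the last step is essentially a time budget against which a test run can be compared. A minimal Python sketch of that comparison follows; the function names and the durations are illustrative assumptions and do not reflect how the product evaluates tests internally.

```python
from datetime import timedelta


def meets_rto(measured: timedelta, rto: timedelta) -> bool:
    """Return True if the measured recovery time is within the RTO."""
    return measured <= rto


def rto_margin(measured: timedelta, rto: timedelta) -> timedelta:
    """Return the spare time (positive) or the overshoot (negative)."""
    return rto - measured


# Hypothetical figures: a test run took 42 minutes against a 1-hour RTO.
test_duration = timedelta(minutes=42)
objective = timedelta(hours=1)
within_budget = meets_rto(test_duration, objective)
```

With these sample values, the test finishes within the objective and leaves an 18-minute margin.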
Now you can use this Site Recovery job if disaster strikes and perform failover to VM replicas.
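To illustrate what the Networks and Re-IP steps accomplish conceptually, here is a small Python sketch using the standard `ipaddress` module. The network names, the subnets, and the rule of keeping the same host offset within the target subnet are all assumptions made for illustration; the actual mapping and Re-IP rules are whatever you configure in the wizard.

```python
import ipaddress

# Hypothetical network mapping: production VM network -> DR VM network.
NETWORK_MAP = {"Production Network": "DR Network"}


def map_network(name: str) -> str:
    """Return the DR-site network for a production network
    (unchanged if no mapping is defined)."""
    return NETWORK_MAP.get(name, name)


def re_ip(address: str, source_net: str, target_net: str) -> str:
    """Move an address to the same host offset in the target subnet."""
    src = ipaddress.ip_network(source_net)
    dst = ipaddress.ip_network(target_net)
    ip = ipaddress.ip_address(address)
    if ip not in src:
        raise ValueError(f"{address} is not in {source_net}")
    if src.prefixlen != dst.prefixlen:
        raise ValueError("source and target prefix lengths must match")
    offset = int(ip) - int(src.network_address)
    return str(dst.network_address + offset)


print(re_ip("192.168.1.25", "192.168.1.0/24", "10.30.30.0/24"))  # 10.30.30.25
```

Keeping the host part of the address stable is one common convention, because it makes replicas easy to recognize at the DR site, but other schemes are equally valid.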
Re-Protecting the Environment
When your VMs are failed over and the workloads are migrated to the DR site, you should protect the VMs running at the DR site. If a VM replica that is running after failover fails in turn, you would have no way to quickly restore its data and workloads. NAKIVO Backup & Replication’s Site Recovery functionality allows you to re-protect your virtual environment immediately after disaster recovery.
In order to re-protect the VMs running at the DR site, you should first replicate these VMs to another safe place. That way, if the VM running at the DR site fails, you can fail over to your new VM replica quickly. The Site Recovery functionality allows you to add a Run job action in your Site Recovery job workflow, which you can use to add an existing replication job. Thus, you can set up your Site Recovery job so that as soon as VM failover is finished, replication of the VMs that are running after failover is performed automatically, ensuring the appropriate protection level.
Here is a walkthrough example of how to re-protect your VMs with a Site Recovery job this way.
Creating a Replication Job
On the home page of the web interface of NAKIVO Backup & Replication, click Create > VMware vSphere replication job.
1. VMs. Select the VM replicas that are used as the failover targets at the disaster recovery site by checking the boxes next to their names. In the current example, these two VM replicas were used to handle the workload upon failover in the Site Recovery job outlined above. Click Next.
2. Destination. Select the destination container (host or cluster) on which the VM should be run and a datastore within which to place the VM files. For the purposes of this example, the ESXi host 10.10.10.51 and datastore1 (which is attached to this ESXi host) are used. Click Next.
3. Networks. Tick the Enable network mapping checkbox if you have different VM networks at the source site (the DR site where the failed-over VMs are running) and the new target site. Click Next.
4. Re-IP. Tick the Enable Re-IP checkbox if the addresses used for networks differ between the source site (your DR site) and the new target location. Click Next.
5. Schedule. Configure scheduling options if you want to run the replication job periodically. Click Next.
6. Retention. Define the retention settings. Click Next.
7. Options. Configure the replication job options, including a name for the job. In this example, the replication job has been named VMware replication job Re-protection. Click Finish to finalize the creation of your replication job.
Editing a Site Recovery Job
Now that the new replication job is created, you can add a Run job action to your Site Recovery job. This way, you can automatically replicate the VMs running at the DR site. Since the original production VMs are now offline, your replicas at the DR site are now your only functional copies, so this is important for robust data protection.
On the home page of NAKIVO Backup & Replication’s web interface, right-click the name of the Site Recovery job you recently created. Click Edit in the context menu.
You can see the two failover actions added to the Site Recovery job earlier in this walkthrough. Find and click Run jobs in the action list located in the left panel of the Site Recovery Actions screen.
Select the appropriate replication job from the job list (the one you just created). Select action options as usual and click Save.
Add a Wait action between the failover action and the replication job. This gives the VM replica some time to start up and load the operating system (you cannot replicate a powered-off VM). From the Actions list in the left panel, click Wait.
Select a time to wait – 5 minutes should suffice for these purposes. Select the action options and click Save.
When you add the action, it is appended to the end of the action list. Click Move up and move the Wait action from the fourth position to the third position – it needs to occur before replication.
Now the actions are arranged in the appropriate order.
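The reordering described above can be modeled as a simple list operation. The Python sketch below is purely conceptual: the action labels and the `move_up` helper are hypothetical stand-ins for what you do with the Move up button in the web interface.

```python
from typing import List


def move_up(items: List[str], index: int) -> List[str]:
    """Return a copy of the list with the item at `index` moved up one place."""
    if index <= 0 or index >= len(items):
        raise IndexError("cannot move this item up")
    reordered = list(items)
    reordered[index - 1], reordered[index] = reordered[index], reordered[index - 1]
    return reordered


# The job as it stands after adding the Run jobs action.
actions = ["Failover VMware VMs (1)", "Failover VMware VMs (2)", "Run jobs"]

# A newly added action is appended to the end of the list (fourth position).
actions.append("Wait 5 minutes")

# Move the Wait action to the third position so that it precedes replication.
actions = move_up(actions, 3)
print(actions)
```

The resulting order is the two failover actions, then the wait, then the replication job, so the replicas have time to boot before they are replicated.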
Finally, the Site Recovery job is ready to be used for performing VM failover and automatic re-protection of the VM replicas used for failover. Right-click the name of your Site Recovery job on the home page and click Run job in the context menu.
VM failover to a replica is an important part of site recovery: it is the process of switching from a failed source VM to a VM replica, which is an exact copy of the source VM as of a particular point in time. The advanced Site Recovery functionality released with NAKIVO Backup & Replication 8.0 includes failover actions. You can use this flexible feature to create custom-tailored Site Recovery jobs with different combinations of actions for protecting your production environment. You can also configure the same jobs for automatic re-protection of your disaster recovery environment after failure of the VMs at the production site. Read the next blog post in this series to learn more about failback or download the Free Trial to try the solution in your own environment.