July 23, 2018
VM Failover Guide
The availability of VMs is essential for business continuity. When the services running on business-critical VMs become unavailable, companies can lose revenue. To restore VM availability as quickly as possible after a failure, appropriate failover techniques must be used. Your VM failover process should be described in the company’s disaster recovery plan as well as the business continuity plan, and will be dictated in part by the RPO/RTO values set for the VMs. This blog post discusses how VM failover works, the different types of VM failover, and which methods are preferable in different use cases.
There are three types of failover: planned, test, and regular.
Planned failover is used for migrating workloads from one site to another, including for cases when a possible disaster scenario is predicted. If you are forewarned with a weather alert about a tornado risk or notified about planned maintenance or electrical works at your primary site, you might perform a planned failover.
Test failover is used for testing purposes – for example, to check if your designated RTO and RPO values can be achieved when failover is executed. You can simulate the recovery procedure in your test environment before an actual disaster occurs to be sure everything functions well and can run smoothly when needed.
Regular failover is unplanned failover performed when a disaster occurs unexpectedly and a critical VM (or the whole primary site) goes offline. This could be caused by a natural disaster, a power outage, a virus attack, or any other incident. Hosts and replicas should be prepared for unplanned failover.
When you perform a VM failover, the failover sequence matters. The VM start order must be defined at the stage of disaster recovery plan development. There are dependencies between different services that run on different VMs. For example, some services and applications that run on VMs might use Active Directory, which is running on another VM, for authentication. A database server might be running on the first VM, an application server on the second, and the web server on the third VM.
The VM with Active Directory Server must be started first. Then the VMs with services that use Active Directory for authentication can be started. The VM with the database server must be started before the VM with the application server, because the application server connects to the database. Once the VMs with the database server and the application server have been started, the VM with the web server can be started.
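The start order described above can be derived from the dependencies rather than maintained by hand. The following is a minimal sketch using Python's standard graphlib module; the VM names are hypothetical, and the assumption that the database and application servers also authenticate against Active Directory is ours, made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical VM names. Each VM maps to the set of VMs it depends on,
# i.e. the VMs that must already be running before it is started.
dependencies = {
    "active-directory": set(),
    "database": {"active-directory"},
    "application": {"active-directory", "database"},
    "web": {"application"},
}

def start_order(deps):
    """Return a start order in which every VM comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())

order = start_order(dependencies)
print(order)
```

If the dependencies contain a cycle (for example, two VMs that each require the other), static_order() raises graphlib.CycleError, which is a useful check to run when the disaster recovery plan is written.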
Main Failover Solutions
The main solutions that are used in virtual environments are failover using VM replicas and failover clustering. Let’s consider each solution.
Clustering as a failover solution
Clustering is a high-grade automated solution that can be used for the most important, business-critical VMs. The Hyper-V virtual environment offers a Failover Cluster made up of several Hyper-V hosts. The VMware equivalent is a High Availability cluster (made up of ESXi hosts).
In the first diagram below, you can see a cluster wherein both hosts (also called nodes) are functioning properly. The VMs are running on hosts, and the VM files are located on a shared storage that is accessible by both hosts.
When one of the hosts goes down, the VMs are restarted on another healthy host. This is the failover process.
The following requirements must be met to build a failover cluster:
- Shared storage connected to the hosts over a dedicated high-speed, low-latency network. A clustered file system must be used to ensure that multiple hosts can access the data located on the storage concurrently.
- The hosts on which the VMs are running must have the same hardware, or at least hardware of the same family. The processors must support the same instruction sets to ensure compatibility; VMs must run properly after migration from one host to another during failover.
- A high-speed redundant network with low latency. There should be multiple, separate cluster networks, i.e., a cluster must have different networks for storage, management, VM migration, connection of hosts amongst each other, etc.
Failover clusters are used to protect VMs in case of hardware failure, providing high availability for critical VMs. If one of the hosts within the cluster fails, the VMs that were running on the failed host are restarted on other, healthy hosts in the failover process. Depending on your settings, the VMs that were failed over can be migrated back to the host on which they were running before the incident once the situation is resolved.
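To make the idea concrete, here is a small simulation of what happens to VM placement when a host fails. It is only a sketch: the host and VM names are hypothetical, and the "fewest VMs first" placement rule is an assumption chosen for simplicity, not the logic that a Hyper-V Failover Cluster or a VMware High Availability cluster actually uses.

```python
def fail_over(placement, failed_host):
    """Restart the VMs from failed_host on the remaining healthy hosts.

    placement maps a host name to the list of VMs running on it.
    A new placement is returned; the input dictionary is not modified.
    """
    healthy = {h: list(vms) for h, vms in placement.items() if h != failed_host}
    if not healthy:
        raise RuntimeError("no healthy host left in the cluster")
    for vm in placement.get(failed_host, []):
        # Assumed policy: pick the healthy host currently running the fewest VMs.
        target = min(healthy, key=lambda h: len(healthy[h]))
        healthy[target].append(vm)
    return healthy

cluster = {"host-a": ["vm1", "vm2"], "host-b": ["vm3"]}
after = fail_over(cluster, "host-a")
print(after)
```

Because the VM files live on shared storage, only the placement changes in this model; no disk data has to be copied during failover.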
The advantages of using a failover cluster
- A failover cluster provides automatic VM failover. You don’t need to start the failed VMs manually on other hosts.
- Upon failover, you experience near zero data loss. The downtime is usually limited to the time it takes to load the VM, the operating system (OS), and the software running on the VM.
- VMware Fault Tolerance, which can be enabled for VMs in a VMware High Availability cluster, provides VM failover with no downtime and no data loss.
A failover cluster does not protect against:
- Software failure of VMs. Software bugs or viruses can cause a system crash in a VM.
- Accidental deletion of files inside the VM.
- Shared storage failure. The cluster fails if shared storage fails. The shared storage is a crucial component of the cluster; the virtual disks that belong to the VMs within a cluster are stored on the shared storage.
Failover using VM replicas
This method of VM failover can be executed by an appropriate application-level solution, which can replicate the VMs and run the replicas when prompted by the administrator. The only things you need besides the appropriate software are ESXi or Hyper-V hosts (depending on your environment) to run the VM replicas when the source VMs fail.
In the diagram below, you can see two hosts connected with one another via the network. The VMs are using the disks of the hosts. The source VMs are running on the first host, and the VM replicas, which are exact copies of the source VMs at the appropriate point of time, are located on the second host in a powered-off state.
When one host goes down, the VMs that were running on that host also become unreachable. The VM replicas that are located on another host are then powered on by the administrator.
This method requires two or more hosts and a VM replica: a source VM running on the first host is replicated to a second host, where the VM replica is stored.
Failover using VM replicas can be used when hardware or software failure occurs. ESXi or Hyper-V host failure would be an example of hardware failure. Examples of software failure could be unsuccessful updates, software bugs, virus attacks, or accidental deletion of files by a user.
The main advantage is the possibility of failover to a remote site. When a VM replica is being created, the data copied from a source VM can be transmitted via network connection (with limited bandwidth) to a remote site. The remote site could be located in a nearby office or on the other side of the world. The VM replica could also be located at the same site.
Failover using VM replicas has the following disadvantages:
- There is a period of downtime.
- Failover must be initiated manually.
- The data written since the last replication can be lost. VM replication is not a real-time (synchronous) process; it runs at scheduled intervals.
- The network settings of the VMs must often be changed upon failover to another site. The VM networks of the remote site may differ from the networks of the primary site. Hence, the IP addresses might also be different, and must be checked and changed along with the other network settings during failover.
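The data-loss point above can be estimated before a disaster happens. The sketch below uses a simplified model of our own (not taken from any product): if the source VM fails just before a replication run finishes, the newest usable replica is one full interval old plus the time that run had already taken. The function names and the sample values are illustrative assumptions.

```python
from datetime import timedelta

def worst_case_data_loss(interval, replication_duration):
    """Upper bound on lost changes if the source fails just before a run completes."""
    return interval + replication_duration

def meets_rpo(interval, replication_duration, rpo):
    """Check whether a replication schedule can satisfy the target RPO."""
    return worst_case_data_loss(interval, replication_duration) <= rpo

# Hourly replication that takes about ten minutes, checked against a 2-hour RPO.
hourly_ok = meets_rpo(timedelta(hours=1), timedelta(minutes=10), timedelta(hours=2))
# Replication every four hours cannot meet a 4-hour RPO under this model.
four_hourly_ok = meets_rpo(timedelta(hours=4), timedelta(minutes=30), timedelta(hours=4))
print(hourly_ok, four_hourly_ok)
```

A check like this is a reasonable companion to a test failover, which verifies the RTO side of the same requirement.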
Differences Between VM Failover with Clustering and Failover Using a Replica
Essentially, the main differences are as follows:
- A failover cluster is used for high availability, while a replica failover is used for disaster recovery.
- A failover cluster protects VMs against hardware failure only, while a replica failover offers protection from both hardware and software failure.
- Cluster failover is performed automatically, while replica failover is performed manually.
- Cluster failover takes less time than failover to replica. Thus, there is less VM downtime for a clustering solution.
- The list of requirements is longer for clusters; clustering solutions are usually more expensive.
Combined Use of Clusters and Replicas for Failover
Although sometimes discussed as alternatives, cluster and replica failover solutions can complement each other. You can replicate the VMs running within a cluster to a host at a remote site. Moreover, you can replicate the VMs running within one cluster to another cluster. Thus, in the case of a host failure, the failover cluster keeps those VMs online. If the entire site experiences a disruption, you can fail over to the VM replicas stored at the remote site.
As another example, let’s consider a situation where a virus damages the files inside some VMs. The failover cluster could not protect against such failure, but if you have VM replicas with multiple recovery points, you can restore each VM to a point of time before their files were damaged or deleted. Using both failover solutions can help protect your VMs against server-level failures and site-level failures.
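Choosing which recovery point to fail over to in the virus scenario comes down to picking the newest one created before the damage occurred. A minimal sketch, assuming recovery points are identified by their creation timestamps (the function name and the sample times are hypothetical):

```python
from datetime import datetime

def latest_clean_recovery_point(recovery_points, damage_time):
    """Return the newest recovery point created before damage_time, or None."""
    candidates = [rp for rp in recovery_points if rp < damage_time]
    return max(candidates) if candidates else None

points = [
    datetime(2018, 7, 20, 2, 0),
    datetime(2018, 7, 21, 2, 0),
    datetime(2018, 7, 22, 2, 0),
]
# Files were damaged on the afternoon of July 21.
chosen = latest_clean_recovery_point(points, datetime(2018, 7, 21, 15, 30))
print(chosen)
```

If every recovery point is newer than the damage, the function returns None, which is the case a longer retention policy is meant to prevent.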
Choosing the Right Solution for VM Replica Failover
VMs can migrate from one host to another within a cluster after failover events or load balancing events. A failover cluster is usually configured in conjunction with load balancing. That’s why the software that you use for replicating the VMs from a cluster must be able to track the host on which a VM is residing.
NAKIVO Backup & Replication is a solution that can protect VMs running within a cluster, replicate VMs, and fail over to replicas. The product automatically tracks the host on which a VM is residing so it can replicate that VM. Clusters as well as standalone ESXi or Hyper-V hosts are supported as source and destination points for replication. NAKIVO Backup & Replication can change the VM network settings automatically upon failover; just use the Network Mapping and Re-IP features when configuring a replication or failover job.
Let’s consider an example of Automated VM Failover (with Network Mapping and Re-IP) with NAKIVO Backup & Replication. First, you need to create a VM replica. On the home page click Create > VMware vSphere replication job if you use the VMware virtual environment. (You can create a replication job for a Hyper-V VM or an Amazon EC2 instance in much the same way.)
The replication job wizard is launched.
1. Select the virtual machines that you want to replicate. In this example, the “Win7” VM that is running within the cluster called “cluster” will be replicated. Click Next.
2. At the second step, select a destination host for the VM replica to run on. Select the datastore mounted to the selected host for placement of the VM files. Click Next.
3. The next two steps are new, and were not available in versions of NAKIVO Backup & Replication older than v7.4. You can now set Network Mapping and Re-IP options when configuring a replication job or a failover job. In this walkthrough, Network Mapping and Re-IP will be configured later, when the failover job is configured. Thus, you can skip this step for the moment; just click Next.
4. Re-IP configuration will be explained during configuration of the VM failover job in this walkthrough. Click Next.
5. Select your scheduling settings. Click Next when you are finished.
6. Set the retention settings. Remember that you can set up a Grandfather-Father-Son retention policy at this step. Click Next.
7. Select the replication job options and click Finish or the Finish & Run button. Wait while the replica is created.
Now that you have a VM replica created, you can perform VM failover to replica. On the home page click Recover > VM failover to replica.
The new failover job wizard is launched.
1. In the left pane, select a VM replica that should be used for failover. In this walkthrough, the “Win7-replica” VMware VM, which was just created, is selected. In the right pane, select a recovery point. The latest recovery point is selected by default. Click Next.
2. Network Mapping helps you change the network the VM is connected to. The source and destination ESXi hosts likely have different virtual switch settings. Since a VM replica is an exact copy of the source VM, the virtual networks to which the source VM was connected are preserved in the VM replica. Without Network Mapping, you would have to check the network settings of a VM replica and change the network manually. NAKIVO Backup & Replication can map the source network to a destination network automatically; you just need to set up Network Mapping when configuring the replication or failover job.
In order to enable Network Mapping, tick the checkbox. If you have created a network mapping rule before, you can click Add existing mapping. If there are no network mapping rules, click Create new mapping.
To create a new network mapping rule, select the source network and destination network. The source network is the network to which the source VM was connected. The destination (or “target”) network is the network to which the VM replica should be connected. Note that the VM network name is not the same as the IP address or network address.
Click Save to save the network mapping rule, then click Next to proceed in configuration.
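Conceptually, a network mapping rule is a lookup from a source network name to a destination network name. The sketch below illustrates that idea only; it is not how NAKIVO Backup & Replication implements the feature, and the network names are hypothetical.

```python
def map_networks(replica_networks, mapping_rules):
    """Return the networks a replica should connect to after failover.

    Networks without a matching rule are left unchanged.
    """
    return [mapping_rules.get(name, name) for name in replica_networks]

# Hypothetical rule: the production port group maps to a DR port group.
rules = {"Prod Network": "DR Network"}
mapped = map_networks(["Prod Network", "Backup Network"], rules)
print(mapped)
```

Note that the keys and values are network names (labels on the virtual switches), not IP addresses; changing addresses is the job of Re-IP.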
3. The Re-IP feature allows you to change the IP settings of the VM replica. It can be used for static IP addresses. Tick the Enable Re-IP checkbox if you want to enable this option, then create a Re-IP rule or add an existing rule. Click Create new rule if no rules have been created yet. A pop-up dialog opens.
The source VM settings are the IP address and network mask that need to be changed.
The target settings are the settings to be applied for the VM replica when failover occurs.
In this example, the “*” character covers the last octet and signifies any number from 1 to 254. If the source IP addresses are, for example, 10.10.10.1, 10.10.10.96, and 10.10.10.222, the destination addresses would be 192.168.10.1, 192.168.10.96, and 192.168.10.222 respectively: the first three octets are replaced and the last octet is preserved. Read the blog post on Automatic VM failover with Re-IP to learn more about configuring network mapping and Re-IP.
Click Save to save your Re-IP rule and proceed.
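The wildcard behavior of a Re-IP rule can be illustrated with a few lines of Python. This is a sketch of the address translation only, written under the assumption that a rule is a pair of patterns such as 10.10.10.* and 192.168.10.*; the function name is hypothetical and this is not the product's own code.

```python
import ipaddress

def apply_reip(address, source_pattern, target_pattern):
    """Rewrite an IPv4 address according to a wildcard rule.

    Octets written as "*" in the target pattern are copied from the
    original address. Returns None if the address does not match the
    source pattern.
    """
    # IPv4Address validates the input and raises ValueError if it is malformed.
    octets = str(ipaddress.IPv4Address(address)).split(".")
    src = source_pattern.split(".")
    tgt = target_pattern.split(".")
    if len(src) != 4 or len(tgt) != 4:
        raise ValueError("patterns must have four dot-separated parts")
    for octet, pattern in zip(octets, src):
        if pattern != "*" and pattern != octet:
            return None
    return ".".join(o if t == "*" else t for o, t in zip(octets, tgt))

for ip in ("10.10.10.1", "10.10.10.96", "10.10.10.222"):
    print(ip, "->", apply_reip(ip, "10.10.10.*", "192.168.10.*"))
```

An address outside the source range, such as 10.10.20.5, does not match the rule and is left for another rule (or left unchanged).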
After adding the Re-IP rule, your screen should look like this:
You must now select the VMs to which the Re-IP rules should be applied. The failover job in this example contains only one VM replica, so only that one checkbox needs to be ticked.
Then select the credentials for each VM.
Click Manage credentials > Add credentials to add new credentials. The added credentials can then be selected from the drop-down list. The credentials are needed for NAKIVO Backup & Replication to access the network settings of the operating system inside the VM and apply the script that changes those settings. VMware Tools must be installed on VMware VMs and Hyper-V Integration Services must be installed on Hyper-V VMs. When you have configured all these settings, click Next.
4. Now, configure the failover job options. You can tick the Power off source VMs checkbox; it may be useful to prevent a conflict of IP addresses if both the source and replica VMs use the same network or have the same IP addresses.
After configuring all the options, click Finish and Run.
Wait until the failover job is complete.
Now you can verify that the VM replica is running. Go to Configuration > Inventory and click the Refresh All button. After refreshing, you can see that the “Win7-replica” VM is running on the target ESXi host. You can also manage the credentials, network mapping rules, and Re-IP rules from this page.
Today’s blog post has covered both primary methods of VM failover – failover clustering and failover using a VM replica. Failover clustering provides high availability and can protect VMs against hardware failure, while failover using a VM replica can additionally protect against software failure and site-wide failure. Each method has advantages and disadvantages. These methods complement each other and can be used together successfully. Using VM replicas for a failover can be useful if the entire cluster fails or the VM failure is due to software issues.
NAKIVO Backup & Replication is a fast, reliable, and affordable VM protection solution that can help you protect your VMs with the failover to replica method. Replication and failover of VMs running within a cluster are supported for VMware as well as Hyper-V virtual environments. Network Mapping and Re-IP are new features in v7.4 that allow you to automatically change the VM network settings during the failover process. The flexible retention policy helps you configure the replication settings according to your demands, providing the option to recover your VM as it was at an appropriate point in time.
Automated VM Failover with Network Mapping and Re-IP, together with other new features like Flash VM Boot for Hyper-V, Bandwidth Throttling, Global Search, Instant VM Recovery, and Instant File Recovery to Source, makes NAKIVO Backup & Replication more versatile and extends the product's range of applications.
Download the latest version of NAKIVO Backup & Replication and test the new product features in your environment.