January 15, 2019
VM Disaster Recovery Advice for Failover and Failback
In the modern world, any business could suffer from corruption of data and disruption of mission-critical operations from time to time. However, even a brief interruption of services can undermine customers' trust and eventually lead to significant losses. Businesses, especially those that run their services on VMs, must create a VM disaster recovery (DR) plan to ensure high availability and business continuity. This blog post describes the role of failover and failback in the DR process and discusses how you can use these strategies to protect your business.
What Is VM Disaster Recovery?
VM disaster recovery is the process of restoring your business infrastructure to a normal state after a disaster. A disaster could mean any event that puts an organization's operations at risk, encompassing natural and man-made hazards alike. Essentially, VM disaster recovery is aimed at restoration of the virtualized environment of an organization. The ultimate goal of any DR process is to near-instantly resume business operations and secure the most critical data for ensured business continuity.
DR measures are divided into three types. Preventive measures are intended to prevent a disruptive event from occurring. Detective measures are used to detect unwanted events and identify possible risks so that they can be mitigated. Corrective measures aim to restore a system after a disaster has occurred.
Failover and Failback in VM Disaster Recovery
Disaster scenarios almost always strike unexpectedly. In a DR event, it is critical to restore the virtualized infrastructure of your business as soon as possible, before any significant damage is done. Failover and failback can help ensure that your business continues to function properly, even if the production site is affected by a disaster.
What is failover?
When you experience software or hardware failure, you can quickly recover an affected VM by failing over to its replica. Failover is the process of transferring mission-critical workloads from the primary production center and recovering the system at an off-site location. The main goal of failover is to mitigate the negative impact of a disaster or service disruption on business services and customers.
Failover using VM replicas
During failover, a VM replica at a remote site is powered on to replace the original VM at the production site. You can fail over to the latest recovery point, which represents the state of the VM at a particular point in time. Running replication jobs as frequently as possible creates multiple recovery points and shortens the interval between them, which minimizes data loss in case of a disaster. Failover to replica is a cost-effective solution suitable for disaster recovery in the event of hardware or software failure.
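To illustrate how replication frequency bounds data loss, here is a minimal Python sketch. The function names and the 4-hour schedule are hypothetical and are not taken from any particular product; the sketch simply picks the newest recovery point taken before a failure and reports how much recent data would be lost.

```python
from datetime import datetime, timedelta

def latest_recovery_point(recovery_points, failure_time):
    """Return the newest recovery point taken at or before failure_time, or None."""
    usable = [rp for rp in recovery_points if rp <= failure_time]
    return max(usable) if usable else None

def data_loss_window(recovery_points, failure_time):
    """Return how much recent data is lost when failing over to the latest point."""
    rp = latest_recovery_point(recovery_points, failure_time)
    return None if rp is None else failure_time - rp

# Hypothetical schedule: one replication job every 4 hours, starting at midnight.
start = datetime(2019, 1, 15, 0, 0)
points = [start + timedelta(hours=4 * i) for i in range(6)]
failure = datetime(2019, 1, 15, 14, 30)

print(latest_recovery_point(points, failure))  # 2019-01-15 12:00:00
print(data_loss_window(points, failure))       # 2:30:00
```

With this schedule, the loss window can never exceed roughly four hours; replicating every hour would shrink it to roughly one.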
An alternative to failover to replica is a failover cluster: a group of independent computers that work together to ensure high availability of applications and services. A failover cluster consists of two or more interconnected servers (or nodes), on which VMs are running, and shared storage, where VM files are kept. If one of the servers fails, its VMs are restarted on another server. A failover cluster protects VMs only from hardware failure. Failover clustering is more costly than failover to replica. However, it provides almost zero downtime, as the VMs are automatically powered on at the secondary location when disaster strikes.
What is failback?
Once you have recovered your primary site after a disaster and resolved any associated issues, you can transfer business operations back to the source VM. Failback helps recover the original VM on the source host (or at a new location of your choice) and return workloads from the VM replica to the original VM. However, some changes might have occurred in the VM replica since failover. Thus, the original VM and the VM replica must be synchronized before performing failback so that no critical information is lost. In failback, only the changed data is sent back to the original system.
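The idea of sending back only the changed data can be sketched as follows. This is a simplified illustration, not how any specific product tracks changes: disks are modeled as hypothetical dictionaries mapping a block index to its contents, and blocks deleted on the replica are not handled.

```python
def changed_blocks(original, replica):
    """Return {block_index: data} for replica blocks that differ from the original."""
    return {idx: data for idx, data in replica.items() if original.get(idx) != data}

def fail_back(original, replica):
    """Copy only the changed blocks to the original disk; return how many were sent."""
    delta = changed_blocks(original, replica)
    original.update(delta)
    return len(delta)

# Hypothetical disks: block 2 was modified and block 3 was added after failover.
original_disk = {0: b"boot", 1: b"app-v1", 2: b"data-a"}
replica_disk = {0: b"boot", 1: b"app-v1", 2: b"data-b", 3: b"new-log"}

sent = fail_back(original_disk, replica_disk)
print(sent)                           # 2
print(original_disk == replica_disk)  # True
```

Only two of the four blocks cross the network, which is why an incremental failback is usually much faster than copying the whole VM.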
The failover and failback process
During a DR event, failover and failback operations are initiated. The process is performed as follows:
1. The source VM at the production site is replicated to the DR site. The data on the virtual disks of the VM replica is identical to the data on the virtual disks of the source VM at the moment of replication. If disaster strikes (or if a disaster is anticipated), failover to the VM replica is initiated.
2. During failover, the system workloads are transferred to the DR site. However, some changes might occur in the replica VM as operations continue. It is important to save such data because the original system is offline, not registering any of the changes made. Thus, all changes are written only to the virtual disk of the VM replica.
3. Once the negative consequences of a disaster have been rectified (or the possible threat has passed), the primary site can function as usual. Thus, the failback operation is executed; all the workloads are sent back from the DR location to the production site and the updated data is received by the source VM. The original VM and the VM replica become synchronized.
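The three steps above can be modeled as a small state machine. The class and method names below are hypothetical and exist only for illustration; the sketch tracks which site is active and where writes land at each stage.

```python
from dataclasses import dataclass, field

@dataclass
class ProtectedVM:
    source_disk: dict = field(default_factory=dict)
    replica_disk: dict = field(default_factory=dict)
    active_site: str = "production"

    def replicate(self):
        """Step 1: copy the source VM's data to the replica at the DR site."""
        self.replica_disk = dict(self.source_disk)

    def fail_over(self):
        """Step 2: transfer the workload to the replica."""
        self.active_site = "dr"

    def write(self, key, value):
        """Writes go only to the disk of whichever VM is currently active."""
        target = self.source_disk if self.active_site == "production" else self.replica_disk
        target[key] = value

    def fail_back(self):
        """Step 3: send the updated data to the source VM and return the workload."""
        self.source_disk.update(self.replica_disk)
        self.active_site = "production"

vm = ProtectedVM(source_disk={"orders": 10})
vm.replicate()
vm.fail_over()
vm.write("orders", 12)   # recorded only on the replica while production is offline
vm.fail_back()
print(vm.source_disk)    # {'orders': 12}
print(vm.active_site)    # production
```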
Best Practices for Failover and Failback in VM Disaster Recovery
- Ensure compliance with regulations. Some organizations operate with very sensitive and confidential data and are therefore required to comply with regulations such as HIPAA or PCI DSS. If this is applicable to you, then you must check whether your DR strategies for failover and failback meet the applicable security standards.
- Check licensing. Review your software documentation and determine whether there are any licensing limitations in your application stacks. If so, you must address any issues beforehand and ensure that all requirements are met.
- Define the scope of your DR plan. The scope of a VM DR plan determines which systems should be protected and identifies the expected results as well as any possible limitations. Ensure that your virtual environment has adequate technical capacity to cover all aspects of your plan.
- Choose a reliable data protection solution. Installing a properly licensed data protection solution in your virtual environment is crucial for efficient performance and seamless integration. For DR planning purposes, you must establish how long the product takes to recover your virtual infrastructure and restore all operations back to the production site.
- Decide who is responsible for failover and failback. Management should designate members of a recovery team and assign specific responsibilities to each team member. Determine who is responsible for monitoring failover and failback operations so as to avoid confusion in an actual recovery scenario when it matters.
- Train IT staff in failover and failback operations. Following along from the previous point, make sure that your IT staff have the necessary knowledge and qualifications to conduct failover and failback operations. The employees responsible should be fully prepared in case anything does not go as planned; they must have a solid understanding of the operations to be able to adapt accordingly and deal with any issues that arise.
- Review Service Level Agreements (SLAs). A service level agreement is a contract between a service provider and its customers that determines the requirements and service standards the provider is expected to meet. Thus, ensure that your SLAs are up to date and that their applicability extends to the DR environment.
- Define RTOs and RPOs. A recovery time objective (RTO) is the period of time within which business operations must be recovered after a disaster so as to prevent significant damage and critical losses. The recovery point objective (RPO) signifies the amount of data (measured in time) that can be lost without causing unacceptable levels of harm to your business. An RPO is essentially the furthest-back point in time that your VMs could be reverted to in case of a disaster. Your RTOs and RPOs should be established primarily based on the priorities of your organization during a disaster scenario. Running backup and replication jobs more frequently consumes more time and resources, but it allows you to meet considerably shorter RPOs. Shorter RTOs should be assigned to the components of the highest priority, which should be recovered first. Note that RTOs and RPOs should be established for applications and VMs separately.
- Consider the possibility of turning your DR site into a permanent site. Your business might be affected by a huge disaster that renders it impossible to restore your primary datacenter. Thus, consider the possibility of turning your DR site into a permanent site, so you can be prepared for an event of this scale in advance. Obviously, this is an expensive solution that consumes significant amounts of resources and entails major equipment, software, and facility costs. It can be beneficial to consider what would have to be done, even if you don’t go forward with the plan immediately.
- Test failover operations. By testing your failover procedure, you can check whether your virtual infrastructure can be properly recovered at your DR site and verify that your pre-installed applications can run successfully even when your production site is disabled.
- Test failback operations. This way, you can ensure that your company’s operations can be successfully restored from the DR site to the original site.
- Test your DR plan in full. Testing the entire DR plan is also worthwhile; it can help identify weaknesses in the plan by simulating a DR event. As a result, you can improve and adapt the DR strategies applied by your organization. A flawed and outdated DR plan can considerably disrupt your organization’s business continuity.
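As a rough illustration of the RTO and RPO guidance in the list above, the following Python sketch checks whether a replication schedule satisfies an RPO and orders VMs for recovery by RTO. The VM names and objectives are invented, and the check assumes that worst-case data loss is approximately equal to the interval between replication jobs (it ignores how long each job takes to run).

```python
from datetime import timedelta

def meets_rpo(replication_interval, rpo):
    """Assume worst-case data loss is about one replication interval."""
    return replication_interval <= rpo

def recovery_order(rto_by_vm):
    """Return VM names sorted so that the shortest RTO is recovered first."""
    return sorted(rto_by_vm, key=rto_by_vm.get)

# Hypothetical per-VM objectives.
rtos = {
    "file-server": timedelta(hours=8),
    "database": timedelta(minutes=15),
    "web-frontend": timedelta(hours=1),
}

print(meets_rpo(timedelta(hours=1), rpo=timedelta(hours=4)))  # True
print(meets_rpo(timedelta(hours=6), rpo=timedelta(hours=4)))  # False
print(recovery_order(rtos))  # ['database', 'web-frontend', 'file-server']
```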
Failover and Failback in NAKIVO Backup & Replication
NAKIVO Backup & Replication offers an exclusive Site Recovery functionality, which enables you to create automated recovery workflows (or jobs) of any complexity. Site recovery (SR) workflows involve custom sequences of actions, such as failover, failback, start/stop VMs, run/stop jobs, attach/detach repositories, etc. These actions can be arranged in any order for total automation and orchestration of the DR process. Furthermore, you can easily modify, supplement, or test your SR jobs at any time without disrupting the production environment. Thus, even the most sophisticated DR plan can be built, tested, and then implemented smoothly with the use of SR workflows.
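Conceptually, such a workflow is an ordered list of actions that runs until one of them fails. The Python sketch below illustrates that idea only; it is not the NAKIVO Backup & Replication API, and every function name and state key in it is hypothetical.

```python
def run_workflow(actions, state):
    """Run (name, function) actions in order; stop at the first one that fails.

    Returns the names of the completed actions and the name of the failed
    action (or None if every action succeeded).
    """
    completed = []
    for name, action in actions:
        if not action(state):
            return completed, name
        completed.append(name)
    return completed, None

def stop_source_vm(state):
    state["source_running"] = False
    return True

def fail_over_to_replica(state):
    # Assumption: failover needs a replica that was created beforehand.
    if not state.get("replica_exists"):
        return False
    state["active"] = "replica"
    return True

def start_replica_vm(state):
    state["replica_running"] = True
    return True

workflow = [
    ("stop source VM", stop_source_vm),
    ("fail over", fail_over_to_replica),
    ("start replica VM", start_replica_vm),
]

state = {"replica_exists": True, "source_running": True}
completed, failed_at = run_workflow(workflow, state)
print(completed)  # ['stop source VM', 'fail over', 'start replica VM']
print(failed_at)  # None
```

Because the actions are plain list entries, they can be reordered or extended without changing the runner, which mirrors the idea of arranging actions in any order.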
Failover in Site Recovery
The failover action is an integral part of most SR workflows. Site recovery involving failover can be executed only if you have previously created replicas of the source VMs you want to protect; these are used as the targets for failover when disaster strikes. The workload is transferred from the source VM at the affected production site to a VM replica at the DR site.
NAKIVO Backup & Replication provides three types of failover:
- Planned failover is used for pre-emptive protection of your systems when there is a potential threat or if a disaster is expected. If you have been notified of weather hazards or if there is a scheduled power outage in the area, you can initiate planned failover. In this case, the solution synchronizes data between the source VM and its replica before transferring the workload to the replica, so no data is lost.
- Test failover helps you determine whether your failover strategies are functional and whether they can be relied upon in case of a DR event. Test failover is performed similarly to planned failover, except that all changes made in test mode are immediately reverted so as to cause no disruption in the primary environment. Furthermore, you can test whether your workflow runs sufficiently quickly in a DR event. NAKIVO Backup & Replication allows you to set an RTO for your site recovery job. If the job takes longer than the set time to complete, the test is considered failed. A test/run report is sent via email, which you can examine to identify deficiencies in your DR plan and resolve them.
- Emergency failover is executed immediately after disaster strikes your production site and the source VM can’t be reached. With NAKIVO Backup & Replication, you can move the workload from the primary site to the DR site in just one click. This keeps downtime to a minimum, though any data changed since the last replication might be lost.
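The RTO check described for test failover above can be sketched as a timed run that fails when the job exceeds its time budget. This is a generic illustration with hypothetical names, not the product's implementation; the clock is injectable so that the logic can be checked without waiting.

```python
import time

def run_timed_test(job, rto_seconds, clock=time.monotonic):
    """Run job() and report whether it finished within the RTO."""
    started = clock()
    job()
    elapsed = clock() - started
    return {"elapsed": elapsed, "passed": elapsed <= rto_seconds}

def sample_recovery_job():
    """Hypothetical stand-in for a site recovery job."""
    time.sleep(0.01)

report = run_timed_test(sample_recovery_job, rto_seconds=60)
print(report["passed"])  # True
```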
Re-protecting VMs at the DR site
Once failover has run, you should make sure that the VM replicas running at your DR site are protected. VM replicas can also get damaged, and if there were no other copies, it would be impossible to immediately recover them.
NAKIVO Backup & Replication lets you ensure that your virtual infrastructure is re-protected after a DR event. Simply replicate the VMs running at your DR site to another location. This way, you can easily fail over to your new VM replica if anything unexpected happens. You can configure your SR workflows to automatically initiate replication of the VMs running at the DR site as soon as failover is completed, so that the workloads do not remain unprotected.
Failback in Site Recovery
Failback can be performed only after failover has occurred in an SR workflow. After some time, when your primary site is back up and running, you can resume running operations on the original source VM. For this purpose, you can fail back to this VM from the VM replica that replaced it. If the VM workloads can’t be transferred back to the primary production site (e.g., because it cannot be restored), they can be transferred to any other new location of your choice for a longer-term solution than the DR site.
Failback can be run in production mode or in test mode.
- Failback in test mode is intended to determine whether the SR job can run successfully, with no issues arising during the actual failback process. In this case, incremental or full replication from the VM replica to the source VM is performed only once, which is enough for testing purposes. Ensure that the IP address and network settings are correct. The source VM and the VM replica are synchronized so as to avoid data loss, and the source VM is then powered on. Note that all changes made to your VMs during the failback process are discarded after the test is completed and your virtual environment is reverted back to its pre-failback state. In test mode, a site recovery job can be run either on demand or on schedule.
- Failback in production mode is performed when you want to recover your production environment after DR failover. In production mode, a site recovery job can be executed on demand only. Failback in production mode essentially follows the same steps as failback in test mode. However, replication from the VM replica to the source VM is performed twice so as to ensure zero data loss in the process. Once the replication operation is complete, the original source VM (at the production site) is powered on and the VM replica at the DR site is powered off. (Note that this last step – the DR VM replicas being powered off – occurs only in the production mode.)
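The differences between the two failback modes described above can be summarized in a small lookup function. The function and key names are hypothetical; the values simply restate the behavior described in the two list items.

```python
def failback_plan(mode):
    """Return the settings that distinguish test failback from production failback."""
    if mode not in ("test", "production"):
        raise ValueError(f"unknown failback mode: {mode!r}")
    production = mode == "production"
    return {
        # Replication from the replica to the source VM: once in test, twice in production.
        "replication_passes": 2 if production else 1,
        # The replica at the DR site is powered off only in production mode.
        "power_off_replica": production,
        # Test mode reverts the environment to its pre-failback state.
        "revert_after_run": not production,
        # Production failback runs on demand only; test runs may also be scheduled.
        "can_be_scheduled": not production,
    }

print(failback_plan("test")["replication_passes"])        # 1
print(failback_plan("production")["power_off_replica"])   # True
```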
Understanding the technology behind failover and failback and integrating it into your VM disaster recovery plan can protect your virtual environment from any unexpected event. Failover ensures that mission-critical data is secured and all workloads are quickly transferred to a DR site. Failback allows you to switch back from the DR site to your production site in a few clicks. Together, these operations help you ensure minimal data loss and reduce downtime.