December 13, 2018
Automatic Failover and VM Disaster Recovery Automation
Modern organizations operate in a state of constant competition. Thus, it is crucial to ensure business continuity and high availability of your organization’s services or goods to achieve dominance in the market. To protect your business from unpredicted events, such as system failure or service outages, an effective disaster recovery (DR) plan should be designed and constantly improved.
Traditional DR plans comprise a long list of steps and procedures to follow during a DR event. Implementation of a traditional DR plan can be a resource-intensive and time-consuming task that imposes a heavy burden on your staff. Moreover, manual execution of a DR plan entails the possibility of human error, which can disrupt the entire DR process. Smart businesses should consider Automatic Failover and VM Disaster Recovery Automation. This speeds up the recovery process, making 24/7 availability possible and minimizing (or eliminating altogether) the risk of human error.
About Automatic Failover
The way businesses handle the effects of a disaster primarily depends on how well they are prepared for it and how quickly any negative impact can be mitigated. Currently, a reliable data protection solution that can fully automate and optimize the failover process is an indispensable component of a robust virtual environment. To understand how automatic failover can protect your business, let’s first determine what this concept means and how it works.
What Is Automatic Failover?
Automatic failover is the process of automatically moving an application or a virtual machine (VM) to a DR site in case of system failure at the production site to achieve minimal downtime and zero data loss. In the modern business world, where organizations are expected to operate on an “always-on” basis, automatic failover is a necessity. It is no longer practical to employ traditional DR strategies, as they are resource-intensive, time-consuming, and error-prone. By contrast, automatic failover makes the recovery process more manageable; mission-critical data can now be recovered in a matter of minutes. This method of VM disaster recovery provides maximum efficiency with minimal input on your part.
Before adopting automatic failover for the protection of your virtual environment, consider the following:
Conduct a risk assessment (RA) and a business impact analysis (BIA) to determine which of your applications and VMs are most critical. This way, you can determine the recovery order of the system components on the basis of their priority levels and define the scope of your DR process.
RPO and RTO
These metrics are defined for applications and VMs separately depending on their priority levels. A Recovery Time Objective (RTO) is the period of time that is considered tolerable for restoring business operations after a disaster. A Recovery Point Objective (RPO) is a measure of how much data you can afford to lose; it is essentially the oldest acceptable point in time to which your VMs can be reverted in case of a disaster. When testing disaster recovery failover jobs, you can determine how realistic your DR objectives are and whether they can be met with your current systems.
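To make the two metrics concrete, the following minimal sketch checks a set of recovery points and a measured recovery time against target objectives. The function names, timestamps, and targets are illustrative assumptions and are not taken from any product.

```python
from datetime import datetime, timedelta

def achieved_rpo(recovery_points, failure_time):
    """Return the data-loss window: the time between the failure and the
    most recent recovery point taken before it."""
    usable = [p for p in recovery_points if p <= failure_time]
    if not usable:
        raise ValueError("no recovery point precedes the failure")
    return failure_time - max(usable)

def meets_objectives(recovery_points, failure_time, recovered_time,
                     rpo_target, rto_target):
    """Compare the achieved RPO and RTO with the targets."""
    rpo = achieved_rpo(recovery_points, failure_time)
    rto = recovered_time - failure_time
    return {"rpo": rpo, "rto": rto,
            "rpo_met": rpo <= rpo_target, "rto_met": rto <= rto_target}

# Hypothetical example: hourly replication, failure at 10:40, service restored at 10:55.
points = [datetime(2018, 12, 13, h) for h in (8, 9, 10)]
result = meets_objectives(
    points,
    failure_time=datetime(2018, 12, 13, 10, 40),
    recovered_time=datetime(2018, 12, 13, 10, 55),
    rpo_target=timedelta(minutes=30),
    rto_target=timedelta(minutes=20),
)
print(result["rpo_met"], result["rto_met"])
```

In this made-up scenario the 20-minute RTO is met, but the hourly schedule leaves a 40-minute data-loss window, so a 30-minute RPO would call for more frequent replication.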
Hardware and software
Execution of automatic failover requires a significant amount of RAM and CPU resources. Moreover, failover is impossible without a DR site equipped with the necessary hardware and software. The secondary site should be a remote data storage location where mission-critical data, including VM replicas, is stored. Building an operational DR site allows you to transfer the workloads of your production center to the secondary location in case of a disaster, resuming operations in a few clicks.
To ensure successful failover, regular testing of disaster recovery failover jobs should be performed. Testing allows you to identify issues and errors in the failover operation, which can then be eliminated. You can also determine potential risks and act to mitigate them in advance. This way, the failover operation can be fully reviewed and optimized to ensure a smooth, failure-resistant DR process. Create a custom schedule for testing each job to check the state of your virtual environment periodically and ensure its recoverability.
How Automatic Failover Works
Automatic failover cannot be performed without a VM replica at the DR site. A VM replica is a point-in-time copy of the original (source) VM. A VM replica is created with a replication job, then transferred to the DR site and stored there (powered off) for future use. If a disaster impacts (or threatens) the production center and renders the source VM unreachable, business operations can be easily resumed at the DR site with the help of automatic failover. Thus, it is essential to regularly update the systems at both the primary and the DR site, ensure reliable network connectivity for data synchronization, and install a data protection solution that supports automatic failover.
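The decision logic behind this process can be sketched roughly as follows. This is only an illustration: the probe and power-on callbacks, and the idea of waiting for several consecutive failed checks before failing over, are assumptions rather than a description of how any particular product works.

```python
def monitor_and_fail_over(probe, power_on_replica, max_checks, failure_threshold=3):
    """Probe the source VM up to max_checks times and fail over after
    failure_threshold consecutive failed probes.

    probe() -> bool and power_on_replica() are hypothetical callbacks
    supplied by the caller. Returns True if failover was triggered."""
    consecutive_failures = 0
    for _ in range(max_checks):
        if probe():
            consecutive_failures = 0   # source is reachable again
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                power_on_replica()     # replica assumes the workload
                return True
    return False

# Simulated probe results: one transient error, then a sustained outage.
results = iter([True, False, True, False, False, False])
events = []
triggered = monitor_and_fail_over(
    probe=lambda: next(results),
    power_on_replica=lambda: events.append("replica powered on"),
    max_checks=6,
)
print(triggered, events)
```

A real monitor would pause between probes; the pause is omitted here to keep the example short. Requiring several consecutive failures is one way to avoid failing over on a brief network glitch.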
Use Cases for Automatic Failover
Automatic failover can be used to protect your virtual environment in a number of ways.
- Disaster recovery. Automatic failover can transfer the production workloads of a source VM to the DR site and restore the virtual environment in a few clicks. Businesses with a large number of operational VMs should consider this option, as it would otherwise be challenging to rapidly restore multiple VMs at once. Automatic failover ensures minimum data loss and reduced recovery time, even in a complex virtual infrastructure.
- Planned migration. For any number of reasons, your organization may decide to move its production workloads from an old site to a new one. In this case, the process of data migration can be optimized with automatic failover. The source VMs should be powered off to ensure that they do not interfere with the VM replicas.
- Test migration. This feature allows you to verify that the data migration goes smoothly in case of a DR event. You can use test migration to identify any possible issues and risks before they become real problems, then address them in the next iteration of your DR plan.
Benefits of Automated Disaster Recovery
Automation of VM disaster recovery is a valuable asset in the modern business world that brings a number of benefits.
Previously, DR solutions could not support certain types of hardware, or would not work for mixed environments with physical and virtual servers. However, modern DR solutions are compatible with most hardware and software resources. It is now possible to easily synchronize data between a production site and a DR site, fail over and fail back between different platforms, and transfer workloads based on physical-to-virtual, virtual-to-virtual, or virtual-to-physical systems.
As mentioned above, testing is extremely important. However, testing the DR components in isolation often proves to be ineffective. If you do not test the DR process in full, you cannot know for sure how it will function in the actual working environment. Thus, consider installing a DR solution that provides full testing of your virtual environment. This way you can fully check the state of the system, find any gaps in your DR strategy, and revise the DR process accordingly. Optimization is essential for effective DR automation.
Recovery of several systems at once
Modern organizations tend to have large business environments containing multiple applications and VMs. In these environments, recovering one system at a time could disrupt the whole DR process, with the most critical data getting lost or corrupted during a disaster. Automated DR solutions let you rapidly restore several applications and services at once, which significantly reduces system downtime.
Improved RTO and RPO
DR automation enables you to optimize your RTOs and RPOs. These metrics are generally defined by the priorities of your organization. To meet a short RPO, replication jobs should be run frequently, creating many recovery points and thus ensuring that near-real-time replicas are available for recovery. The resulting VM replicas remain at the DR site and can be easily powered on in case of a disaster, which significantly shortens the RTO. Automated DR solutions let you test your DR processes against your objectives, including RTO values, to determine whether they can be achieved. If any issues are identified, the DR strategy can be updated accordingly to avoid such problems in the future.
Reduced network and storage costs
Installing automated DR solutions can significantly reduce network and storage costs. Feature sets that include data deduplication and thin provisioning help decrease data storage footprints and bandwidth requirements. This way, you can optimize your DR strategy to achieve the best possible results (zero downtime, shorter RTOs, high availability, and lower costs).
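To illustrate why deduplication shrinks the storage footprint, here is a minimal sketch of block-level deduplication. The fixed block size and the use of SHA-256 digests are assumptions made for this example; they do not describe how any specific product implements the feature.

```python
import hashlib

def deduplicate(data: bytes, block_size: int = 4):
    """Split data into fixed-size blocks and store each unique block once.

    Returns (store, recipe): store maps a SHA-256 digest to the block,
    and recipe lists the digests needed to rebuild the data in order."""
    store, recipe = {}, []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)   # keep only the first copy
        recipe.append(digest)
    return store, recipe

def restore(store, recipe) -> bytes:
    """Rebuild the original data from the block store and the recipe."""
    return b"".join(store[d] for d in recipe)

data = b"AAAABBBBAAAAAAAACCCC"   # 5 blocks, 3 of them unique
store, recipe = deduplicate(data)
print(len(recipe), len(store))
```

Only three of the five blocks need to be stored or sent over the network, and `restore(store, recipe)` still returns the original bytes.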
With DR automation, you can launch a DR process in a few clicks. After that, the recovery process follows an automated algorithm, comprising your pre-defined (and pre-tested) sequence of actions. The ultimate goal of the automated DR workflow is to resume regular operations and return your virtual environment to its original, functional state. Thus, businesses can operate on an “always-on” basis with minimal manual effort required.
Disaster Recovery Automation in NAKIVO Backup & Replication
NAKIVO Backup & Replication is an efficient solution aimed at protecting data in virtual environments and near-instantly resuming business operations in case of a disaster. NAKIVO Backup & Replication offers a set of exclusive features, including Automated VM Failover, that allow you to configure a recovery job in a few clicks and fully automate the disaster recovery process.
During a disaster scenario, Automated VM Failover makes the recovery process faster and more manageable. Before running a disaster recovery failover job, you must create a replica of the production VM at your DR site.
After the initial copy, only new changes are added to the target VM (the replica), using incremental replication technology. This way, the VM replica stays closely in sync with the source VM and can be instantly powered on to assume the workload if required.
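The idea of incremental replication can be sketched as follows. Modeling a disk as a list of blocks and comparing blocks directly are simplifications assumed for this example; real products usually rely on hypervisor change tracking to find modified blocks.

```python
def incremental_sync(source_blocks, replica_blocks):
    """Copy only the blocks that differ from source to replica, in place.

    Both disks are modeled as lists of bytes blocks (an assumption for
    illustration). Returns the indices of the blocks transferred."""
    transferred = []
    for index, block in enumerate(source_blocks):
        if index >= len(replica_blocks):
            replica_blocks.append(block)          # disk grew: new block
            transferred.append(index)
        elif replica_blocks[index] != block:
            replica_blocks[index] = block         # block changed since last run
            transferred.append(index)
    del replica_blocks[len(source_blocks):]       # drop blocks the source no longer has
    return transferred

source = [b"boot", b"data-v1", b"logs"]
replica = []
print(incremental_sync(source, replica))   # initial full copy
source[1] = b"data-v2"                     # one block changes on the source
print(incremental_sync(source, replica))   # only the changed block is sent
```

The first run transfers every block, while the second run transfers only the block that changed.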
When disaster strikes, the failover job is run and the workloads of the primary site are transferred to the DR site. With NAKIVO Backup & Replication, you can configure network mapping and re-IP rules once and apply them to multiple VMs. Thus, you don’t need to manually enter the network settings for each VM. Source VM virtual networks are automatically mapped to the appropriate target virtual networks, and source VM IP addresses are automatically mapped to the right target IP addresses during the DR process.
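The "configure once, apply to many VMs" idea can be sketched as below. The network names, addresses, and the rule format are hypothetical, and the sketch assumes that a re-IP rule keeps the host part of an address and swaps only the network part.

```python
import ipaddress

def apply_network_mapping(vm_networks, mapping):
    """Map each source virtual network name to its target network name."""
    return [mapping[name] for name in vm_networks]

def apply_re_ip(address, rules):
    """Translate a source IP using the first matching (source, target) rule.

    Assumption: the host part of the address is preserved and only the
    network part changes. Addresses matching no rule are returned as-is."""
    ip = ipaddress.ip_address(address)
    for source_net, target_net in rules:
        src = ipaddress.ip_network(source_net)
        dst = ipaddress.ip_network(target_net)
        if ip in src:
            host_part = int(ip) - int(src.network_address)
            return str(dst.network_address + host_part)
    return address

# Hypothetical rules, defined once and reused for every VM in the job.
network_mapping = {"Prod-LAN": "DR-LAN", "Prod-DMZ": "DR-DMZ"}
re_ip_rules = [("192.168.1.0/24", "10.30.30.0/24")]

vms = {"web01": ("Prod-DMZ", "192.168.1.15"), "db01": ("Prod-LAN", "192.168.1.20")}
for name, (network, ip) in vms.items():
    target_net = apply_network_mapping([network], network_mapping)[0]
    print(name, target_net, apply_re_ip(ip, re_ip_rules))
```

Because the rules live outside the per-VM loop, adding a VM to the job requires no extra network configuration.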
Automated failover as a part of Site Recovery
With NAKIVO Backup & Replication, you can create site recovery jobs, which represent custom-tailored workflows for automation and orchestration of the DR process. A single site recovery job can include various actions and conditions (start/stop VMs, failover/failback, run/stop/enable/disable jobs, etc.) to accommodate the DR objectives of your organization. It is worth noting that a complex site recovery (SR) job should include Automated Failover, which allows you to automate your DR activities by configuring them beforehand.
Now, let’s walk through how to create a VM failover job in NAKIVO Backup & Replication. As mentioned above, the first requirement for successful failover is to have previously created VM replicas.
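Conceptually, such a workflow is an ordered list of actions, each with a rule for what happens if it fails. The sketch below is a generic illustration of that idea; the step names and the stop-on-failure flag are assumptions and do not reflect the product's actual job format.

```python
def run_workflow(actions):
    """Run (name, func, stop_on_failure) steps in order.

    Each func returns True on success. Returns a log of (name, status)."""
    log = []
    for name, func, stop_on_failure in actions:
        ok = func()
        log.append((name, "ok" if ok else "failed"))
        if not ok and stop_on_failure:
            log.append(("workflow", "aborted"))
            break
    return log

# Hypothetical steps; the lambdas stand in for real operations.
workflow = [
    ("stop replication job", lambda: True, True),
    ("fail over db01",       lambda: True, True),
    ("fail over web01",      lambda: True, True),
    ("send notification",    lambda: False, False),  # non-critical step
]
for step in run_workflow(workflow):
    print(step)
```

Marking only critical steps as stop-on-failure lets the workflow tolerate minor problems, such as a failed notification, without abandoning the recovery.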
Open your browser and access the NAKIVO Backup & Replication web interface. On the main page, click the Recover option and select VM failover to replica from the drop-down menu.
As you can see, the VMware failover job wizard includes four steps: Source, Networks, Re-IP, and Options.
1. Source. Here, you choose the protected VM and the recovery point that you want to use for failover. The left-hand box of the Source section contains the list of available VM replication jobs from which you can choose. In the right-hand panel, you see the list of relevant VM replicas, which can be sorted according to processing priority by dragging them up or down. Click Next to proceed further.
2. Networks. This section enables network mapping, which ensures that VMs are connected to the right network upon failover. After checking the box, you have two options: Create new mapping or Add existing mapping.
If you choose Create new mapping, a pop-up menu is launched in which you can configure the source VM virtual networks and match them with appropriate target virtual networks. Click Save to save your settings.
If you choose Add existing mapping, the Network Mappings dialog opens, where you can choose an already-created network mapping (e.g. one you set up for another job).
You can add multiple Network Mapping rules to one failover job.
3. Re-IP. Here, you can enable Re-IP by ticking the corresponding checkbox. Re-IP automatically assigns new IP addresses to your replicas at the DR location. As in the previous step, you have two options: Create new rule or Add existing rule.
If you choose Create new rule, a pop-up menu is launched, where you can configure the source settings (IP address and subnet mask) and target settings (IP address, subnet mask, default gateway, primary DNS server, secondary DNS server, and DNS suffix). Click Save to save your settings.
If you choose Add existing rule, the Re-IP Rules dialog opens, where you can choose from the Re-IP rules that have already been created.
You can add multiple Re-IP rules to one failover job.
NAKIVO Backup & Replication provides the ability to store OS logins and passwords as well as Amazon EC2 instance private keys. For this purpose, click Select VMs and tick the checkbox corresponding to the VM replica for which you want to configure Re-IP. Then click Manage Credentials.
Click Add Credentials.
Enter the username and password for the OS inside the VM, then click Save to proceed.
Finally, select the credentials that you have just created and click Next.
4. Options. In the final section, you can give the job a name and configure the failover job options. If you select Power off source VMs, the source VMs that were used to create the replicas are powered off.
You can also set pre- and post-job actions, such as Send job run reports to, Run local pre-job script, or Run local post-job script.
Enabling the Send job run reports to option means that a report is emailed to the designated recipients every time the failover job is completed.
To run a script before the failover job starts or once the failover job is complete, click Run local pre-job script or Run local post-job script. A pop-up menu with three parameters is launched.
Script path is where you input the local path to the script on the machine with the Director installed. This step works the same for both pre- and post-job scripts.
Job behavior defines whether the job waits for the script to complete:
- Wait for the script to finish: VM failover is not started until the script is completed (pre-job script).
The job is in the “running” state until the script is completed (post-job script).
- Do not wait for the script to finish: the script and VM failover start at the same time (pre-job script).
The job can be completed even if the script execution is still in progress (post-job script).
Error handling is where you specify the job behavior in the event of script failure:
- Continue the job on script failure: the job proceeds and attempts VM failover even if the script has failed (pre-job script).
Script failure does not influence the status of the job (post-job script).
- Fail the job on script failure: if the script fails, the job is abandoned, and VM failover is not performed (pre-job script).
If the script fails, the job status is set to “failed” in reports even if VM failover was successful (post-job script).
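The interaction between a pre-job script and the job can be sketched as follows. This illustration covers only the "wait for the script to finish" behavior combined with the two error-handling options; the function name and the stand-in scripts are assumptions, and the failover itself is represented by a placeholder callable.

```python
import subprocess
import sys

def run_failover_job(pre_script_cmd, failover, fail_on_script_error=True):
    """Wait for a pre-job script, then run failover unless the script
    failed and the job is configured to fail on script failure.

    pre_script_cmd is an argument list for subprocess.run; failover is a
    hypothetical callable standing in for the real failover operation."""
    script = subprocess.run(pre_script_cmd)
    if script.returncode != 0 and fail_on_script_error:
        return "failed"      # job abandoned, failover not performed
    failover()               # script succeeded, or failures are tolerated
    return "completed"

# Stand-in script: a Python one-liner that exits with a non-zero code.
failing_script = [sys.executable, "-c", "import sys; sys.exit(1)"]
performed = []

print(run_failover_job(failing_script, lambda: performed.append("strict")))
print(run_failover_job(failing_script, lambda: performed.append("lenient"),
                       fail_on_script_error=False))
print(performed)
```

With the strict setting, the failing script prevents failover entirely; with the lenient setting, failover proceeds despite the script error.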
Finally, you can click Finish or Finish & Run to complete the failover job creation. Note that the Finish & Run option allows you to select the VMs for which to run the job.
Other DR automation features
NAKIVO Backup & Replication includes a set of other features that help you further automate the DR process:
- Command Line Interface
The command line interface (CLI) is a text-based interface from which you can manage software and operating systems. You enter commands one line at a time and receive responses within the CLI. With NAKIVO Backup & Replication, you can use scripts to trigger certain actions automatically. The CLI can be used both locally and remotely.
The CLI supports the following actions:
- View status of all jobs
- View status of a single job
- Start/stop a job
- Disable/enable a job
- Create a job report
- View status of all backup repositories
- View status of a single backup repository
- Update a repository
- Update all repositories
- Detach/attach a backup repository
- Create a support bundle
- View/replace the current license
- Disable/enable a tenant
- Get the current license information
- Pre- and Post-Job Scripts
With NAKIVO Backup & Replication, you can set up a specific script to run before a job is started or after the job is completed. Pre- and post-job scripts can be configured in the Job Wizard (these were discussed above in the failover job configuration walkthrough). As mentioned, you can configure a job to wait for the script to finish or to run in parallel. You can also determine whether the job should continue running or fail in the event of script failure.
- HTTP API for Automation
NAKIVO Backup & Replication features an HTTP API (application programming interface) that enables you to automate and orchestrate VM backup, replication, and recovery jobs. The API is a set of communication protocols and tools used for managing interconnections between software programs or system components. With NAKIVO Backup & Replication, you can use the HTTP API feature to automate data protection tasks, such as:
- Create/edit/delete a job
- Create/edit/delete a tenant
- Run VM backup/VM replication
- Recover a VM
- Check health state
The API uses the HTTP protocol and JSON format to create requests and responses; you can trigger actions by sending a JSON with the corresponding parameters. This feature can be used to integrate NAKIVO Backup & Replication with other existing software you are using, or to create your own software based on NAKIVO Backup & Replication functionality.
The HTTP API for Automation is available only in the Enterprise edition of NAKIVO Backup & Replication.
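As a rough illustration of the JSON request/response pattern described above, the sketch below builds a request body and parses a response. The field names, the action and method values, and the sample response are placeholders invented for this example; consult the product's API reference for the actual schema and endpoint.

```python
import json

def build_request(action, method, params, request_id=1):
    """Serialize an API call as a JSON string.

    The field names ("action", "method", "data", "id") are placeholders,
    not the product's documented schema."""
    body = {"action": action, "method": method, "data": params, "id": request_id}
    return json.dumps(body)

# Hypothetical call asking the server to run a job with ID 42.
payload = build_request("JobManagement", "run", {"jobId": 42})
print(payload)

# A made-up response, parsed the same way on the way back.
response = json.loads('{"id": 1, "result": {"status": "started"}}')
print(response["result"]["status"])
```

In practice, the payload would be sent in the body of an HTTP POST request, and the status in the response would decide the next step of the automation script.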
- Automated Verification
To check the state of a VM, the ideal solution would be to run a test recovery. However, full VM recovery can take time and resources, potentially putting excessive load on your systems. With NAKIVO Backup & Replication, VM backups and VM replicas are automatically and instantly verified. The product can send you an email report with a screenshot of each test-recovered VM, which serves as proof of recoverability.
This blog post has described automatic failover and its role in the DR process. Disaster can affect businesses of any scale and undermine their success in the long run. To stay competitive in the modern business environment, a company is expected to provide uninterrupted service. To this end, automated DR solutions have been introduced, NAKIVO Backup & Replication being a prominent example.
NAKIVO Backup & Replication is a fast and reliable solution that protects virtual environments and ensures fast recovery of damaged VMs. With the product’s Automated VM Failover feature, you can transfer production site workloads to a DR site in just a few clicks. This reduces downtime and helps you avoid incurring losses. Failover in NAKIVO Backup & Replication is a fully automated process, incorporating network mapping and re-IP rules. These features automate VM network configuration during disaster recovery, when every second counts.
Download the Free Trial of NAKIVO Backup & Replication to see the benefits of Disaster Recovery Automation for yourself.