September 17, 2018
Site Recovery with NAKIVO Backup & Replication Part 4: Performing Site Recovery Testing
The previous blog post in our series on Site Recovery explained the creation of a logical site recovery workflow and provided a walkthrough of Site Recovery job configuration. Once you have built your site recovery plan and set up the corresponding Site Recovery jobs, don’t forget about testing them. Testing helps you ensure that you are ready for recovery when disaster strikes and that all selected components can be recovered successfully in an appropriate time frame. NAKIVO Backup & Replication provides a testing option for your Site Recovery jobs, i.e., you can run any Site Recovery job in test mode.
Why Do You Need Testing?
Site recovery testing is an important part of preparation for disaster recovery. Testing increases the probability of fast and successful recovery in the event of disaster. You need testing for the following reasons:
- To make sure that everything can be recovered successfully. Suppose you have developed a site recovery plan, then configured a Site Recovery job accordingly, but have not tested it. This could result in the following scenario: when disaster strikes and the time comes to run your site recovery job, the job fails and some of your virtual machines cannot be recovered. This would require you to spend much more time restoring functionality to your virtual infrastructure (e.g., you might have to restore from backups and manually implement changes that have been made). When you test your site recovery plan and discover some things going wrong, you can fix the issues before they cause serious problems in a real crisis scenario.
- To make sure that RTO values can be met. Your Site Recovery job might complete successfully, but take longer than your RTO (recovery time objective) allows. This could have a negative impact on your business processes. Site recovery testing allows you to check whether your workloads can be recovered within the relevant RTOs. A site recovery test can be run manually on demand or automatically on a schedule, which makes the process painless and saves you time.
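The RTO check described above comes down to comparing a measured recovery time against a target. The following is only a minimal sketch of that comparison in Python; the workload names and durations are made-up illustration data, and nothing here calls NAKIVO Backup & Replication itself.

```python
from datetime import timedelta


def meets_rto(measured: timedelta, rto: timedelta) -> bool:
    """Return True when the measured recovery time is within the RTO."""
    return measured <= rto


# Hypothetical test results: workload name -> (measured duration, RTO).
results = {
    "web-tier": (timedelta(minutes=12), timedelta(minutes=15)),
    "database": (timedelta(minutes=40), timedelta(minutes=30)),
}

# Workloads whose test recovery took longer than their RTO.
missed = [
    name
    for name, (measured, rto) in results.items()
    if not meets_rto(measured, rto)
]
print(missed)
```

A workload that appears in `missed` recovered successfully but too slowly, which is exactly the case a test run is meant to reveal before a real disaster.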
The Differences Between Test Failover and Production Failover
The failover action is crucial for most site recovery workflows. The mechanism of executing a failover differs depending on whether the Site Recovery job is run in test or production mode. A breakdown of the steps for each mode is shown in the table below.
| Step | Production (emergency) failover | Test failover |
|---|---|---|
| 1 | Disable replication from source VM to the replica | Disable replication from source VM to the replica |
| 2 | Roll the VM replica back to a certain recovery point (RP) (optional; last RP is used by default) | Run a single incremental replication from source VM to the replica |
| 3 | Connect the VM replica to a new network with Network Mapping (optional) | Connect the VM replica to an isolated network with Network Mapping (optional) |
| 4 | Modify static IP address of the replica with Re-IP (optional) | Modify static IP address of the replica with Re-IP (optional) |
| 4A | Power off the source VM (optional) | --- |
| 5 | Power on the replica | Power on the replica |
| 6 | Switch the replica to "Failover" state | Switch the replica to "Failover" state |
As you can see, the second and third steps differ between the production and test workflows, and step 4A (powering off the source VM) applies only to production mode. Step 2 differs because, in test mode, the source VM is still running and an incremental replication can be run from it. In most cases, when disaster strikes, the source VM no longer works and replication cannot be performed. Step 3 differs because the networks for VM connection can be defined separately for production mode and test mode in the Network Mapping options when configuring a Site Recovery job (see the previous blog post).
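To make the comparison concrete, the two step sequences can be modeled in a few lines of Python. This is an illustrative sketch only: the function name `failover_steps` and its parameters are hypothetical, and the step texts are paraphrased from the table rather than taken from any product API.

```python
from typing import List


def failover_steps(test_mode: bool, power_off_source: bool = False) -> List[str]:
    """Build the ordered failover steps for test or production mode."""
    steps = ["Disable replication from source VM to the replica"]
    if test_mode:
        # The source VM is still running, so it can be replicated once more.
        steps.append("Run a single incremental replication from source VM")
        steps.append("Connect the replica to an isolated network (optional)")
    else:
        # The source VM is assumed unavailable; use an existing recovery point.
        steps.append("Roll the replica back to a recovery point (optional)")
        steps.append("Connect the replica to a new network (optional)")
    steps.append("Modify static IP address of the replica with Re-IP (optional)")
    if not test_mode and power_off_source:
        steps.append("Power off the source VM")  # step 4A, production only
    steps.append("Power on the replica")
    steps.append('Switch the replica to "Failover" state')
    return steps


production = failover_steps(test_mode=False)
test = failover_steps(test_mode=True)

# 1-based positions where the two workflows differ.
differing = [i + 1 for i, (p, t) in enumerate(zip(production, test)) if p != t]
print(differing)
```

With the optional source power-off left disabled, the only positions that differ are the second and third, which matches the table.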
Failover test cleanup is performed after execution of a Site Recovery job in test mode. The VM replica is powered off and reverted to its pre-failover state via snapshot (a snapshot of a VM replica is taken before performing a failover action). The replica is then switched from failover state to its normal state, and replication from the source object to the replica is re-enabled.
In order to perform effective site recovery testing, emulate different points of failure and test your site recovery plan regularly.
Emulating Different Points of Failure for Testing
Simulate situations where different components of your environment fail. You can emulate, for example, failure of a network, failure of different VMs, failure of entire ESXi hosts, failure of a vCenter server, or failure of one or more storage devices. Check whether your disaster recovery plan is workable for all the different situations that could reasonably arise. If not, then create another disaster recovery plan to suit the specific scenario that isn’t covered. This way, you can have disaster recovery plans custom tailored for certain situations.
Testing Your Site Recovery Plan Regularly
Infrastructure can change over time: VMs can be added, roles can be migrated from one VM to another, and the network configuration may change. You should test a site recovery plan regularly to verify that it still works for your environment in its current state and meets your defined RTO values. If something goes wrong, update your site recovery plan accordingly or create a new one.
How to Test a Site Recovery Job in NAKIVO Backup & Replication
Now that you are familiar with the theory behind site recovery testing, you are ready to test your Site Recovery job in NAKIVO Backup & Replication. Let’s briefly address the key points of the testing functionality built into the product.
Checking the Actions Included in Testing
Review the logic of the actions added to your Site Recovery job. Check whether the actions are arranged in the appropriate order and ensure that they cannot form an infinite loop. You can edit Site Recovery job options when the job is not running. Change the order of actions, add actions, remove actions, or edit action options as necessary.
Check that your network works properly. A VPN connection can be used between a production site and a disaster recovery (DR) site, but this connection must not drop out periodically during normal operation. The network at the DR site must also work without disruptions. Check the Network Mapping and Re-IP settings you have used to configure failover and failback. If a VM is mapped to the wrong network, a network connection may not be established. The same is true for incorrect IP settings.
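One way to reason about the infinite-loop check is to treat it as cycle detection in a graph. The sketch below assumes, purely for illustration, that a job can contain an action that starts other jobs; the job names and the `triggers` mapping are hypothetical and are not read from the product.

```python
from typing import Dict, List, Optional


def find_cycle(triggers: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one chain of jobs that loops back on itself, or None."""
    path: List[str] = []  # jobs on the current depth-first search path
    done = set()          # jobs already proven to be loop-free

    def visit(job: str) -> Optional[List[str]]:
        if job in path:
            # The job is reached again from itself: report the loop.
            return path[path.index(job):] + [job]
        if job in done:
            return None
        path.append(job)
        for nxt in triggers.get(job, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        path.pop()
        done.add(job)
        return None

    for job in triggers:
        cycle = visit(job)
        if cycle:
            return cycle
    return None


# Hypothetical example: job A starts B, B starts C, and C starts A again.
looping = {"A": ["B"], "B": ["C"], "C": ["A"]}
print(find_cycle(looping))
```

If the function returns a chain, at least one of the listed jobs needs its actions reordered or removed before the test is run.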
Setting the Test Schedule
Site Recovery job testing can be scheduled in the Site Recovery job scheduling options. Open the web interface of your instance of NAKIVO Backup & Replication. In the left panel of the home page, right-click the name of your job and click Edit in the context menu. You can also rename, disable, delete, or run the job from this menu.
Click Test Schedule and define the scheduling settings. In the example used for the purposes of this walkthrough, the Site Recovery job test would be run every weekday at 2:00 AM.
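To see when such a schedule would fire next, the rule "every weekday at 2:00 AM" can be sketched in Python as follows. This is an illustration of the scheduling rule only, not of how the product implements it; the function name `next_weekday_run` is hypothetical.

```python
from datetime import datetime, time, timedelta


def next_weekday_run(now: datetime, run_at: time = time(2, 0)) -> datetime:
    """Return the next Monday-to-Friday occurrence of run_at after now."""
    candidate = datetime.combine(now.date(), run_at)
    if candidate <= now:
        # Today's slot has already passed; start from tomorrow.
        candidate += timedelta(days=1)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate += timedelta(days=1)
    return candidate


# Friday, September 21, 2018 at 3:00 AM: the next run falls on Monday.
print(next_weekday_run(datetime(2018, 9, 21, 3, 0)))
```

Running tests outside business hours, as in this example, keeps the extra replication and VM power-on activity away from peak load.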
Running a Test On Demand
If you want to check something right away, you don’t have to wait until the Site Recovery job test runs on schedule. You can run a Site Recovery job in test mode manually. Simply go to the product’s home page, select your Site Recovery job by name, click Run Job, and then click Test site recovery job.
Set your RTO and click Test.
The Site Recovery job test is now running. You can see the total progress bar and the progress bar for each running action. Wait for the test to complete.
Reviewing Test Results
When the test is finished, you can view the results. Click the name of the tested job whose results you would like to check. In this case, our Site Recovery job test has completed successfully. You can see the details in the Events section.
You can also see that another Site Recovery job test has failed. The red exclamation point icon indicates test failure. Refer to the Events section for details to identify the source of the failure. In this case, the red highlight shows us that the email could not be sent. Check the network settings, review the configuration of the Send email action, and check whether the address for the email group is valid. Fix the issues once you have identified them and then run the test again.
Site recovery plan testing is an important process that helps you ensure that your site recovery plan is workable. Test the Site Recovery jobs you create in NAKIVO Backup & Replication in order to check whether everything can be recovered successfully. Testing also lets you determine whether your VMs can be recovered quickly enough to meet your RTO values. Regular site recovery testing is recommended; you can use the product’s flexible scheduling options to automate running tests when it is convenient for you. Use the testing feature to ensure that there are no surprises when disaster strikes and that your virtual environment can be recovered as planned.
Read the other blog posts in this series about Site Recovery for more information on site recovery planning and Site Recovery job creation. The next blog posts in this series will explore failover and failback actions, which are used for migrating the workloads from a production site to a disaster recovery site.