January 3, 2019
An Overview of Disaster Recovery Testing Scenarios
Modern businesses are expected to operate 24/7. Even a minor delay in business operations or service provision can undermine an organization’s credibility and result in significant losses. Multiple factors can lead to business downtime, the primary one being a disaster, which tends to strike when you least expect it. Therefore, to stay competitive and ensure business continuity, organizations should design an efficient disaster recovery (DR) plan and test it regularly. This blog post lists the factors worth considering before testing a DR plan and describes how going through DR testing scenarios can help you prepare for disaster recovery.
What Is a DR Plan?
Generally, a disaster is impossible to predict and arrives unexpectedly. Therefore, an organization interested in high availability should design a DR plan. A DR plan is a documented set of tasks and procedures to be implemented when a disaster affects an organization’s IT infrastructure. Its main purpose is to minimize the negative impact of a DR event and limit the damage it causes. A comprehensive DR plan dictates what actions to take before, during, and after a disaster.
Disasters fall into two types: natural (tornadoes, hurricanes, floods, etc.) and man-made (server errors, failed updates, hacker attacks, etc.). Your DR plan should be based on the risks and threats to which your organization is most prone. Moreover, the operations and applications that are most critical for conducting your business should be identified and given the highest priority in the recovery order. By reviewing such factors beforehand, you help ensure that your DR plan can address the issues that might arise during an actual DR event.
Factors to Consider Before Testing a DR Plan
After you have created a DR plan, you should be ready to test it. Even if you are sure that you have designed an efficient and comprehensive DR plan, you should verify that everything works as planned and identify any issues beforehand. However, before putting your DR plan to the test, there are several factors worth considering to ensure the success of the process: test assumptions, test scope, and test success criteria.
Test assumptions
The initial step in preparing for testing is to define your test assumptions. Prior to DR testing, the recovery team should discuss which direction to take to achieve optimal results. Essentially, test assumptions provide the basis on which the process of DR testing is built. Comprehensive test assumptions include the following:
- The risks and threats that your organization is most exposed to, and respective response mechanisms to test out
- DR testing scenarios to implement and the reasoning behind this choice
- Pre-test conditions and circumstances required for conducting DR testing
- Post-test conditions and circumstances which must be met at the end of the test
- The results expected to be achieved after the testing process
Test scope
Another important factor to consider is test scope, which outlines the areas to cover during the testing process. The recovery team should clearly establish which system components and functions will be tested and then notify the staff of the systems that will be involved in DR testing. The recovery team should also define the limitations and exclusions of the testing process, so that everyone knows exactly what will and will not be tested and confusion is avoided.
Test success criteria
Test success criteria determine when the DR testing process can be considered successful. By reviewing test results, you can determine whether your expectations have been met and which areas require improvement. DR testing is generally considered successful if the DR plan proved its functionality and validity. However, if weaknesses of the DR plan are identified as a result of DR testing, this can also be considered a success, because the recovery team can then improve the DR plan by developing countermeasures and fixing its flaws. Moreover, test success criteria allow the staff to evaluate their performance during DR testing and improve the organization’s disaster response mechanisms.
Therefore, it is important to document every step of the process and determine test assumptions, test scope, and test success criteria in advance to be prepared for any unexpected issues and act accordingly.
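One way to document these three factors is to keep them in a single record that can be checked before the test starts. The following is only a minimal sketch: the DRTestPlan class, its field names, and the sample systems are all hypothetical and should be adapted to whatever your recovery team actually tracks.

```python
from dataclasses import dataclass, field


@dataclass
class DRTestPlan:
    """Hypothetical record of what the recovery team agrees on before a DR test."""
    scenario: str                 # the DR testing scenario being exercised
    assumptions: list[str]        # pre-test conditions, expected results, etc.
    in_scope: set[str]            # systems that will be tested
    out_of_scope: set[str]        # explicit exclusions
    success_criteria: list[str]   # statements to check once the test ends
    findings: list[str] = field(default_factory=list)  # weaknesses found during the test

    def validate(self) -> list[str]:
        """Return a list of problems that should be fixed before the test starts."""
        problems = []
        if not self.assumptions:
            problems.append("no test assumptions recorded")
        if not self.in_scope:
            problems.append("test scope is empty")
        overlap = self.in_scope & self.out_of_scope
        if overlap:
            problems.append(f"systems both in and out of scope: {sorted(overlap)}")
        if not self.success_criteria:
            problems.append("no success criteria defined")
        return problems


# Illustrative values only.
plan = DRTestPlan(
    scenario="Primary file server fails",
    assumptions=["Latest replica is less than 24 hours old"],
    in_scope={"file-server", "domain-controller"},
    out_of_scope={"mail-server", "file-server"},  # deliberate mistake for the demo
    success_criteria=["File shares reachable from the DR site"],
)
print(plan.validate())
```

In this example, validate() reports that file-server is listed as both in scope and excluded, which is exactly the kind of confusion that defining the scope in advance is meant to prevent.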
What Is a DR Testing Scenario?
It is not practical to test all components of your DR plan without prior preparation, as conducting DR testing can be a daunting task. To ensure that your DR plan performs successfully during a DR event, you should check how your organization would respond to a specific emergency event. For this purpose, a DR testing scenario can be used. A disaster scenario can be created by the recovery team, which takes into account all aspects of your organization, or you can apply one of the ready-to-use DR scenario templates available online.
A typical DR testing scenario describes a DR event, its circumstances, and how it has affected the organization in question. By simulating a DR event, you can evaluate your organization’s preparedness for the DR process and identify better ways to respond to and recover from an actual disaster (natural or man-made).
Types of DR Testing Scenarios
DR testing scenarios cover multiple emergency situations and disaster events, which can affect your organization’s performance in one way or another. Let’s take a closer look at what those DR testing scenarios represent.
Disruption of operations
Most organizations are complex systems whose components are highly interdependent. Therefore, if one of those components fails, the whole system is at risk of disruption. DR testing scenarios covering a wide variety of operational issues should be designed. For this purpose, think of any critical operation or process and the DR event that might negatively affect or damage it. This type of DR scenario generally includes any emergency that might disrupt the organization’s operations. Examples of operation-related DR events include a fire or explosion in the production center, failure of a major assembly line due to malfunctioning software, and workflow interruptions due to human error.
Technology failure
If most of your operations run in a virtual server environment, simulating technology-related DR scenarios should be your main priority. In case of system failure, it can take some time before business operations are resumed. Therefore, it is essential to design a DR testing scenario reflecting the technological issues that can significantly affect the performance of your organization. Such issues might include server failure, disrupted network connectivity, software glitches, data loss, or an inability to access the backups.
Loss of key staff
The staff is an essential part of any organization, as employees are the first to face and respond to an emergency. Management should form a recovery team responsible for conducting and monitoring the DR process from start to finish. However, some members of the recovery team, particularly those who have critical knowledge of the DR procedures, might get sick or quit. Therefore, you should consider the possible repercussions of such a loss and prepare a DR testing scenario that covers this issue. Possible scenarios include a staff strike, employee sabotage, a flu epidemic, or hacking by a dismissed and disgruntled employee.
Natural disasters
Natural disasters, such as tornadoes, hurricanes, or earthquakes, can affect people and physical property, as well as an organization’s infrastructure. They are generally unexpected, and the damage they can cause is hard to predict. Therefore, consider the geographical location of your production center and identify the risks and threats to which this area is most exposed. Based on this, you can design the DR testing scenario that is most suitable for your organization. Examples of natural-disaster scenarios include an ice storm damaging communications infrastructure, an earthquake destroying the production center, and floods causing transportation problems.
Business-related events
Business-related DR scenarios should be specifically designed for your organization, meaning that you first need to define how your business works and which critical components ensure its continuity. To identify which areas need a higher level of protection, perform a business impact analysis (BIA), which evaluates the most critical business operations and the effect of their interruption. Based on this, management can identify the most likely risks and design a corresponding DR scenario. Such DR scenarios typically include a stock market crash, data leaks, loss of customers to competitors, or insolvency of key suppliers.
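To make the idea of a BIA more concrete, the sketch below ranks a few operations by a very simple impact estimate: the cost of one hour of downtime multiplied by the expected length of the outage. This formula, the Operation class, and all figures are illustrative assumptions, not a standard BIA method; a real analysis would weigh many more factors.

```python
from dataclasses import dataclass


@dataclass
class Operation:
    name: str
    loss_per_hour: float      # estimated cost of one hour of downtime (illustrative)
    expected_outage_h: float  # how long the scenario would keep it down (illustrative)

    @property
    def impact(self) -> float:
        # Simplified estimate: hourly loss times expected outage length.
        return self.loss_per_hour * self.expected_outage_h


def recovery_order(operations: list[Operation]) -> list[str]:
    """Return operation names sorted by estimated impact, highest first."""
    ranked = sorted(operations, key=lambda op: op.impact, reverse=True)
    return [op.name for op in ranked]


operations = [
    Operation("internal wiki", loss_per_hour=50, expected_outage_h=8),
    Operation("order processing", loss_per_hour=2000, expected_outage_h=4),
    Operation("email", loss_per_hour=300, expected_outage_h=6),
]
print(recovery_order(operations))
# ['order processing', 'email', 'internal wiki']
```

The operations at the top of the list are the ones that most deserve a dedicated DR scenario and the highest priority in the recovery order.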
Off-scale events
As discussed above, various DR events can affect organizations from time to time. However, you should also be prepared to respond to off-scale events. The probability that such an event will happen is extremely low, but the staff should still be aware of these events and know how to react if one occurs. Thus, you should create a DR testing scenario that includes emergency situations such as a plane crashing into the production center, a volcanic eruption, or civil strife.
The Importance of Testing a DR Plan
Even the most carefully thought-out DR plan can’t be proven valid until it is tested. Testing a DR plan allows you to identify flaws and inconsistencies in your DR strategy, so that possible damage can be anticipated and reduced before an actual disaster occurs. For this reason, reviewing your DR plan in the context of DR testing scenarios is highly advisable.
The simplest approach is for the recovery team to go through all steps of the designed plan and discuss them in detail, which incurs no expense and is easy to conduct. However, this testing method provides only a basic view of how the DR process would go, as no system components are actually tested. Alternatively, a full-scale simulation test can be executed, which is a more expensive and complex activity as it entails testing all components of the DR plan in the actual working environment. Even though it might disrupt the production process, this way of testing allows you to see the ability of your staff to respond to various kinds of DR scenarios and verify the validity of your DR plan. Thus, you can test your organization’s DR plan regularly by applying various DR scenarios in order to refine it and ensure that even an unexpected disaster won’t set you back.
Site Recovery Testing in NAKIVO Backup & Replication
To ensure that your system is properly protected and can be easily and promptly recovered, having a DR plan is not enough. The organization should also have powerful backup and replication software installed to ensure a seamless DR process. NAKIVO Backup & Replication addresses this with its Site Recovery feature, which is designed to accommodate the DR needs of a business. Let’s see what the Site Recovery functionality offers.
How to test a site recovery job
With NAKIVO Backup & Replication, you can create a Site Recovery workflow (i.e., an SR job) which includes a number of actions and conditions, such as failover, failback, start/stop VMs, run/stop jobs, attach/detach repository, and others, arranged in the order of your choice. An SR job is an automated sequence of steps that lets you design a recovery process of any scale.
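The general idea of a workflow made of ordered actions can be illustrated with a few lines of Python. This is a conceptual sketch only and does not reflect how NAKIVO Backup & Replication is implemented: the run_workflow function, the action labels, and the choice to stop at the first failed step are all assumptions made for the example.

```python
from typing import Callable

# Each action is a label plus a step that returns True on success.
Action = tuple[str, Callable[[], bool]]


def run_workflow(actions: list[Action]) -> list[str]:
    """Run the steps in order and stop at the first failure; return a log."""
    log = []
    for label, step in actions:
        ok = step()
        log.append(f"{label}: {'ok' if ok else 'FAILED'}")
        if not ok:
            break  # assumption: later steps depend on earlier ones
    return log


# Placeholder steps; real steps would call into your own tooling.
workflow = [
    ("stop backup job", lambda: True),
    ("fail over VMs", lambda: True),
    ("start VMs at DR site", lambda: False),  # simulated failure
    ("send notification", lambda: True),
]
for line in run_workflow(workflow):
    print(line)
```

Running the sketch logs the first two steps as ok and the third as FAILED, and the notification step is never reached. A test run exposes this kind of ordering and dependency problem before a real DR event does.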
NAKIVO Backup & Replication lets you modify, supplement, or test SR jobs at any time, without affecting the production environment. Minimal configuration is required on your part. After that, the process is completely automated and can run on schedule or on demand. An SR job can be run in production mode or in test mode. For the purpose of this blog post, let’s discuss how to test an SR job with NAKIVO Backup & Replication.
The procedure is easy to implement. However, to test an SR job on demand, you first need to make sure that you already have an SR job, or else create one. After that, follow the steps below:
1. In the Jobs dashboard, select the SR job that you want to test and then click Run Job.
2. A dialog opens with two options: Test site recovery job and Run site recovery job. Click Test site recovery job.
3. A new dialog opens where the recovery time objective (RTO) can be configured. RTO is the maximum tolerable period of downtime within which your system is expected to be restored in order to prevent significant losses. In this dialog, you can either disable or enable the Recovery time objective option. If it is enabled, set the value of the recovery time objective, which defines the amount of time allowed for the SR job test to be completed.
4. Click Test to start the job.
Note that SR job testing can also run on schedule. The Test Schedule option can be configured when creating a new SR job.
Another way of setting up the test schedule is available for previously created SR jobs. In this case, go to the left panel of the home page and right-click the SR job for which you want to configure the test schedule. A pop-up menu appears with a variety of job management options, such as Run Job, Rename, Edit, Delete, and Disable. Click Edit.
After that, click the Test Schedule section and enter the scheduling settings of your choice. The menu is identical to the one in the New Site Recovery Job Wizard.
Thus, you can set up an SR job to run periodic testing based on the schedule that is most suitable for your organization.
Advantages of Site Recovery Testing
If you are not sure whether your organization is fully protected against possible risks and threats and want to verify the validity of your DR strategy, SR job testing is the way to go. Apart from being a flexible and reliable testing method, SR job testing provides a number of additional benefits, which are presented below.
Checking the test schedule
As mentioned above, NAKIVO Backup & Replication lets you run SR job testing on schedule. The jobs can be run on a daily, weekly, monthly, or yearly basis, depending on the priority and complexity of an SR workflow. Using the test schedule makes the process fully automated and saves effort and time.
Running a test on demand
If you want to check something in particular, NAKIVO Backup & Replication allows you to test an SR job on demand, without waiting for an SR test job to run on schedule. Moreover, on-demand testing provides you with full control over the Site Recovery functionality. This way, you can quickly start an SR test job whenever the need arises to ensure that everything goes as planned and the system can be recovered.
Analyzing test results
After the test job has been completed, a report containing the results of the job can be sent via email. For this purpose, when creating an SR job, enable the option Send test/Run report to in the Options section and enter the email address to which the report will be sent. Alternatively, you can right-click the name of the test job and choose the Site Recovery Job report option. The report with results will be downloaded straight to your computer. After receiving the test report, you can analyze the results, identify underlying issues, and fix them to prevent problems in the future.
Setting Network Mapping and Re-IP rules
Failover and failback are essential components of the DR process, during which the production workload is transferred from the production site to a DR site (failover) and vice versa (failback). However, the production site and the DR site can have different network and IP parameters. If the SR job has invalid network and IP values, the connection between the source site and the target site won’t be established, thus disrupting the whole process.
With NAKIVO Backup & Replication, you can configure Network Mapping and Re-IP rules in the Site Recovery Job Wizard by specifying the appropriate network settings and IP settings. Network Mapping rules ensure that the source VM virtual networks are mapped to the chosen target virtual networks, and Re-IP rules ensure that the source VM IP addresses are mapped to the chosen target IP addresses. Thus, SR job testing verifies that Network Mapping and Re-IP rules have been configured correctly and that your production workload can be failed over and failed back during a DR event.
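To show what an IP remapping rule does conceptually, the sketch below translates an address from a source network to the same host position in a target network, using Python’s standard ipaddress module. The re_ip function and the sample networks are hypothetical and only illustrate the idea; the product’s own Re-IP rules are configured in the wizard and may follow a different format.

```python
import ipaddress


def re_ip(source_ip: str, source_net: str, target_net: str) -> str:
    """Map an address to the same host position in the target network."""
    src_net = ipaddress.ip_network(source_net)
    dst_net = ipaddress.ip_network(target_net)
    addr = ipaddress.ip_address(source_ip)
    if addr not in src_net:
        raise ValueError(f"{source_ip} is not in {source_net}")
    if src_net.prefixlen != dst_net.prefixlen:
        raise ValueError("source and target networks must be the same size")
    # Keep the host part of the address and swap the network part.
    offset = int(addr) - int(src_net.network_address)
    return str(dst_net.network_address + offset)


# Illustrative networks: production site to DR site.
print(re_ip("192.168.1.25", "192.168.1.0/24", "10.30.30.0/24"))
# 10.30.30.25
```

A rule that points at the wrong source network raises an error here instead of silently producing an unreachable address, which mirrors why invalid network and IP values are worth catching during a test rather than during a real failover.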
Reviewing the actions in site recovery jobs
Testing an SR job allows you to check whether the actions are arranged in the appropriate order and performed as planned. If you notice that the recovery process has not been executed in the manner you expected, you can modify the settings of the SR job when the job is not running. For this purpose, find the name of the job you want to configure in the left panel of the home page, right-click it, and select the necessary option in the pop-up menu that appears. The options include Run Job, Rename, Edit, Delete, and Disable. You can also change the order of actions, add or remove actions, and so on. With NAKIVO Backup & Replication, you can modify SR jobs to match your needs.
Verifying RTO values
With SR job testing, you can check whether your RTO is achievable. As mentioned above, you can set the RTO and verify whether your target can be met within the stipulated timeframe or whether it should be modified. This way, you will know the time needed for your system to be recovered, find ways to improve the RTO, and eventually reduce downtime costs.
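The check itself is simple arithmetic: compare how long the test took with the RTO you set. The snippet below is a minimal sketch of that comparison; the rto_met function and the timestamps are made up for illustration and are not taken from any product report.

```python
from datetime import datetime, timedelta


def rto_met(started: datetime, finished: datetime, rto: timedelta) -> bool:
    """Return True if the test finished within the recovery time objective."""
    return finished - started <= rto


# Illustrative test run that took 47 minutes.
start = datetime(2019, 1, 3, 9, 0)
end = datetime(2019, 1, 3, 9, 47)

print(rto_met(start, end, rto=timedelta(hours=1)))     # True: within a 1-hour RTO
print(rto_met(start, end, rto=timedelta(minutes=30)))  # False: exceeds a 30-minute RTO
```

If the same run passes against a one-hour target but fails against a 30-minute one, you either need to speed up the recovery workflow or accept the longer RTO.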
Conclusion
Every organization aware of the repercussions of a DR event realizes the importance of having a comprehensive DR plan in place. However, many DR plans prove to be invalid due to a lack of testing. To ensure that your DR plan is efficient and up to date, it is important to design various DR scenarios and apply them as part of the DR testing process. DR scenarios allow you to train your staff in how to respond to a disaster, regardless of how unexpected or unlikely it may be, thus reducing panic and confusion.
With NAKIVO Backup & Replication, you can be sure that your system is reliably protected and can be easily recovered. The new Site Recovery feature is an automated multifunctional tool which relieves the pressure of manually conducting the DR process. Moreover, you can run SR job testing at any time, without affecting the production environment. After receiving the test results, you can identify the flaws in your recovery strategy and update the SR job accordingly. Thus, Site Recovery functionality provides you with a number of benefits aimed at ensuring your business continuity and data protection.
Download a full-featured free trial and test the product in your VMware, Hyper-V, or mixed environment today.