December 5, 2018
How to Test a VM Disaster Recovery Plan?
Any business can be hit by an unexpected disaster. It is essential, therefore, to have a VM disaster recovery plan (DR plan) in place to protect your mission-critical data and operations. However, simply creating a VM DR plan is not enough; you must make sure that your plan can be properly executed and effectively meets your recovery objectives (RTOs and RPOs). Thoroughly testing a DR plan and constantly updating it should be one of your main priorities. Regularly checking the preparedness of your systems, processes, and staff ensures that any gaps in your DR plan and possible risks are identified before an actual disaster strikes.
What Is a VM Disaster Recovery Plan?
A VM disaster recovery (DR) plan is a documented set of steps and procedures that should be implemented before, during, and after a disaster to protect mission-critical data and secure business continuity in a virtualized environment. The main purpose is to come up with a plan that ensures the recovery of enough data and system processes for your organization to operate, at least at a minimal level.
When creating a VM DR plan, you should begin by determining the following:
- DR objectives and priorities
- the people or teams responsible for implementing the DR plan
- the recovery order of business operations
Usually, a more detailed plan that is tailored to your organization’s goals and objectives is needed. Therefore, business impact analysis (BIA) and risk analysis (RA) should be conducted to identify the most important business functions and the possible risks.
How to Design a VM DR Plan?
The following recommendations might be useful when designing a detailed DR plan for efficient system recovery.
- Identify business-critical processes and applications. Those processes and applications that are absolutely essential for your business should form the basis of your DR plan design.
- Perform a business impact analysis. A BIA measures the impact of downtime on your business, calculating the cost of downtime (loss of revenue and loss of customer trust, etc.) as well as identifying any security vulnerabilities.
- Define possible risks and threats that could potentially undermine the success of your VM DR plan in case of a disaster. Look for ways to prevent or mitigate these threats.
- Determine the order in which your applications and VMs should be restored to deliver business-critical services. Depending on the scale of your business and your virtual environment, you could have several VMs with different applications that are interdependent. For example, suppose you have a VM with Active Directory Domain Controller, a VM with a database server, and a VM with a web server. In this case, the VM with Active Directory Domain Controller needs to be powered on first. Only then can the VM with the database server be started, as the VM with Domain Controller enables user authentication. Finally, the VM with the web server is powered on, because the web server depends on the database server to operate properly.
- Designate staff members responsible for implementing your DR plan and monitoring its execution in case of an emergency. This group of people forms your disaster recovery team.
- Create a step-by-step VM DR plan and run regular tests. Testing identifies gaps in your plan and allows you to address risk areas before they become a problem in a real disaster scenario.
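The recovery order described above (domain controller first, then the database server, then the web server) is a dependency-ordering problem. The following is a minimal Python sketch of one way to derive such an order, assuming Python 3.9 or later for the standard graphlib module. The VM names and the recovery_order helper are hypothetical and are not part of any product.

```python
from graphlib import TopologicalSorter

# Hypothetical VM names. Each VM maps to the set of VMs it depends on.
dependencies = {
    "domain-controller": set(),
    "database-server": {"domain-controller"},
    "web-server": {"database-server"},
}


def recovery_order(deps):
    """Return VM names ordered so that every dependency is powered on first."""
    return list(TopologicalSorter(deps).static_order())


print(recovery_order(dependencies))
# ['domain-controller', 'database-server', 'web-server']
```

If two VMs depend on each other, static_order() raises graphlib.CycleError. That is a useful signal that the documented dependencies need to be reviewed before the plan is tested.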
Components of VM DR Testing
Following the steps above can help you design an efficient VM DR plan, but this still can’t guarantee 100% protection of your business processes. A VM DR plan can become outdated (and unworkable) for a number of reasons: business expansion, new hardware or software installation, etc. Thus, periodic testing of your DR plan is a necessity in the modern business world.
The following points represent the main components of DR testing.
Defining DR objectives
The main objective of a DR plan is to minimize downtime and prevent data loss in case of a disaster. Two concepts – the recovery time objective (RTO) and the recovery point objective (RPO) – define how much downtime can be tolerated and how much data you can afford to lose, respectively.
On the basis of your BIA, you should set RPOs and RTOs for each application and VM.
- The RPO is the oldest acceptable point in time to which your VMs can be reverted during a DR event. The RPO represents the amount of tolerable data loss, measured in time. Increasing the frequency of backup and replication jobs lets you significantly shorten your RPOs.
- The RTO establishes the period of time within which your virtual infrastructure is expected to be restored. Applications and VMs of the highest priority should be recovered first, so they have the shortest RTOs. With VM replication, you can meet significantly shorter RTOs, because VM replicas represent a point-in-time copy of a VM that can be instantly powered on during a DR event.
The testing scope involves a set of assumptions and expectations that should be met during DR testing. Setting the testing scope allows you to identify the systems and functions that require DR testing. All goals and objectives of DR testing should be clearly defined and communicated to your staff so as to avoid any confusion during the testing procedures. Moreover, exceptions and limitations should be established in advance, because some components of your DR plan may not be executed as planned. With a properly established testing scope, a DR team can be fully prepared for any scenario.
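To make the RPO and RTO definitions above concrete, here is a small Python sketch. It rests on a simplifying assumption: the worst-case data loss equals one full interval between backup or replication jobs, and job duration is ignored. The Objectives class, the application names, and all of the values are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Objectives:
    name: str
    rpo: timedelta              # tolerable data loss, measured in time
    rto: timedelta              # tolerable downtime
    backup_interval: timedelta  # how often a backup or replication job runs


def meets_rpo(o: Objectives) -> bool:
    # Assumed worst case: disaster strikes just before the next job runs,
    # so up to one full interval of changes is lost.
    return o.backup_interval <= o.rpo


def recovery_priority(items):
    # Shortest RTO first: the highest-priority workloads are recovered first.
    return sorted(items, key=lambda o: o.rto)


# Hypothetical applications and objectives.
apps = [
    Objectives("web-shop", rpo=timedelta(hours=4), rto=timedelta(hours=2),
               backup_interval=timedelta(hours=1)),
    Objectives("crm", rpo=timedelta(hours=1), rto=timedelta(minutes=30),
               backup_interval=timedelta(hours=4)),
]

for app in recovery_priority(apps):
    print(app.name, "meets RPO:", meets_rpo(app))
```

In this sample data, the hypothetical crm application has a one-hour RPO but is backed up only every four hours, so its job frequency would need to increase before the objective can be met.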
Reviewing your VM DR plan
Before testing your DR plan, you should perform a review. The testing procedure must be precise and well thought out. DR testing should be conducted in an organized manner by focusing on the organization’s policies and practices. Thus, the disaster recovery team should meet with senior management to review the existing DR plan and determine any changes or updates that should be implemented based on the current state of the business. These changes should also reflect any recent adjustments at the system level of the organization.
After running a DR plan in test mode, it is advisable to review your DR plan once again. Strengths and weaknesses, as well as any unexpected results, should be recorded during DR testing and their impact on business continuity should be measured. This can significantly improve your DR strategies and boost overall performance.
With current IT environments being highly dynamic, determining the review frequency is critical for keeping your DR plan constantly updated. Some organizations review and update their DR plans once per year. However, the most efficient strategy is to update (and re-test) your DR plan whenever mission-critical components of your organization undergo changes. DR testing can prove time-consuming and costly. Thus, you should create your testing schedule on the basis of business needs and resources, considering the scope of DR processes.
Test success criteria
You need to set the criteria that determine whether your VM DR tests are successful or not. Ideally, VM DR testing can be considered passed when a DR plan is proven to be valid and viable. However, DR testing can be deemed successful even when a DR plan has failed to pass the test, because a failed test lets you identify flaws in the plan before an actual disaster and address them in the next iteration of the plan. Essentially, test success criteria are defined on the basis of predetermined expectations, which should be clearly expressed to avoid any confusion.
Evaluation of test results
The results of VM DR testing provide a general overview of the DR strategies currently used in the company. The recovery team can evaluate the test results and come up with improvements or adjustments for the DR plan on the basis of the identified issues.
The following metrics should also be considered when evaluating DR test results:
- How much time elapsed before mission-critical activities were restored
- How well each step of the plan was executed (whether any errors and delays occurred)
- How many operations were successfully completed during the DR testing process
Changes and updates should be made and tested to improve the DR plan. The goal is to provide a more effective and manageable recovery process.
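The metrics listed above can be collected in a simple, repeatable way. The Python sketch below summarizes a test run from per-step records. The StepResult structure, the step names, and the timings are illustrative assumptions, not output from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    succeeded: bool
    seconds: float  # time the step took during the test


def summarize(steps):
    """Summarize elapsed time, completed steps, and failed steps for a DR test."""
    return {
        "total_seconds": sum(s.seconds for s in steps),
        "completed": sum(1 for s in steps if s.succeeded),
        "total_steps": len(steps),
        "failed_steps": [s.name for s in steps if not s.succeeded],
    }


# Hypothetical results recorded during a test run.
results = [
    StepResult("power on domain controller", True, 120.0),
    StepResult("start database server", True, 300.0),
    StepResult("start web server", False, 90.0),
]

print(summarize(results))
```

Keeping the same summary format across test runs makes it easier to see whether changes to the DR plan actually shorten recovery or reduce failures.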
Factors to Consider Before Testing a VM DR Plan
Preparing for VM DR testing helps you achieve the most relevant results. One of the key things to consider is that there should be at least two people in a recovery team so as to avoid the problem of a “single point of failure”. With multiple team members, if one person can’t be reached during a disaster, you can rest assured that there is a substitute with the required knowledge and access to the DR site.
The time of day when testing is conducted is also worth thinking about. Generally, DR testing is executed outside of working hours, as the process is time-consuming and could interrupt business operations or affect overall performance of the company. However, these test results might not be indicative of how the DR plan would function under actual working conditions. Testing the components of a VM DR plan in isolation during working hours could be an ideal solution. This helps reduce the risk of system overload that full testing presents.
Before testing a VM DR plan, consider the various factors that could render your DR plan incomplete and outdated. Such factors might include: the emergence of competitive organizations, the introduction of new hardware or software products, business expansion, budget cuts, etc. Moreover, staff might forget critical components of your DR plan, or new employees might join the organization. For this reason, it is advisable to hold regular refresher meetings or training sessions. Keep the DR team apprised of new changes to the environment and send brief memos notifying staff of the latest updates.
Types of DR Plan Testing
A checklist test involves a list of requirements and conditions that must be met and reviewed. This checklist verifies, for example, that the backup site is of sufficient size, that the recovery team is notified of the latest updates, that the data protection solution is running, etc. By using this testing method, the recovery team can quickly review the DR plan, ensure that every component is in place, and identify any missing components in your DR strategy. This procedure can be conducted in minimal time and without heavy staff involvement.
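A checklist of this kind can also be kept in a machine-readable form so that open items are easy to spot. The Python sketch below is only an illustration. The checklist entries echo the examples above, and their pass/fail values are made up.

```python
# Hypothetical checklist: (requirement, whether it is currently satisfied).
checklist = [
    ("Backup site has sufficient capacity", True),
    ("Recovery team is notified of the latest updates", True),
    ("Data protection solution is running", False),
]


def missing_items(items):
    """Return the requirements that have not been met."""
    return [description for description, done in items if not done]


for item in missing_items(checklist):
    print("Not satisfied:", item)
```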
A walkthrough test verbally walks through every step of a VM DR plan to identify any issues and deficiencies. Here, all members of a recovery team take part in the review and discussion of the DR plan, coming up with recommendations. It is essential to ensure that everyone has a strong understanding of the plan and is aware of their responsibilities during a DR event. This method only involves a verbal discussion of the DR process. The technological aspects of your DR plan are not actually tested or approved in walkthrough testing.
In a simulation test, the organization goes through a simulated disaster scenario to determine whether a DR plan is adequate and whether the defined goals can be met. This method can be considered an extension of the walkthrough test. All team members are presented with various disaster scenarios, which they review by discussing how they would act in the circumstances. This allows you to test the preparedness of your staff in a more realistic setting and check whether your DR plan can deal with unexpected issues.
Parallel testing allows you to test the functionality of your recovery systems to determine whether they can execute business operations and secure critical processes. The primary systems are not included in the testing process, as they are expected to support the full production workload. This is a safe and non-disruptive way to test your technical systems.
A full-interruption test provides thorough testing of your VM DR plan. In this case, your DR site assumes the full production workload and the primary site is shut down. The goal is to recover as quickly as possible using the corporate DR plan. A full-interruption test should be carefully planned, because it can disrupt normal operations and is quite costly.
Every step of the recovery process should be documented. Identify all issues and concerns during test execution so as to address them later. The actions of the recovery team should be closely observed to pinpoint any potential gaps in your VM DR plan. Full-interruption testing is also an appropriate method to check whether your DR objectives are adequate and achievable.
You might consider conducting the full-interruption test without notifying your staff in advance. This allows you to more accurately assess the preparedness of your team in case of disaster.
Useful Tips for VM DR Testing
Testing a VM DR plan is an important task that can seem overwhelming at times. The following DR testing tips could help save you time and stress.
- After installing any new hardware or software products, immediately test them to verify their functionality and integrity. This also helps you to find the product’s RTO and learn how it might perform during DR procedures.
- Perform a risk analysis (RA) and a business impact analysis (BIA) before designing your DR plan. Constantly review the results of these analyses, and if any changes are made, consider how they should be reflected in your DR strategy.
- Testing should be executed in circumstances as similar as possible to a DR scenario. By simulating a real-life disaster scenario, you can see how well employees perform their duties in DR circumstances. This also helps reduce stress among your staff, as employees get more accustomed to various DR scenarios and learn what is expected of them.
- Invite independent observers to review your DR plan and monitor the testing process. This approach ensures that no shortcuts are taken by employees to rapidly complete the tests. Moreover, independent observers can then help rewrite a DR plan and improve it, often identifying issues that are not visible to those within the organization.
- Have a complete list of all the applications in your infrastructure. This list should include the details of each application, their configurations, the contact details of the application owners, and your contract/licensing details.
- In the initial stages, DR testing should be conducted in parts and after business hours so as not to overload the system. After identifying any deficiencies and improving the plan accordingly, you can consider running further full tests during business hours.
Site Recovery Testing in NAKIVO Backup & Replication
NAKIVO Backup & Replication recently introduced a new feature – Site Recovery – that enables the creation and implementation of automated workflows for disaster recovery of any scale. With NAKIVO Backup & Replication, site recovery jobs can be designed to accommodate the specific priorities and needs of your business during a DR event. Moreover, you can modify or supplement your jobs at any time without affecting the production environment. Site recovery jobs can be executed in test mode or in production mode.
Running site recovery jobs in test mode
NAKIVO Backup & Replication lets you run site recovery jobs in test mode to check whether all system components can be easily restored during a DR event and the stipulated DR objectives can be met. A site recovery job in test mode can be scheduled as well as run on demand.
The following walkthrough shows how to run a site recovery job manually in test mode.
1. In the Jobs dashboard, select a site recovery job and then click Run Job. The dropdown menu gives you two options. Click Test site recovery job.
2. In the dialog box that is launched, you can configure your RTO metrics. Define the maximum permissible amount of time your site recovery job can take to complete. If the test run exceeds the RTO value you input, the test is considered failed. This option can be either enabled or disabled.
3. Finally, click Test to run the job.
Test mode features
The following site recovery testing features are provided by NAKIVO Backup & Replication.
NAKIVO Backup & Replication allows you to fail over the primary site’s production workload to a VM replica at the DR site in case of disaster. Failover can be run in test mode so as to check the preparedness of your DR systems and verify the sustainability of your DR plan.
Failback is generally performed to reverse failover once the primary site has been restored after a disaster. Failback switches the production workload back to the VM at the primary site to ensure that all changes made to the data during the period of disruption are preserved. Test failback allows you to verify that the system can be restored and moved back to the primary location smoothly. All changes in your virtual environment caused by failback are reverted to their pre-job state after the test is run.
As mentioned above in the walkthrough, NAKIVO Backup & Replication lets you test whether your RTO can be met in a particular DR scenario. You can manually set the amount of time during which you expect your site recovery job to complete. If the test run exceeds this time frame, the product informs you that the test has failed.
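The pass/fail logic of an RTO check can be illustrated generically. The Python sketch below is not the product's implementation or API. It simply times an arbitrary callable and compares the elapsed time with a threshold. The run_rto_test function and its parameters are hypothetical.

```python
import time


def run_rto_test(job, rto_seconds):
    """Run `job` (any callable) and report whether it finished within the RTO."""
    start = time.monotonic()
    job()
    elapsed = time.monotonic() - start
    return {"elapsed_seconds": elapsed, "passed": elapsed <= rto_seconds}


# Stand-in for a recovery workflow: sleep briefly instead of restoring VMs.
result = run_rto_test(lambda: time.sleep(0.05), rto_seconds=5)
print("passed" if result["passed"] else "failed")
```

A monotonic clock is used so that system clock adjustments during a long test run do not distort the measured duration.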
NAKIVO Backup & Replication provides Network Mapping and Re-IP features, which can be easily set up when configuring a site recovery job. Before running a site recovery job in test mode, you can enable Network Mapping and Re-IP options and input the network settings and the IP settings. This lets you verify that the source VM virtual networks can be mapped to appropriate target virtual networks and that the source VM IP addresses can be mapped to the target IP addresses during disaster recovery.
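Conceptually, Re-IP rules translate an address in a source network into the corresponding address in a target network. The Python sketch below shows that idea using the standard ipaddress module. It is an assumption-laden illustration rather than a description of how the product applies its rules: the subnets and the map_ip helper are hypothetical, and each rule pairs two networks with the same prefix length.

```python
import ipaddress

# Hypothetical Re-IP rules: (source subnet, target subnet), same prefix length.
reip_rules = [("192.168.1.0/24", "10.10.1.0/24")]


def map_ip(source_ip, rules):
    """Map a source IP to the same host offset in the matching target subnet."""
    addr = ipaddress.ip_address(source_ip)
    for source, target in rules:
        source_net = ipaddress.ip_network(source)
        target_net = ipaddress.ip_network(target)
        if addr in source_net:
            offset = int(addr) - int(source_net.network_address)
            return str(target_net.network_address + offset)
    return None  # no rule covers this address


print(map_ip("192.168.1.25", reip_rules))
# 10.10.1.25
```

Running a mapping like this against the list of source VM addresses before a test can reveal addresses that no rule covers.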
The test schedule is a feature specific to jobs run in test mode. It enables you to check whether a site recovery job can be run within the stipulated time period. Here, you can choose to run your site recovery job in test mode on demand or set up a schedule for the job to be regularly tested. The following options are available: Run daily/weekly, Run monthly/yearly, Run periodically, or Run after another job. The option Effective from (if enabled) allows you to set a schedule to start on a selected date.
From the scheduling interface, you can also Add another schedule for the job or view the calendar with all your planned jobs by clicking on Show calendar.
If the Send test/run report to option is enabled, selected recipients are sent a test report every time the job is completed. To enable it, select Send test/run report to and enter an email address to which the report should be sent. Alternatively, you can right-click the name of the test run for which you would like to see results and choose the Site Recovery Job report option. The report is then immediately downloaded to your computer.
A DR plan is essential for ensuring business continuity and high availability. Because disaster can strike when least expected, it is critical to go through various DR scenarios and design an effective DR plan with the most likely risks and threats in mind. However, planning is not enough to fully protect your system. Your DR plan should be regularly tested so you can review and improve it.
Site recovery job testing with NAKIVO Backup & Replication helps you assess the efficiency of your DR plan. By running site recovery jobs in test mode, you can check that your DR goals can be achieved when the time comes and help your staff become more comfortable with the procedures. Email reports provide summaries of the job and notify you of any issues that arose during execution.
Fixing faults and failures that have been identified in your plan during a DR test can be a daunting task, but thorough testing is worthwhile; it is the only way to be sure you are prepared for any type of disaster.
Download the full-featured Free Trial of NAKIVO Backup & Replication to build and test comprehensive site recovery workflows in your own environment.