December 7, 2018
How to Build an Effective VM Disaster Recovery Strategy
The number of businesses grows every minute, and so does the amount of data they work with. Even a minor error can cause major disruptions in an organization’s systems, which is why data protection matters. To address this risk, various disaster recovery (DR) strategies have been introduced. Which DR strategy to adopt generally depends on the scale of your business operations, the amount of data, and the allocated budget. Considering all aspects of your business and choosing the most effective DR strategy is therefore the best way to ensure high availability and business continuity for your organization.
What Is a VM Disaster Recovery Strategy?
Disaster recovery is the process of restoring business infrastructure to a normal state after a disaster. A disaster is any event, natural or man-made, that puts an organization’s operations at risk. Essentially, disaster recovery is aimed at protecting and restoring the IT systems that support the key functions of the organization. The final goal of any VM DR process is to resume business operations almost instantly and secure the most critical data to ensure business continuity.
A VM DR plan and a VM DR strategy are the core elements of disaster recovery, which help to determine how to act during a DR event. The main difference between the two concepts is that a DR strategy defines which policies, procedures, and tools to implement in case of a disaster, while a DR plan formulates a list of steps to follow during a DR event.
A VM DR strategy is developed on the basis of a business impact analysis and risk assessment, which define the most critical components of the virtual environment and the repercussions of a prolonged system shutdown. With these results, you can define the recovery order of the system elements based on their priority and decide which recovery strategy should be applied in a specific DR scenario. Moreover, consider the amount of data you want to protect, the budget available for DR activities, the resources necessary for maintaining the system during a DR event, and the most likely threats and risks. Taking all these factors into account allows you to design a DR strategy, which later forms the basis of a full DR plan.
An organization can have only one DR strategy around which a DR plan is built. This is due to the fact that a DR strategy is specifically designed to accommodate the needs and priorities of a particular organization. A DR strategy evaluates all components of the organization and considers DR goals and objectives, which are unique to every business. Thus, high-level protection can be achieved.
Key Elements of a Disaster Recovery Strategy
Disaster recovery metrics
DR metrics are critical values as they define the goals and objectives which a DR plan is expected to achieve during a disaster. The following DR metrics can be distinguished.
- Recovery Time Objective (RTO)
RTO is a period of time within which business operations must be restored after a disaster so as to prevent any significant damage and critical losses.
- Recovery Point Objective (RPO)
RPO determines the amount of data, measured in time, which can be lost during a disaster without harming your business. Essentially, the RPO is the point in time that your VMs will be reverted to in case of a disaster.
- Work Recovery Time (WRT)
WRT is a metric that defines the period of time within which the company is expected to verify system and data integrity. This ensures that the virtual environment has successfully recovered after a disaster and is ready to resume operations.
- Maximum Tolerable Downtime (MTD)
MTD represents the sum of RTO and WRT, which defines the total amount of time that your organization can allocate for the entire DR process without incurring significant losses and serious repercussions.
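The relationship between these metrics can be shown with a short calculation. The following is a minimal Python sketch, not part of any product: the function names and the sample targets (4 hours for RTO, 2 hours for WRT) are assumptions chosen purely for illustration.

```python
from datetime import timedelta


def maximum_tolerable_downtime(rto: timedelta, wrt: timedelta) -> timedelta:
    """MTD is the sum of RTO and WRT, as defined above."""
    return rto + wrt


def within_mtd(recovery_time: timedelta, verification_time: timedelta,
               rto: timedelta, wrt: timedelta) -> bool:
    """Return True if actual recovery plus verification fits inside the MTD."""
    actual_downtime = recovery_time + verification_time
    return actual_downtime <= maximum_tolerable_downtime(rto, wrt)


# Hypothetical targets: 4 hours to restore systems, 2 hours to verify them.
rto = timedelta(hours=4)
wrt = timedelta(hours=2)
mtd = maximum_tolerable_downtime(rto, wrt)
print(mtd)  # 6:00:00

# Recovery overran the RTO by an hour, but verification was quick,
# so the total downtime still fits inside the MTD.
print(within_mtd(timedelta(hours=5), timedelta(hours=1), rto, wrt))  # True
```

The second check illustrates why the MTD matters: a missed RTO does not automatically mean the whole recovery was unacceptable, as long as total downtime stays within the MTD.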
Disaster recovery site
A DR site is a valuable asset in the DR process because it carries the workloads of the production site during the DR period while the main site is being recovered. An alternate site should have equipment and infrastructure capable of carrying the workloads of the production center until the primary site is restored or a new location is found. A DR site should be situated away from your primary site so that both sites are not disrupted at the same time.
There are three types of DR sites:
- Cold Site
A cold site is a datacenter or office facility equipped with basic infrastructure. A cold site provides electric power, cooling, network connectivity, and/or office space, but it doesn’t have any hardware installed (or it might have equipment that is not operational). Thus, in case of a disaster, the virtual environment can’t be restored immediately, as additional time is required to build a DR site suitable for carrying the workloads of the primary site. A cold site is the cheapest option of the three.
- Warm Site
A warm site is a datacenter or office facility that is partially equipped with the necessary hardware and software, electric power, and environmental support equipment. A warm site has pre-installed hardware that can support some operations of the primary site. Data synchronization between the primary site and the warm DR site is conducted daily or weekly, meaning that data on the DR site is older than on the primary site. Warm sites are suitable for recovering workloads that are not critical and projects that can tolerate partial data loss.
- Hot Site
A hot site is an exact copy of the production site, with the same hardware and software, operating systems and applications, network systems, and power supply, and with almost real-time data synchronization. The production site can easily fail over to the DR site, and the entire virtual infrastructure can be promptly restored in a few clicks. This is the most expensive option of the three.
Therefore, the organization’s budget, data sensitivity, and maximum tolerable downtime determine which type of DR site best fits your business needs.
A DR plan represents a set of policies and procedures intended for responding to unexpected events that might undermine an organization’s IT infrastructure, services, and staff. The DR plan is primarily designed to ensure the recovery of data and system processes, enabling an organization to operate even during and after a disaster.
Before creating a DR plan, risk assessment and a business impact analysis should be conducted. These procedures help identify which areas of the IT infrastructure are most vulnerable and prone to disruption, and how the system shutdown would affect business operations.
Simulating a DR event helps identify weaknesses in a DR plan and, as a result, improve DR strategies of your organization. A flawed and outdated DR plan can considerably undermine business continuity. Thus, to ensure high availability during a DR event, you need to regularly test a DR plan and update it accordingly.
Methods of DR testing
The following types of DR testing can be distinguished:
- Checklist testing
A checklist test involves reviewing the list of requirements and conditions that must be met to ensure a successful DR process. This checklist covers various aspects, such as whether the backup site is of adequate size, whether the recovery team is aware of the latest system updates, and whether the data-protection solution is updated and properly licensed. This testing method allows you to review the basic elements of the DR plan and identify gaps in a DR strategy. This procedure can be conducted with minimal time and staff involvement.
- Walkthrough testing
Recovery team members verbally walk through every step of a DR plan and identify possible issues and risks. This way, you can ensure that every employee is aware of the plan components and their responsibilities during a DR event, and staff are encouraged to come up with recommendations. This test type is a verbal discussion of the DR process, so the technological aspects of your DR plan are not tested this way.
- Tabletop/simulation testing
This method is considered an extension of the walkthrough test. Here, recovery team members not only discuss all aspects of a DR plan, but are also presented with various DR scenarios that simulate real-life incidents. Employees review a DR scenario and discuss how they would act in particular circumstances. Simulation testing allows you to train your staff in a more realistic setting and determine the sustainability of your DR plan.
- Parallel testing
Parallel testing helps check the sustainability of your recovery systems and whether the DR site can carry the workloads of the production site in case of a disaster. A parallel test is similar to a full-interruption test, except that the production center continues to carry the full production workload. This method is time-consuming, but it is the safest way to test your technical systems.
- Full-interruption testing
A full-interruption test involves thorough testing of a DR plan. In this case, the production site is actually shut down, which allows the DR site to assume the full production workload. This test type identifies whether your virtual environment can quickly recover mission-critical operations by following a DR plan. A full-interruption test should be handled with great care so as to avoid any disruptions in the system.
Disaster Recovery Control Measures
DR control measures represent a set of mechanisms and procedures that can identify possible threats and reduce the level of their impact on the system. All control measures should always be included in a DR plan. They are divided into three types.
- Preventive measures are intended to prevent an event from occurring. These measures identify deficiencies in the virtual environment and eliminate possible risks. Preventive measures include keeping data backed up and off site, performing routine inspections, and installing fire suppression systems or surge protectors.
- Detective measures are used for identifying existing issues in the virtual environment that might develop into serious threats in the future. They include setting up antivirus software and firewalls, installing fire alarms, and instructing staff on how to act during a DR event.
- Corrective measures aim at fixing the system after a DR event. These measures involve adopting proper insurance policies, implementing effective DR tools, and updating the system based on an after-action report.
How to create an effective DR strategy
Perform risk assessment and a business impact analysis
This should be the first step in creating an effective DR strategy. Risk assessment and a business impact analysis help define the most critical elements of the business and evaluate the system’s sustainability. Identifying possible risks and threats and understanding how they would impact the virtual environment allows you to choose the optimal DR strategy and prepare your system in advance.
Establish recovery priorities
On the basis of risk assessment and a business impact analysis, it is now possible to define the most critical systems and functions of your IT infrastructure, which should be recovered first. When disaster strikes, you have a very limited amount of time and resources for restoring your virtual environment. Therefore, you need to clearly outline your recovery priorities so as to avoid any confusion in the DR process and ensure that the system can function properly even during a DR event.
Identify mission-critical data and applications and their recovery order
It is also essential to know which data and applications are most sensitive and should be recovered first. In this case, you should consider business needs and technical requirements of your virtual infrastructure. Also, identify the dependencies existing between applications and core elements of the system and document them so as to prevent any unexpected disruptions during disaster recovery.
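Once dependencies are documented, a valid recovery order can be derived from them automatically. The following Python sketch uses the standard library's graphlib module (Python 3.9 or later). The system names and their dependencies are hypothetical and stand in for whatever your own inventory contains.

```python
from graphlib import TopologicalSorter

# Hypothetical inventory: each system maps to the systems it depends on,
# which therefore have to be recovered before it.
dependencies = {
    "domain-controller": set(),
    "database": {"domain-controller"},
    "app-server": {"database", "domain-controller"},
    "web-frontend": {"app-server"},
}


def recovery_order(deps):
    """Return the systems in an order where every dependency comes first."""
    return list(TopologicalSorter(deps).static_order())


order = recovery_order(dependencies)
print(order)
# ['domain-controller', 'database', 'app-server', 'web-frontend']
```

If the documented dependencies contain a cycle (for example, two applications that each require the other to be running), static_order() raises graphlib.CycleError, which is a useful signal that the dependency map needs to be reviewed before a real DR event.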
Evaluate resources of an organization
Successful disaster recovery requires a separate site with a fully operational IT infrastructure. An off-site location is a remote data storage resource or facility that is used to replicate and back up data from a local host to a target host.
The systems on the off-site location should always be on standby in case a disaster strikes. Having a remote DR location allows securing mission-critical data and restoring your virtual infrastructure in a few clicks. This is achieved through failover, which is the process of switching from a source VM to a VM replica in order to move business-critical workloads from an affected site to a DR site.
The data can be transferred through magnetic disks, tape drives, or a wide area network (WAN). However, this approach has proved to be expensive and time-consuming. Therefore, more affordable and effective cloud-based systems were introduced.
Conduct the technology inventory
Inventorying applications and technical equipment allows you to identify which of them are critical and which are not. Moreover, a technology inventory determines the state of hardware and software, whether it needs to be updated, and whether the necessary documentation and its copies are in place.
Implement a comprehensive cost model
A DR strategy is primarily based on the resources of your organization, particularly its budget. Some organizations dismiss the importance of training staff, running tests, and installing high-end technology on the DR site, all of which are important for forming an effective DR strategy. Therefore, it is essential to have a full grasp of the components of the DR process and be aware of all the costs involved. Adopt a comprehensive cost model that takes into account business needs and priorities and ensures maximum efficiency for a reasonable price.
Define RTOs and RPOs
Businesses have become more reliant on high availability, and prolonged system downtime can lead to significant losses. Thus, it is crucial to choose the optimal recovery time objective (RTO) and recovery point objective (RPO), on the basis of which you can create a DR plan.
The choice of RTO and RPO is primarily based on the priorities of your organization during a DR event. Some businesses operate with less sensitive data, and a short system shutdown won’t significantly affect their services. After defining your RPOs and RTOs, you should decide how frequently to create data backups and replicas in your virtual environment. Creating multiple recovery points ensures minimum loss of data in case of a disaster.
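To see how an RPO translates into a backup or replication frequency, consider the following Python sketch. It rests on a simplifying assumption that is not taken from any product documentation: a recovery point captures the VM state at the moment the job starts and becomes usable only when the job finishes. The function names and sample values are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(job_interval: timedelta,
                         job_duration: timedelta) -> timedelta:
    """Estimate the maximum age of the newest usable recovery point.

    Assumption: a disaster that strikes just before the next job finishes
    forces a rollback to the previous recovery point, which by then is
    roughly one interval plus one job duration old.
    """
    return job_interval + job_duration


def meets_rpo(job_interval: timedelta, job_duration: timedelta,
              rpo: timedelta) -> bool:
    """Return True if the schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(job_interval, job_duration) <= rpo


rpo = timedelta(hours=1)
print(meets_rpo(timedelta(minutes=30), timedelta(minutes=10), rpo))  # True
print(meets_rpo(timedelta(minutes=60), timedelta(minutes=10), rpo))  # False
```

Under this model, running a job exactly once per RPO period is not enough, because the time the job itself takes also counts toward potential data loss.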
Work Recovery Time (WRT) defines the maximum period of time within which the company is expected to verify system and data integrity. This metric is also critical because it covers checking the state of applications and systems in the virtual environment and whether they are ready to take on the workload and resume business operations.
The sum of RTO and WRT is the Maximum Tolerable Downtime (MTD), which represents the maximum amount of time during which system operations can be down without causing any serious repercussions. Setting a realistic MTD is critical because, even if the RTO isn’t met during the DR process, the MTD provides extra time during which recovery is still acceptable and the business will not suffer irreversible damage.
Distribute roles and responsibilities within a recovery team
A recovery team responsible for designing a DR plan and ensuring its implementation should be formed. Management should designate the members of the recovery team and assign specific duties and responsibilities to each team member. This ensures that every employee knows what is expected of them and acts accordingly. Moreover, it is crucial to familiarize staff with the organization’s DR strategies, make sure that people are trained to use them, and notify staff of the latest updates by sending brief memos.
Ensure automation of the DR process
If you want to ensure that recovery of the system runs smoothly, even with minimal input on your part, consider automating the DR process. Data replication allows you to create multiple VM replicas, which can be used to fail over and then fail back critical workloads. For this purpose, you can use a data-protection solution that enables automation and orchestration of DR activities in a few clicks.
Follow the 3-2-1 rule
The 3-2-1 rule helps build a reliable data-protection system that is resistant to any type of disaster. It follows this pattern: create 3 copies of your data, store the copies on 2 different types of media, and transfer 1 of these copies to a DR site. Thus, the virtual environment can be easily recovered, even if a copy of your data is destroyed or accidentally deleted.
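The rule is simple enough to check automatically against an inventory of your data copies. Below is a minimal Python sketch. The DataCopy structure, its field names, and the sample inventory are hypothetical and only illustrate the three conditions described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataCopy:
    name: str      # label for this copy
    media: str     # type of storage media, e.g. "disk", "tape", "cloud"
    offsite: bool  # True if the copy is kept at a DR site


def satisfies_3_2_1(copies: List[DataCopy]) -> bool:
    """Check for 3 copies, 2 media types, and at least 1 copy off site."""
    enough_copies = len(copies) >= 3
    enough_media = len({copy.media for copy in copies}) >= 2
    has_offsite = any(copy.offsite for copy in copies)
    return enough_copies and enough_media and has_offsite


# Hypothetical inventory of copies for one workload.
inventory = [
    DataCopy("local backup repository", media="disk", offsite=False),
    DataCopy("tape archive", media="tape", offsite=False),
    DataCopy("replica at DR site", media="disk", offsite=True),
]
print(satisfies_3_2_1(inventory))      # True
print(satisfies_3_2_1(inventory[:2]))  # False
```

The second call fails on two counts: only two copies remain, and neither of them is stored off site.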
Review DR strategies in place
DR strategies form the basis of a DR plan. Thus, it is important to review them and determine whether they still serve their purpose. Most businesses tend to expand over time, meaning that the existing virtual environment may no longer have the capacity to carry the full production workload as expected. The same goes for DR strategies, which can also be affected by the organization’s growth, expansion of business operations, and updates in the system. Therefore, it is essential to regularly review DR strategies and ensure that they still comply with the infrastructure requirements.
Regularly test and update DR strategies
Apart from reviewing DR strategies, you should also test them to determine whether they are functional and can protect the virtual environment as planned. Create an after-action report that describes what happened during the test, the issues encountered, and the lessons learned. On the basis of this report, you can decide how to update DR strategies. Implementing flawed and outdated DR strategies can lead to failure of the virtual environment.
How to Build a Disaster Recovery Strategy with NAKIVO Backup & Replication
The set of features in NAKIVO Backup & Replication allows you to implement a DR strategy of any complexity. To understand what these features entail and how they can improve your DR strategy, read the following tips.
The state of your VMs should always be checked to identify if there are any potential risks to their functionality and whether they can be easily reached in case of a disaster. NAKIVO Backup & Replication can periodically check the state of your VMs and inform you if the integrity of your VMs has been compromised.
Run replication jobs to create an identical copy of a source VM (a VM replica). A VM replica is stored in a powered-off state but can be powered on in just a few clicks. A VM replica serves as the core component of the failover operation, meaning that the workloads of the source VM affected by a disaster can be switched over to its VM replica at the DR site.
Running regular replication jobs allows you to create multiple recovery points and ensures that a VM replica contains the latest updates of a source VM.
Create a site recovery workflow
With NAKIVO Backup & Replication, you can easily build site recovery workflows (jobs) by combining available actions and conditions (start/stop VMs; failover/failback VMs; run/stop/enable/disable jobs; wait; etc.). SR jobs allow you to orchestrate and automate the entire DR process and can be designed for specific DR scenarios, such as disaster avoidance or planned migration. Automated SR jobs save time and resources and minimize the probability of human error, which is a risk when a DR strategy is implemented manually.
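Conceptually, such a workflow is an ordered list of actions that runs until it completes or a step fails. The Python sketch below illustrates that general idea only. It does not use or represent the NAKIVO Backup & Replication interface or API, and the action functions are hypothetical placeholders that always succeed.

```python
from typing import Callable, List, Tuple

# A workflow action is a label plus a function that reports success.
Action = Tuple[str, Callable[[], bool]]


def run_workflow(actions: List[Action]) -> List[str]:
    """Run actions in order and stop at the first failure. Returns a log."""
    log: List[str] = []
    for name, step in actions:
        succeeded = step()
        log.append(f"{name}: {'ok' if succeeded else 'FAILED'}")
        if not succeeded:
            break  # do not continue a failover on top of a failed step
    return log


# Hypothetical placeholder actions. Real ones would call your DR tooling.
def stop_source_vm() -> bool:
    return True


def fail_over_to_replica() -> bool:
    return True


def start_replica_vm() -> bool:
    return True


failover_workflow: List[Action] = [
    ("stop source VM", stop_source_vm),
    ("fail over to replica", fail_over_to_replica),
    ("start replica VM", start_replica_vm),
]

for line in run_workflow(failover_workflow):
    print(line)
```

Defining the steps once and running them the same way every time is what removes the room for human error that a manual runbook leaves open.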
Test a site recovery workflow
SR jobs can be run in production mode and test mode. Production mode is used when an actual disaster strikes and you need to quickly recover mission-critical workloads at the DR site. However, test mode is just as important as production mode. Test mode verifies the sustainability of your SR jobs, identifies potential risks, and checks whether the virtual environment can be recovered according to a DR plan and whether DR objectives can be met.
Also, you can test your RTO and determine whether it can be met in a DR event. NAKIVO Backup & Replication allows you to set the period of time within which the job must complete. If the job exceeds this time frame, the test is considered failed.
Testing a site recovery workflow helps optimize and update DR strategies, thus ensuring that your virtual environment is protected in case of a DR event.
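The pass/fail logic of an RTO test can be expressed in a few lines. The following Python sketch is a generic illustration and not the product's implementation: run_timed_test and simulated_recovery are hypothetical names, and the short sleep stands in for a real test recovery.

```python
import time
from datetime import timedelta
from typing import Callable, Tuple


def run_timed_test(job: Callable[[], None],
                   time_limit: timedelta) -> Tuple[bool, timedelta]:
    """Run a recovery job and report whether it finished within the limit."""
    started = time.monotonic()
    job()
    elapsed = timedelta(seconds=time.monotonic() - started)
    return elapsed <= time_limit, elapsed


def simulated_recovery() -> None:
    """Stand-in for a real test recovery. It only waits briefly."""
    time.sleep(0.05)


passed, elapsed = run_timed_test(simulated_recovery, timedelta(seconds=5))
print("passed" if passed else "failed", elapsed)
```

In a real test, the time limit would be set to your RTO, so a failed result tells you that either the recovery process or the objective itself needs to be revised.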
Enable Network Mapping and Re-IP
NAKIVO Backup & Replication provides the Network Mapping and Re-IP features which can be easily set up when configuring a site recovery job. You can enable Network Mapping and Re-IP in the Site Recovery Job Wizard and insert the network settings and the IP settings. By configuring these settings, you can ensure that source VM virtual networks can be mapped to appropriate target virtual networks and source VM IP addresses can be mapped to specific target IP addresses during disaster recovery.
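To make the idea of these mappings concrete, here is a small Python sketch built on the standard ipaddress module. It is an illustration under stated assumptions rather than a description of how the product applies its rules: the network names and subnets are hypothetical, and the re_ip function assumes that a VM keeps the same host position when it moves to the target subnet.

```python
import ipaddress

# Hypothetical network mapping: source virtual network -> target network.
network_map = {
    "Prod-LAN": "DR-LAN",
    "Prod-DMZ": "DR-DMZ",
}


def re_ip(source_ip: str, source_subnet: str, target_subnet: str) -> str:
    """Map an address to the same host position in the target subnet."""
    src_net = ipaddress.ip_network(source_subnet)
    dst_net = ipaddress.ip_network(target_subnet)
    address = ipaddress.ip_address(source_ip)
    if address not in src_net:
        raise ValueError(f"{source_ip} is not in {source_subnet}")
    offset = int(address) - int(src_net.network_address)
    if offset >= dst_net.num_addresses:
        raise ValueError(f"{target_subnet} is too small for {source_ip}")
    return str(dst_net.network_address + offset)


print(network_map["Prod-LAN"])                                 # DR-LAN
print(re_ip("192.168.1.45", "192.168.1.0/24", "10.0.1.0/24"))  # 10.0.1.45
```

Keeping such mappings in one place means that a failed-over VM comes up on the right network with a predictable address, without anyone reconfiguring it by hand during the incident.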
Set up screenshot verification and automatic reports
NAKIVO Backup & Replication enables automatic verification of VM backups and replicas by powering them on and taking screenshots of the test-recovered VMs. The screenshots are then sent via email as proof of the integrity of the VMs.
Moreover, the test report can be sent via email after the current job is completed. You need to enable the option Send test/run report to and insert an email address to which the report will be sent. Or, you can right-click the name of the tested job for which you would like to see the results and choose the Site Recovery Job report option. This way you can download the report to your computer.
Enable Job options
You can enable various job options for optimization of SR jobs. They include: application-aware mode, Changed Block Tracking for VMware VMs (or Resilient Change Tracking for Hyper-V VMs), network acceleration, encryption, bandwidth throttling, and others.
Application-aware mode ensures that the application data in Microsoft Exchange, Active Directory, SQL Server or any other application remains consistent during backup and replication. For this purpose, guest OS quiescing is used.
NAKIVO Backup & Replication relies on VMware CBT (Changed Block Tracking) and Hyper-V RCT (Resilient Change Tracking) to identify and copy the changes that have been made in a VM since the last replication. This technology significantly improves the speed of replication jobs. If CBT and RCT are unavailable, NAKIVO Backup & Replication uses a proprietary change tracking method.
Network acceleration boosts data transfer with the help of compression and traffic reduction techniques. If you transfer data over a slow WAN, network acceleration can reduce the load on bandwidth, at the cost of increasing the load on Transporters.
Encryption protects critical data when it is sent over a WAN without a VPN. The data is encrypted so that only authorized users can access and read it.
Bandwidth throttling lets you regulate the speed of data transfer over the network. NAKIVO Backup & Replication allows you to fully control how your data protection processes use the available bandwidth.
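The benefit of copying only changed blocks can be illustrated with a naive Python sketch. Note that this is not how VMware CBT, Hyper-V RCT, or the product's proprietary method work: those track changes at the hypervisor or product level, whereas this example simply compares block hashes of two versions of the same data. The tiny block size and sample data are chosen only to keep the output readable.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4  # unrealistically small, for demonstration only


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """Split data into fixed-size blocks and hash each block."""
    return [
        hashlib.sha256(data[start:start + block_size]).hexdigest()
        for start in range(0, len(data), block_size)
    ]


def changed_blocks(old: bytes, new: bytes,
                   block_size: int = BLOCK_SIZE) -> List[int]:
    """Return indexes of blocks in `new` that differ from or extend `old`."""
    old_hashes = block_hashes(old, block_size)
    new_hashes = block_hashes(new, block_size)
    return [
        index
        for index, digest in enumerate(new_hashes)
        if index >= len(old_hashes) or old_hashes[index] != digest
    ]


previous_state = b"AAAABBBBCCCC"
current_state = b"AAAAXXXXCCCCDDDD"
print(changed_blocks(previous_state, current_state))  # [1, 3]
```

Only two of the four blocks would need to be transferred here, which is the reason incremental replication is so much faster than copying a whole VM each time.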
Perform staged VM replication
Some virtual environments contain multiple VMs, which can be quite large, so replicating them to a DR site can take a lot of time. For such cases, NAKIVO Backup & Replication provides the option of staging (or seeding) VM replication. For the initial data transfer, VM replicas are transferred (“seeded”) to a DR site using removable media (such as a USB hard drive), and a new replication job is then created that uses the transferred VMs as its target. From this point on, only incremental replication is performed, which saves time and reduces the load on the network.
With NAKIVO Backup & Replication, you can run jobs either on demand or on schedule (daily, weekly, monthly, and yearly). You can create a custom schedule that is most appropriate for your business environment, e.g., every half an hour, every 3 days, or once every 2 weeks. Scheduling is extremely helpful for testing purposes: you can set up a separate test schedule for every SR job and determine the frequency of testing. Also, you can test the RTO of your DR plan by configuring when the job test is expected to complete. If the job exceeds this time frame, the test is considered failed.
Disasters and their consequences may vary in nature, but their impact on an organization’s activities should not be underestimated. Therefore, building an effective DR strategy should be given the highest priority. Otherwise, the repercussions of such negligence may include irreversible damage to reputation, loss of revenue and clients, corruption of business-critical data, bankruptcy, and even loss of business to competitors.
To ensure that minimum or zero damage is done during a disaster, an organization should adopt an effective DR strategy, install reliable tools, and provide a considerable pool of resources. Therefore, a fast, cost-effective, and reliable DR solution is required. NAKIVO Backup & Replication is the perfect choice. The product is compatible with VMware, Hyper-V, and AWS EC2 environments. This DR solution protects VMs of any size, which can be easily recovered with minimal input on your part. Moreover, NAKIVO Backup & Replication provides an advanced Site Recovery feature, which allows you to orchestrate and automate the entire DR process, ensuring that even the most sophisticated DR strategy can be tested and then implemented.
Download a full-featured free trial and test the product in your VMware, Hyper-V, or mixed environment today.