September 5, 2018
Site Recovery with NAKIVO Backup & Replication Part 1: Planning
In a virtual infrastructure, site recovery is the process of recovering your virtual machines (VMs) and the services running on them at a secondary site (known as a disaster recovery site, or “DR” site) when your production site becomes unavailable.
NAKIVO Backup & Replication v8.0 includes Site Recovery functionality that allows you to create advanced recovery workflows and fail over an entire site in just a few clicks. Before creating a workflow, however, you must evaluate the particular recovery needs of your company. The first installment in a series, this blog post explores the required recovery planning activities and best practices before we delve deeper into using NAKIVO’s new Site Recovery functionality in the upcoming posts.
Best Practices of Site Recovery
The most important best practices for site recovery include conducting a business impact analysis, assessing the risks, and creating site recovery documentation.
Conducting the Business Impact Analysis
Business impact analysis (or BIA) involves determining the potential negative impact of natural or man-made disasters on business operations. The VMs used for business processes may depend on each other and have different degrees of importance. Thus, failure of one VM might cause certain delays and inconveniences, while failure of another VM would cause complete interruption of business-critical operations.
For example, if a VM running a bug tracker fails, then the business can operate despite some inconveniences for its employees. If the VM with the production database server were to fail, however, then the company could not operate and would incur financial losses. Conducting the BIA helps determine the priority with which VMs must be recovered and how long the recovery process should take.
Assessing the Risks Involved
Before conducting any recovery planning activities, compile the relevant data and statistics to identify which risks are the greatest for your company. In some areas, a long-term power outage or a virus attack is more likely to occur than a tornado, but the opposite is true in other regions. With the results of your risk assessment, you can determine the appropriate level of protection against certain threats and come up with measures to minimize the risks or mitigate the consequences. The risks cannot be completely eliminated, but your company can be better prepared for the disaster scenarios that you are more likely to encounter.
Developing Site Recovery Documentation
Once the risks and their potential impact on your business are identified, you have a better understanding of where to focus your efforts when disaster strikes. Document your recovery procedures, describing all the vital steps and DR measures in detail. Assign roles and responsibilities to team members in the event of disaster. A site recovery plan should also cover hardware and software components needed for successful recovery. The documentation should be regularly updated to reflect all changes made in the environment.
The recovery process is complex, encompassing many different activities and components that can be easily missed if not documented. Organizations failing to devise detailed site recovery plans are more likely to experience downtime and data loss. For a prompt response to disruptive events, a company needs a clear understanding of where to start as well as awareness of all the most critical aspects. Thus, properly developed site recovery documentation increases the odds of successful recovery.
Determining Site Recovery Scope
Determining the critical components that must be recovered first can significantly shorten your recovery time. Not all VMs in your infrastructure are equally important. VMs housing business-critical information, IT systems, and applications whose operation is essential to ensure continuous delivery of services should be your top priorities. These need to be recovered most urgently. Assess the importance of each hardware and software component in your infrastructure and include the most critical ones in your site recovery plan.
Determining RTO and RPO
Recovery time objective (RTO) and recovery point objective (RPO) are two important metrics that must also be described in your site recovery plan. The former defines how much time your company can afford to spend on recovery without incurring unacceptable financial losses. The latter defines how much data, measured as a period of time, your company can afford to lose if an outage occurs. In other words, the RPO value sets the maximum interval between backup or replication runs: with an RPO of one hour, the jobs must run at least once an hour.
Different VMs can be assigned different RTO and RPO values. For example, consider VMs with financial systems: long recovery times are unacceptable, and any data loss is extremely detrimental. These VMs should therefore be assigned the shortest possible RTOs and RPOs. VMs used for storing archived documents can have significantly longer RTOs and RPOs.
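To make the relationship between RPO and job scheduling concrete, here is a minimal Python sketch. The VM names, the RTO and RPO values, and the helper meets_rpo are all hypothetical and chosen only for illustration; the check also assumes, for simplicity, that a job completes instantly, so real schedules need some extra margin.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    """Hypothetical per-VM objectives; names and values are illustrative."""
    vm_name: str
    rto: timedelta  # maximum tolerable time to restore the VM
    rpo: timedelta  # maximum tolerable window of lost data


def meets_rpo(objectives: RecoveryObjectives, job_interval: timedelta) -> bool:
    """A job that runs every `job_interval` can lose up to one full interval
    of data, so the interval must not exceed the RPO."""
    return job_interval <= objectives.rpo


vms = [
    # Assumed values: a financial system with tight objectives ...
    RecoveryObjectives("finance-db", rto=timedelta(minutes=15), rpo=timedelta(minutes=5)),
    # ... and a document archive that tolerates a full day.
    RecoveryObjectives("doc-archive", rto=timedelta(hours=24), rpo=timedelta(hours=24)),
]

hourly = timedelta(hours=1)
results = {vm.vm_name: meets_rpo(vm, hourly) for vm in vms}
for name, ok in results.items():
    print(f"{name}: hourly schedule {'meets' if ok else 'violates'} the RPO")
```

In this sketch, an hourly schedule is enough for the archive VM but not for the financial VM, which would need a job at least every five minutes.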
Determining Site Recovery Dependencies
Dependencies and interconnections exist between your staff and the IT components of your virtual infrastructure. These dependencies should be carefully evaluated, since even a single missed link in the dependency chain can lead to devastating consequences.
VM Recovery Order
In any infrastructure, particular VMs may depend on the software or information housed by another VM, which means they cannot operate separately or be started at random. For example, the VM running an Active Directory domain controller must be up and running before you can start a VM with a file server that uses Active Directory authentication.
Web services often rely on software that is installed on several different VMs. For example, the following sequence might need to be implemented:
- The VM with the database server should be started first.
- The VM with the application server can then be started.
- Only then can the VM with the web server be started.
By having the recovery order predetermined, you can shorten recovery time, ensure a smooth recovery process, and eliminate the risk of software conflicts in your infrastructure at the DR site.
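One way to derive such an order from a list of dependencies is a topological sort. The following Python sketch uses graphlib from the standard library (Python 3.9 or later); the VM names and the boot_order helper are hypothetical and simply mirror the examples above. It illustrates the idea only and is not how any particular product computes its recovery sequence.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a VM; its value is the set of VMs that must already be running.
# All names are hypothetical and mirror the examples in the text.
depends_on = {
    "web-server": {"app-server"},
    "app-server": {"db-server"},
    "db-server": set(),
    "file-server": {"domain-controller"},
    "domain-controller": set(),
}


def boot_order(dependencies):
    """Return a start order in which every VM comes after its dependencies.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = boot_order(depends_on)
print(order)
```

The relative order of unrelated VMs (for example, the database server and the domain controller) is not fixed, but every VM is guaranteed to appear after the VMs it depends on. A circular dependency raises an error, which is itself a useful finding during planning.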
Staff Requirements and Dependencies
When determining the dependency chain, take your staff into account as well. For example, a VM used by the accounting department might need to be recovered first if workers in other departments depend on those financial operations to function.
If you want your staff to work at the DR site, make sure you have workstations set up there with all the required equipment, office furniture, and hardware, so that your employees can continue their work to support your business operations with minimal interruptions. If your employees can work remotely from their homes or other places, configure VPN access and provide VPN accounts for them in advance.
Work with your staff to identify all these dependencies and take them into account when devising your site recovery plan. Otherwise, the whole recovery procedure may be prone to failure.
Determining Hardware Requirements
The success of your DR plan depends heavily on the performance and capabilities of the hardware located at your DR site. Several factors should be taken into account. Servers must have enough CPU, memory, and disk capacity to sustain transferred workloads. Insufficient CPU or memory slows down the recovered VMs, and slow disks degrade their I/O performance. Networks must provide enough bandwidth for the recovered VMs to interact with each other, with shared storage, and with users as necessary.
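A rough capacity check can be sketched in a few lines of Python. The Resources type, the capacity_shortfall helper, and all figures below are assumptions made for illustration; the sketch compares only raw totals and ignores CPU and memory overcommitment, disk speed, and network bandwidth, which need to be assessed separately.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    cpu_cores: int
    memory_gb: int
    disk_gb: int


def capacity_shortfall(available: Resources, required: list[Resources]) -> dict[str, int]:
    """Return how much of each resource is missing (empty dict if none is)."""
    shortfall = {}
    for field in ("cpu_cores", "memory_gb", "disk_gb"):
        needed = sum(getattr(vm, field) for vm in required)
        missing = needed - getattr(available, field)
        if missing > 0:
            shortfall[field] = missing
    return shortfall


# Hypothetical DR site capacity and requirements of the critical VMs.
dr_site = Resources(cpu_cores=32, memory_gb=128, disk_gb=4000)
critical_vms = [
    Resources(cpu_cores=8, memory_gb=64, disk_gb=1500),
    Resources(cpu_cores=8, memory_gb=48, disk_gb=1000),
    Resources(cpu_cores=4, memory_gb=32, disk_gb=500),
]

shortfall = capacity_shortfall(dr_site, critical_vms)
print(shortfall or "DR site can host all critical VMs")
```

With these assumed figures, CPU and disk capacity are sufficient, but the DR site is 16 GB short of memory, so either the hardware or the recovery scope would have to change.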
Planning is an essential step for effective site recovery. Every company wants to be well equipped for disasters and able to mitigate their consequences. To achieve this, you must evaluate your recovery needs, developing a comprehensive understanding of what components, steps, and procedures should be included in your recovery workflow. This blog post covered the fundamentals of such evaluation as well as the best practices for site recovery planning. The next blog post of this series on Site Recovery covers preparing your infrastructure for site recovery with the help of NAKIVO Backup & Replication.