September 10, 2018
Site Recovery with NAKIVO Backup & Replication Part 3: Creating a Workflow
In the previous blog posts of this series, you were introduced to the new Site Recovery functionality in NAKIVO Backup & Replication v8. The importance of site recovery planning was discussed, and a walkthrough for creating a replication job was provided.
This blog post explains site recovery workflows. The various actions that can be included in site recovery workflows are explored, along with potential sequences in which they could be used to suit the requirements laid out in a disaster recovery plan.
What Is a Site Recovery Workflow?
A site recovery workflow is a sequence of actions executed to complete the disaster recovery (DR) process. Site recovery jobs in NAKIVO Backup & Replication allow you to automate the site recovery workflow execution. Site recovery job testing involves executing the workflow in a non-disruptive “test mode” in order to verify that it runs smoothly and that recovery time objectives (RTOs) are met. The site recovery job can be run in production mode when recovery after a disaster is needed.
A site recovery job action is a single task included in a site recovery job. Any number of actions can be included in a single site recovery job. Some of the actions can be executed multiple times depending on the logic used to connect the steps. Each action you add to a site recovery job in NAKIVO Backup & Replication can be executed in test mode only, in production mode only, or in both modes (the default).
Determining the Site Recovery Sequence
Some actions may depend on the result of other actions. For example, you cannot run a script on a VM that has not been started yet. For this reason, you should define the order in which the actions are executed. When creating a site recovery job, you can add actions, then move them up or down within your workflow to change the execution order. You can set the waiting behavior for most actions: either Wait for this action to complete or Start next action immediately. If you select the latter option, multiple actions can be executed simultaneously. For example, if there are no dependencies between the relevant VMs, you could launch one failover action immediately after launching another so they execute in parallel.
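The two waiting behaviors can be pictured with a small Python sketch. This is only an illustration of the idea, not how NAKIVO Backup & Replication is implemented: the `failover` stub, the `run_actions` helper, and the VM names are all hypothetical, and the "failover" is simulated with a short sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def failover(vm: str) -> str:
    """Hypothetical stand-in for a failover action; sleeps to simulate work."""
    time.sleep(0.1)
    return f"{vm} failed over"


def run_actions(actions):
    """Run (function, argument, wait) tuples in the listed order.

    wait=True  models "Wait for this action to complete".
    wait=False models "Start next action immediately".
    Results are returned in the order the actions were listed.
    """
    futures = []
    with ThreadPoolExecutor() as pool:
        for func, arg, wait in actions:
            future = pool.submit(func, arg)
            if wait:
                future.result()  # block until this action has finished
            futures.append(future)
        return [f.result() for f in futures]


# VM-A and VM-B have no dependencies, so they run in parallel;
# VM-C is started once it is submitted, and the job waits for it.
print(run_actions([
    (failover, "VM-A", False),
    (failover, "VM-B", False),
    (failover, "VM-C", True),
]))
```

In this sketch, setting `wait=False` on independent actions shortens the total run time, while `wait=True` is the safe choice whenever a later action depends on the result.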
Creating a Site Recovery Workflow
The new Site Recovery functionality allows you to create complex site recovery jobs by combining actions and conditions. You can include any or all of the following actions in your site recovery workflows:
- Failover – initiates failover to replica VMware VMs, Hyper-V VMs, or EC2 instances.
- Failback – returns workloads from the VM replica to the source VM. The changes made in the VM replica since the point of failover are written to the source VM when the failback operation is performed. The VMs are synchronized and the source VM is in the actual production state again.
- Start – starts VMware VMs, Hyper-V VMs, or EC2 instances.
- Stop – stops VMware VMs, Hyper-V VMs, or EC2 instances that are running.
- Run job – runs a backup job, replication job, site recovery job, backup copy job, or Flash VM Boot job.
- Stop jobs – stops a job (any of the jobs listed in the previous bullet).
- Run script – runs a script on one of the following targets: the server with the Director, a Remote Windows Server, a Remote Linux Server, a VMware VM, a Hyper-V VM, or an EC2 instance.
- Attach repository – attaches a backup repository used by NAKIVO Backup & Replication to store backups.
- Detach repository – detaches a backup repository.
- Send email – sends an email with the message you compose to one or more defined recipients.
- Wait – waits for the designated period of time before proceeding to the next action.
- Check condition – based on your input (all or part of a resource name), checks one of the following conditions:
  - The resource exists
  - The resource is running
  - IP/Hostname is reachable
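To get a feel for what a reachability check involves, here is a minimal Python sketch. How the product performs its own "IP/Hostname is reachable" check is not described in this post, so the TCP connection attempt below is an assumption made for illustration, and `is_reachable` is a hypothetical helper.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Assumption: reachability is tested with a TCP connect. Name resolution
    failures, refused connections, and timeouts all count as "not reachable".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For instance, `is_reachable("fs-vm-replica.example.local", 445)` would test whether a file server answers on the standard SMB port; the hostname here is made up.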
You can create flexible disaster recovery workflows by using different combinations of these actions. Let’s consider how to build a site recovery job with an example.
Suppose you have a primary (production) site and a DR site. You have some VMware VMs at the production site, including the following:
- DC-VM is a Windows-based VM running Active Directory Domain Controller.
- FS-VM is a Windows-based VM with a file server running (SMB protocol is used for file sharing). Active Directory is used for user authentication. Oracle database dumps are stored on the file server.
- Ora-DB is the VM on which the Oracle database is running.
The disaster recovery site contains the following VMs:
- DC-VM-replica and FS-VM-replica are replicas of the VMs residing at the production site. They can be used as targets for failover.
- DB-VM is a Linux-based VM with Oracle Database software installed, but there are no databases on this VM.
A database is backed up at the database level to FS-VM on the production site (this Oracle database backup is application-consistent). FS-VM and DC-VM are replicated at the host level to the DR site with NAKIVO Backup & Replication.
When disaster strikes and the production site goes down, the components must be recovered at the DR site as follows:
First, fail over DC-VM.
Once DC-VM is up, fail over FS-VM. You must proceed in this order because FS-VM relies on DC-VM for user authentication on the file server.
Once these two VMs are running, DB-VM can access the shared directory on the file server where the dump is stored. Now DB-VM can be started.
Once DB-VM is running, run a script that can restore the database from the dump located on the file server. The blue arrows in the diagrams above indicate the dependencies. Please note that you may need to wait for a few moments before services are up and running on a powered-on VM.
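The recovery order follows directly from the dependencies. As a rough illustration, the same order can be derived in Python with the standard library's `graphlib` module (Python 3.9 or later). The dictionary is simply a restatement of the dependencies described above, and the "restore script" label is just a name chosen for this sketch.

```python
from graphlib import TopologicalSorter

# Each key depends on every item in its set.
dependencies = {
    "DC-VM": set(),                  # nothing has to be up before the domain controller
    "FS-VM": {"DC-VM"},              # the file server authenticates users against DC-VM
    "DB-VM": {"DC-VM", "FS-VM"},     # DB-VM reads the dump from the file server share
    "restore script": {"DB-VM"},     # the script runs on DB-VM
}

# static_order() yields every item after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['DC-VM', 'FS-VM', 'DB-VM', 'restore script']
```

With only four components the order is easy to work out by hand, but listing dependencies this way can help when a DR plan covers dozens of VMs.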
For this situation, you would create a site recovery job in NAKIVO Backup & Replication with the following logic:
Action 1. Fail over DC-VM. Wait until this action is complete before proceeding to the next step. Stop the job if this action fails.
Action 2. Wait for 3 minutes.
Action 3. Check condition of DC-VM. Check whether the resource is running. If so, then continue this site recovery job. If not, then stop and fail the site recovery job.
Action 4. Fail over FS-VM. Wait until this action is complete before proceeding to the next action. Stop the job if this action fails.
Action 5. Wait for 3 minutes.
Action 6. Check condition of FS-VM. Check whether the resource is running. If so, then proceed to the next step of the site recovery job. If not, then stop and fail the site recovery job.
Action 7. Start DB-VM. Wait until this action is complete before proceeding to the next action. Stop the job if this action fails.
Action 8. Wait for 5 minutes.
Action 9. Run script. Target type: VMware VM. Target VM: DB-VM. Script path: /home/oracle/restore_db.sh (when adding this step, you must input the username and password of an account with sufficient permissions to run the script).
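The logic of this plan, running the actions in order and stopping the job as soon as a critical action fails, can be sketched in a few lines of Python. This is a simplified model for illustration only: the `Action` class, the `run_workflow` function, and the `succeed` stub are hypothetical and are not part of NAKIVO Backup & Replication.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Action:
    name: str
    run: Callable[[], bool]        # hypothetical callable: True means success
    stop_on_failure: bool = True   # models "Stop the job if this action fails"


def run_workflow(actions: List[Action]) -> Tuple[bool, List[str]]:
    """Run actions in order; return (ran to the end?, names of actions that succeeded)."""
    completed: List[str] = []
    for action in actions:
        if action.run():
            completed.append(action.name)
        elif action.stop_on_failure:
            return False, completed   # stop and fail the job
    return True, completed


def succeed() -> bool:
    """Stub standing in for real work such as a failover, a wait, or a check."""
    return True


workflow = [
    Action("1. Fail over DC-VM", succeed),
    Action("2. Wait for 3 minutes", succeed),
    Action("3. Check condition of DC-VM", succeed),
    Action("4. Fail over FS-VM", succeed),
    Action("5. Wait for 3 minutes", succeed),
    Action("6. Check condition of FS-VM", succeed),
    Action("7. Start DB-VM", succeed),
    Action("8. Wait for 5 minutes", succeed),
    Action("9. Run script", succeed),
]

ok, done = run_workflow(workflow)
print(ok, len(done))  # True 9
```

If the check in Action 3 returned False instead, the sketch would stop there and report only the first two actions as completed, which mirrors the "stop and fail" behavior in the plan.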
Site Recovery Walkthrough
Let’s create a new site recovery job by using the plan outlined above. On the home page of your NAKIVO Backup & Replication instance, click Create > Site recovery job.
The New Site Recovery Job Wizard is launched. In the left panel, you can see actions that can be added to your site recovery job. Simply click them to compose your workflow.
Note: VMware VMs are considered in this example. One site recovery job can comprise actions for one virtualization platform (VMware, Hyper-V, or AWS EC2).
In the left panel, click Failover VMware VMs.
In the left panel, select the VM replica from a replication job you have already created (see our previous blog post for a walkthrough on creating replication jobs in preparation for site recovery). In our workflow, failover to DC-VM-replica is the first action. In the right panel, you can select a recovery point. The latest recovery point is used by default. Click Next to continue.
Select the options for the failover action. You can tick the Power off source VMs checkbox; this option can be used to prevent a conflict of IP addresses if the source VMs and replicas use the same networks. In this walkthrough, in accordance with the workflow logic outlined above, the following options are selected:
- Run this action in: Run this action in both testing and production mode.
- Waiting behavior: Wait for this action to complete.
- Error handling: Stop and fail the job if this action fails.
Click Save to save the created action.
Action 2. In the left panel of the Actions interface, click Wait.
Now configure the options for the wait action. Select the time to wait (3 minutes is used for the purposes of this walkthrough). Some time may be needed for services to start on the VM replica that was powered on after the failover action. A wait action is useful in this case because the following failover action in the workflow (failover to FS-VM replica) requires the DC-VM replica to be up and already running Active Directory Domain Services. Select the same action options as for the first action and click Save.
The new action is added after the previous action, at the bottom of the list. You can reorder, edit, or remove the existing actions. Simply hover your mouse over the action to see these options.
Action 3. In the left panel of the Actions interface, click Check condition. This is where the product should check whether the VM that was failed over in the first action is running.
Configure this action as follows:
- Select condition type: Resource is running. (Other options are resource exists, IP/hostname is reachable.)
- Select resource type: VMware VM. (Other options are Hyper-V VM, EC2 instance.)
- Select identification method: Name (the other option is ID). This is how you identify the VM in question. You can use any part of the VM’s name. Here, we know the exact name, so we have used the Equals function.
- Define the search string: DC-VM-replica.
This action checks if the VMware VM named DC-VM-replica is running. Click Save to proceed.
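The difference between matching the exact name and matching part of a name can be shown with a short Python sketch. The `find` helper and the mode names are hypothetical: "equals" mirrors the Equals function used here, while "contains" is an assumed label for matching part of a name.

```python
from typing import List


def find(names: List[str], search: str, mode: str = "equals") -> List[str]:
    """Return the names that match the search string under the given mode."""
    if mode == "equals":
        return [n for n in names if n == search]
    if mode == "contains":
        return [n for n in names if search in n]
    raise ValueError(f"unknown mode: {mode}")


inventory = ["DC-VM", "DC-VM-replica", "FS-VM-replica"]

print(find(inventory, "DC-VM-replica", "equals"))  # ['DC-VM-replica']
print(find(inventory, "DC-VM", "contains"))        # ['DC-VM', 'DC-VM-replica']
```

A partial match on "DC-VM" would pick up both the source VM and its replica, which is why an exact match is the safer choice when you know the full name.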
Action 4. As in Action 1, click Failover VMware VMs.
Again, select the VM replica. FS-VM-replica is selected in this case. Click Next, then select the same options for the failover action as in Action 1 and click Save.
Action 5. Click Wait and configure this action as you did for action 2. The time specified is again 3 minutes in this walkthrough.
Action 6. Click Check condition in order to check whether the VMware VM FS-VM-replica is running. Refer to action 3 and select the same options, except, of course, for the VM name.
Action 7. Click Start VMware VMs in the left panel of the Actions interface of the New Site Recovery Job Wizard.
Select DB-VM. This VM can be started once you are sure that the FS-VM-replica is running. At the bottom of the page, select the same action options as in the previous actions. Then click Save.
Action 8. Wait for 5 minutes. Click Wait and configure this action as you did for action 2, but set the time to wait to 5 minutes. This should be enough time for the Oracle service to start on DB-VM.
Action 9. In the Actions interface, click Run script. Recall from the workflow outlined above that this script is intended to recover the Oracle database at the database level from a dump stored on FS-VM-replica.
Define the script options. In this case:
- Target type: VMware VM
- Target VM: DB-VM
- Script path: /home/oracle/restore_db.sh
- Username: oracle
- Password: (password)
Your script path, username, and password will likely differ. Make sure that the script file is executable and that the user has sufficient permissions to run the script. Action options are configured as usual in this example. Click Save when you are ready to continue.
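One way to check these prerequisites ahead of time is a quick test run on the target VM under the same user account. The Python sketch below is only a suggestion; `script_is_runnable` is a hypothetical helper, and the check reflects the permissions of whichever user runs it.

```python
import os


def script_is_runnable(path: str) -> bool:
    """Return True if path is an existing regular file that the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


# Example: run this as the account entered in the Run script action.
print(script_is_runnable("/home/oracle/restore_db.sh"))
```

If the function returns False for a script that exists, setting the execute bit (for example, with `chmod +x`) is usually the first thing to try.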
Now you can see all actions configured. Click the Next button to continue configuration of the site recovery job with the Wizard.
2. Network Mapping
If your VMs at the production site and DR site are connected to different networks, then tick the Enable network mapping checkbox. Click Create new mapping, then, in the pop-up window, select a source network, a destination network, and a network used for site recovery job testing. Click Save to save the network mapping rule, then click Next. (Alternatively, you can use existing mapping rules if you have configured them in other replication, failover, or site recovery jobs.)
3. Re-IP
If the networks used for VM connection at the source site and target site have different addresses, then you should enable Re-IP by ticking the Enable Re-IP checkbox. Once Re-IP is enabled, create a new Re-IP rule by clicking Create new rule. Define the source settings and target settings, then click Save. Read the blog post to learn more about configuration of network mapping and Re-IP. (Alternatively, you can use existing Re-IP rules.)
Click Select VMs and tick the checkboxes next to the VMs for which Re-IP should be used. You should provide the credentials of a user with sufficient permissions to change network settings in the guest operating system of the VM.
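Conceptually, a Re-IP rule translates an address in the source network into an address in the target network. The Python sketch below shows one possible translation that keeps the host part of the address unchanged. How the product applies its rules is not covered in this post, so treat the `re_ip` function and the example subnets as assumptions made for illustration.

```python
import ipaddress


def re_ip(address: str, source_net: str, target_net: str) -> str:
    """Map an address from source_net to the same host offset in target_net."""
    src = ipaddress.ip_network(source_net)
    dst = ipaddress.ip_network(target_net)
    ip = ipaddress.ip_address(address)
    if ip not in src:
        raise ValueError(f"{address} is not in {source_net}")
    if src.prefixlen != dst.prefixlen:
        raise ValueError("source and target networks must be the same size")
    offset = int(ip) - int(src.network_address)   # position of the host within the subnet
    return str(dst.network_address + offset)


# Made-up example: production uses 192.168.1.0/24, the DR site uses 10.0.5.0/24.
print(re_ip("192.168.1.20", "192.168.1.0/24", "10.0.5.0/24"))  # 10.0.5.20
```

Keeping the host offset the same at both sites makes it easy to tell which replica corresponds to which source VM.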
4. Test Schedule
A schedule is enabled only for the purposes of running site recovery jobs in test mode. This allows you to test whether your site recovery job can be run successfully within the appropriate time frames. Once you have configured the scheduling as desired, click Next. A detailed walkthrough of Site Recovery job testing is included in the next blog post of this series.
5. Job Options
Type the job name and recovery time objective (RTO). Click Finish when configuration is complete.
Now you know how to create and configure a site recovery job based on a logical workflow with NAKIVO Backup & Replication. Read the next blog posts to learn more about testing your site recovery jobs as well as the failover and failback actions used for site recovery.