June 7, 2017
RPO and RTO: What Is the Difference and Why Should You Care?
The ultimate goal of data protection is clear: you want to be sure that you will not lose data if something goes wrong. However, mirroring every change in your virtual environment to a DR site is quite costly. That is why you need to accept that you will lose some data and that your IT services will be interrupted in case of an outage. Your task, then, is to minimize those losses and interruptions. You need to come up with realistic estimates of both, based on your business needs and your capabilities. Simply put, you need to calculate how much data you can tolerate losing and how long your IT services can be down without severe consequences for your business. These numbers are the RPO and RTO (recovery point objective and recovery time objective).
Let’s illustrate those with a simple diagram:
The diagram tells a simple and common story: we had a virtual machine, one (not very) lovely day it crashed, but we (heroically) restored it, and everything went fine. There are two colored stripes on the timeline, so let’s explain what they are.
What Is RTO?
The orange one represents the RTO – the time during which your VM was unavailable. You need some time to:
- Detect a failure
- Decide what to do
- Take action to fix the problem
During this time the failed VM does not provide its services, so (if it’s a VIP VM) your business freezes. Your clients cannot add an order to a cart. Your employees cannot send and receive emails. Your boss cannot view financial reports. To summarize, your business loses time and money. So the RTO is the maximum downtime your business can afford, and your goal is to restore services within this period.
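As a rough illustration, the three phases can be added up and compared against the RTO. This is only a sketch: the function names and all the durations below are made up for the example, and real numbers would come from your own incident records.

```python
from datetime import timedelta


def total_downtime(detect: timedelta, decide: timedelta, act: timedelta) -> timedelta:
    """Downtime is the sum of the three recovery phases."""
    return detect + decide + act


def meets_rto(downtime: timedelta, rto: timedelta) -> bool:
    """True if services were restored within the recovery time objective."""
    return downtime <= rto


# Hypothetical figures for a single outage.
downtime = total_downtime(
    detect=timedelta(minutes=10),
    decide=timedelta(minutes=15),
    act=timedelta(minutes=45),
)
rto = timedelta(hours=1)

print(downtime, meets_rto(downtime, rto))  # 1:10:00 False
```

In this made-up case the fix itself took 45 minutes, but detection and decision-making pushed the total past the one-hour objective, which is why all three phases count towards the RTO.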
What Is RPO?
The yellow stripe shows the RPO. You could be lucky and have a backup or replication job finish just before the original VM failed. However, this is rare, so there is usually a gap between the moment the last successful backup was made and the moment the original VM failed. During this time, the VM was performing operations and storing data, and most likely this data will be lost. So the RPO defines how much data you can afford to lose, expressed as a period of time. Your business must be able to withstand a loss of that size without critical consequences.
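The gap described above is simply the difference between two timestamps. Here is a minimal sketch, assuming you know when the last successful backup finished and when the VM failed; the function name, the dates, and the four-hour RPO are all illustrative.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Time span whose changes are not covered by any backup."""
    if failure < last_backup:
        raise ValueError("failure must not precede the last successful backup")
    return failure - last_backup


# Hypothetical timestamps: nightly backup at 02:00, crash at 09:30.
last_backup = datetime(2017, 6, 7, 2, 0)
failure = datetime(2017, 6, 7, 9, 30)
rpo = timedelta(hours=4)

window = data_loss_window(last_backup, failure)
print(window, window <= rpo)  # 7:30:00 False
```

With these example values, seven and a half hours of changes are lost, which is well outside a four-hour RPO.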
Defining RPO and RTO
Defining RPO and RTO depends on how critical the services provided by a VM are, and how busy they are. For example, a file server that stores photos from the company’s corporate parties can certainly tolerate a longer RPO and RTO. Meanwhile, a corporate email server must be restored as soon as possible, because its failure paralyzes internal communications and internal email contains important data. Thus, its RPO and RTO must be as short as possible.
Differences Between the Recovery Objectives
To understand how to determine RTO and RPO, you should look at their differences as well as what their role is in the DR process.
RTO is primarily concerned with the period of time within which business operations are expected to resume after a disaster. Therefore, you should first assess your business needs and priorities, as they are unique to each organization. Moreover, consider which applications are the most critical for the overall performance of your business, as well as what the repercussions may be if these applications were to fail. As a result, you can better determine the order in which systems should be restored to ensure successful disaster recovery with minimal losses from downtime.
RPO, on the other hand, is more focused on the amount of data that can be lost during downtime without causing any serious damage to your business. To calculate RPO, identify how frequent your backups are, as well as how much data might be lost between the latest VM backup and an actual disaster. Consider whether your business can afford to lose such an amount of data, as well as what the potential risks of losing this amount of data may be.
The main difference between RTO and RPO is that the former takes into account all aspects of the business structure and the DR process as a whole, whereas the latter only considers the criticality of data and applications for business continuity. Therefore, meeting RTO values might be a more demanding and expensive task than meeting RPO metrics (backups).
Due to the fact that RPO is focused on data and your system’s resiliency to loss, it is recommended that you run frequent data backups. Many modern virtual machine backup solutions allow you to perform automated VM backups, meaning that your backup strategies can be tailored in a way that meets your RPO goals efficiently, and with minimal input on your part.
At the same time, achieving RTO is a more complex process to manage, as it takes into account all business processes and system components that need to be recovered during a DR event. Therefore, it is extremely difficult to automate and orchestrate the entire DR process from start to finish while also ensuring that your RTO goals can be met.
Ease of calculation
RPO metrics are much easier to calculate, as they only cover one aspect of the recovery process – data usage. To define the RPO that is applicable to your organization, answer the following questions:
- How much data can your organization afford to lose?
- How often do you back up business-critical systems?
After answering these questions, consider whether the expected results can satisfy your current business needs. If not, think of how you could improve your backup strategies in order to keep backed up data as current as possible.
On the other hand, RTO considers all aspects of your organization, including the importance of your data and services, the cost of downtime, investment in DR activities, etc. Therefore, it is advisable to calculate the RTO on the basis of a business continuity plan, which outlines possible business risks and threats, as well as describes the steps to be taken to resume business operations. After calculating RTO metrics, you can critically look at how well your system is prepared for a DR event and whether any adjustments are required to improve your DR strategies.
Achieving Tighter RPO and RTO with NAKIVO Backup & Replication
NAKIVO Backup & Replication has some great features to satisfy your RPO (by allowing you to make VM backups more frequently) and RTO (by providing instant VM recovery and VM replication). The simplest way to meet your RPO is to schedule regular backups at an interval no longer than your RPO.
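One way to check that rule against reality is to look at when backups actually completed and flag any gap longer than the RPO. The sketch below is product-agnostic: the function name and the timestamps are hypothetical, and it does not use any NAKIVO API. In practice the timestamps would come from your backup job history.

```python
from datetime import datetime, timedelta


def gaps_exceeding_rpo(backup_times, rpo):
    """Return (previous, next) pairs of consecutive backups spaced further apart than the RPO."""
    ordered = sorted(backup_times)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > rpo]


# Hypothetical completion times: the 08:00 backup was missed.
backups = [
    datetime(2017, 6, 7, 0, 0),
    datetime(2017, 6, 7, 4, 0),
    datetime(2017, 6, 7, 12, 0),
    datetime(2017, 6, 7, 16, 0),
]
gaps = gaps_exceeding_rpo(backups, rpo=timedelta(hours=4))

for start, end in gaps:
    print(f"RPO exceeded between {start} and {end}")
```

A failure anywhere inside a flagged gap could have cost more data than the RPO allows, so skipped or failed jobs deserve the same attention as the schedule itself.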
As mentioned before, the recovery time consists of three components: detect, decide, act. The first two can be covered and even automated with API integration with your network monitoring services, so you can trigger a recovery process immediately after a VM becomes unavailable.
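To make the detect and decide steps concrete, here is a minimal sketch of a health-check loop that starts recovery after several consecutive failed probes. Everything in it is an assumption for illustration: `is_alive` and `start_recovery` are hypothetical callbacks standing in for your monitoring check and for whatever call starts a recovery job, and the threshold of three failures is arbitrary. It is not the NAKIVO API.

```python
from typing import Callable


def watch_and_recover(
    is_alive: Callable[[], bool],
    start_recovery: Callable[[], None],
    checks: int,
    failure_threshold: int = 3,
) -> bool:
    """Probe the VM up to `checks` times; trigger recovery once
    `failure_threshold` probes in a row have failed.

    Returns True if recovery was triggered.
    """
    consecutive_failures = 0
    for _ in range(checks):
        if is_alive():
            # Detect: a healthy probe resets the failure streak.
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            # Decide: several failures in a row are treated as a real outage.
            if consecutive_failures >= failure_threshold:
                # Act: hand over to the recovery procedure.
                start_recovery()
                return True
    return False


# Simulated probe results instead of a real network check.
probe_results = iter([True, False, False, False, True])
triggered = []

result = watch_and_recover(
    is_alive=lambda: next(probe_results),
    start_recovery=lambda: triggered.append("recovery started"),
    checks=5,
)
print(result, triggered)  # True ['recovery started']
```

Requiring a few failures in a row avoids starting a recovery over a single dropped probe, at the cost of a slightly longer detection time. A real loop would also wait between probes, which is left out here to keep the example short.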
Speaking of taking action, you can keep replicas (exact copies) of important VMs, so if the original VM fails, its replica can be powered on automatically. However, if you cannot maintain replicas (they require a dedicated virtual infrastructure or at least a dedicated host), you can run a Flash VM Boot, which still provides good recovery times.
To meet short RTOs, NAKIVO Backup & Replication has introduced a special feature – Site Recovery (SR). By arranging actions and conditions into an automated algorithm, you can be sure that every aspect of the DR process is handled. You can create multiple SR jobs, each of which can be tailored to address a specific DR scenario (e.g. fire or power failure) or serve a certain purpose (e.g. planned migration or disaster avoidance). With a comprehensive SR job in place, you can considerably reduce downtime and ensure business continuity.
With NAKIVO Backup & Replication, you can even run SR jobs in test mode, without disrupting the production environment. SR job testing allows you to verify that all DR activities can be performed as planned, that VMs can be successfully recovered, and that the expected RTOs can be met. SR job testing can be performed either on demand or according to a schedule that you set up. Note that all changes made during the test are discarded after the job completes, and the system is reverted to its original state.
After the test run is completed, you receive an email report with the job results and any issues that were detected. If issues were detected, you can make the necessary adjustments to the SR job and run the test again.
To wrap up:
- RPO and RTO are the metrics for your data protection plan
- Recovery point objective (RPO) is the maximum time between two consecutive backups
- Recovery time objective (RTO) is the maximum time to recover VMs
- The more important your VMs, the lower the RPO and RTO values must be
Download our white papers "How to Calculate a Recovery Time Objective and Cut Downtime Costs" and "Achieving Tighter RPOs: Methods and Tools" for more information on the topic.