RPO and RTO: What Is the Difference and Why You Should Care?
Michael Bose, posted on June 7, 2017
The ultimate goal of data protection is clear: you want to be sure that you will not lose data if something goes wrong. However, mirroring every change in your virtual environment to a DR site is quite costly. That is why you need to accept that you will lose some data and that your IT services will be interrupted in case of an outage. Your task, then, is to minimize those losses and interruptions. You need to come up with realistic estimates of both, based on your business needs and your capabilities. Simply put, you need to calculate how much data you can afford to lose and how long your IT services can be down without severe consequences for your business. These numbers are the RPO and RTO (recovery point objective and recovery time objective).
Let’s illustrate those with a simple diagram:
The diagram tells a simple and common story: we had a virtual machine, one (not very) lovely day it crashed, but we (heroically) restored it, and everything went fine. There are two colored stripes on the timeline, so let’s explain what they are.
The orange one represents the RTO – the time during which your VM was unavailable. You need some time to:
- Detect a failure
- Decide what to do
- Take action to fix the problem
During this time the failed VM does not provide its services, so (if it is a VIP VM) your business freezes. Your clients cannot add an order to a cart. Your employees cannot send and receive emails. Your boss cannot view financial reports. In short, your business loses time and money. So the RTO is the maximum downtime your business can afford, and your goal is to restore services within this period.
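To make the arithmetic concrete, here is a minimal Python sketch that adds up the three phases listed above (detect, decide, act) and compares the total with an RTO. The function names and all durations are hypothetical values chosen for illustration; they are not measurements or part of any product.

```python
from datetime import timedelta


def total_recovery_time(detect, decide, act):
    """Add up the three phases of an outage: detect, decide, act."""
    return detect + decide + act


def meets_rto(detect, decide, act, rto):
    """Return True if the whole recovery fits within the RTO."""
    return total_recovery_time(detect, decide, act) <= rto


# Hypothetical figures, for illustration only.
rto = timedelta(hours=1)
phases = {
    "detect": timedelta(minutes=10),
    "decide": timedelta(minutes=15),
    "act": timedelta(minutes=25),
}

print(total_recovery_time(**phases))  # 0:50:00
print(meets_rto(**phases, rto=rto))   # True
```

In this example the recovery takes 50 minutes against a one-hour RTO, so the objective is met. If any single phase grows by more than 10 minutes, it is not.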
The yellow stripe shows the RPO. You could be lucky and have a backup or replication job finish just before the original VM failed. However, this is a rare case, so there is usually a gap between the moment the last successful backup was made and the moment the original VM failed. During this time, the VM was performing operations and storing data, and most likely this data will be lost. So the RPO defines how much data, measured in time, you can afford to lose. Your business must be able to withstand a loss of that size without critical consequences.
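The gap described above is simple to compute once you know two timestamps. The following Python sketch, with made-up timestamps and hypothetical function names, measures the data loss window for a single incident and checks it against an RPO.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup, failure):
    """Time between the last successful backup and the failure."""
    if failure < last_backup:
        raise ValueError("failure must not precede the last backup")
    return failure - last_backup


def within_rpo(last_backup, failure, rpo):
    """Return True if the data lost in this incident stays within the RPO."""
    return data_loss_window(last_backup, failure) <= rpo


# Hypothetical incident: nightly backup at 02:00, VM crash at 09:30.
last_backup = datetime(2017, 6, 7, 2, 0)
failure = datetime(2017, 6, 7, 9, 30)

print(data_loss_window(last_backup, failure))                # 7:30:00
print(within_rpo(last_backup, failure, timedelta(hours=4)))  # False
```

Here seven and a half hours of changes are lost, which exceeds a four-hour RPO. A nightly backup is therefore not frequent enough for that VM.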
Defining RPO and RTO depends on how critical the services provided by a VM are, and how busy they are. For example, a file server used to store photos from the company’s corporate parties can certainly tolerate a longer RPO and RTO. Meanwhile, a corporate email server must be restored as soon as possible, because its failure paralyzes internal communications and internal email contains important data. Thus, its RPO and RTO must be as short as possible.
NAKIVO Backup & Replication has some great features to satisfy your RPO (by allowing you to make VM backups more frequently) and RTO (by providing instant VM recovery and VM replication). The simplest way is to schedule regular backups at an interval no longer than your RPO. As mentioned before, the recovery time consists of three components: detect, decide, act. The first two can be covered and even automated through API integration with your network monitoring services, so you can trigger a recovery process immediately after a VM becomes unavailable.
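The scheduling rule above (a backup interval no longer than the RPO) can be checked with a few lines of Python. This is a planning sketch only: the function names are hypothetical, it does not call any NAKIVO API, and it ignores how long a backup job itself takes to run.

```python
import math
from datetime import timedelta


def schedule_satisfies_rpo(interval, rpo):
    """Return True if backups run at least as often as the RPO requires."""
    return interval <= rpo


def min_backups_per_day(rpo):
    """Smallest number of evenly spaced daily backups that respects the RPO."""
    if rpo <= timedelta(0):
        raise ValueError("RPO must be a positive duration")
    return math.ceil(timedelta(days=1) / rpo)


# Hypothetical RPO of 4 hours.
rpo = timedelta(hours=4)

print(schedule_satisfies_rpo(timedelta(hours=24), rpo))  # False
print(schedule_satisfies_rpo(timedelta(hours=3), rpo))   # True
print(min_backups_per_day(rpo))                          # 6
```

With a four-hour RPO, a daily backup falls short, while a backup every three hours (or at least six evenly spaced backups per day) satisfies it.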
Speaking of taking action, you can keep replicas (exact copies) of important VMs, so that if the original VM fails, they are powered on automatically. However, if you cannot maintain replicas (they require a dedicated virtual infrastructure, or at least a host), you can run a Flash VM Boot, which still provides good recovery times.
- RPO and RTO are the metrics for your data protection plan
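- Recovery point objective (RPO) is the maximum amount of data, measured in time, that you can afford to lose; the interval between two consecutive backups must not exceed it
- Recovery time objective (RTO) is the maximum acceptable time to recover VMs and restore their services
- The more important your VMs, the lower the RPO and RTO values must be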