May 29, 2017
How Does Forever-Incremental Backup Work on VMs?
The simplest way to back up a VM is to make a full backup. You just need to copy the entire virtual machine to a backup repository. This simplicity has its price: a full backup takes a long time, loads production networks, and uses a significant amount of space. One of the ways to reduce the amount of backup data is to use a forever-incremental backup.
At first glance, the concept of backups seems to be easy. You just copy data from a VM to some storage (a backup repository), and then, if needed, you put back the necessary data. However, this comes at a cost. If you run a full backup of every VM in your environment every time, you will soon get tired of buying new HDDs for your backup repository. For example, if you have 10 VMs with Windows Server 2016 (which requires 32 GB for a clean install) and you need to back them all up every day, you will end up using 320 GB of free space every day, or more than 2 TB weekly! Simply put, you’ll need a few new disks every week. The odds are that your finance department will not be happy with such spending.
The other obstacle is that all this data needs to travel to the backup repository. Often, the data flows through production networks, slowing down business-critical routines. Even if you schedule the backup outside business hours, a full backup of such a huge amount of data can still impact other operations.
One of the ways to decrease this vast amount of data is to use backup techniques other than the full backup, namely differential and incremental backups. In this post, we will focus on how the incremental one works, while another post compares it with the differential one.
As its name suggests, incremental backup deals with increments, i.e., changes made since the last backup. This way, the least possible amount of data is transferred to the backup repository. Of course, there must be a starting point, so an initial full backup must be made first. Let’s see an example of how incremental backup works.
Say we have three files in a VM. Each file consists of four data blocks numbered 1 through 4, and the initial full backup is made on Sunday.
On Monday, we change block 1 to block 5 in File 1. The backup application does not copy all three files, or even all four blocks of File 1. It copies only the one changed block of File 1 and records that this block must replace block 1.
On Tuesday, we add blocks 6 and 7 to File 2. Again, only those changes are copied to the backup repository.
On Wednesday, we delete File 3. No data is sent, only a record that File 3 was deleted.
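The three days above can be sketched in a few lines of Python. This is only an illustration, not how any particular backup product stores increments: the `make_increment` function and the dictionary layout are hypothetical, block labels stand in for block contents, and files are assumed never to shrink.

```python
def make_increment(previous, current):
    """Describe how to turn `previous` into `current`, keeping only new data."""
    changed = {}
    for name, blocks in current.items():
        old_blocks = previous.get(name, [])
        # Keep only positions that are new or whose content differs.
        changes = {
            position: block
            for position, block in enumerate(blocks)
            if position >= len(old_blocks) or old_blocks[position] != block
        }
        if changes:
            changed[name] = changes
    # Deleted files are recorded by name only; no data is copied for them.
    deleted = [name for name in previous if name not in current]
    return {"changed": changed, "deleted": deleted}


# Sunday: initial full backup of three files with blocks 1 through 4.
sunday = {name: [1, 2, 3, 4] for name in ("File 1", "File 2", "File 3")}

# Monday: block 1 of File 1 becomes block 5.
monday = {**sunday, "File 1": [5, 2, 3, 4]}

# Tuesday: blocks 6 and 7 are appended to File 2.
tuesday = {**monday, "File 2": [1, 2, 3, 4, 6, 7]}

# Wednesday: File 3 is deleted.
wednesday = {name: blocks for name, blocks in tuesday.items() if name != "File 3"}

monday_increment = make_increment(sunday, monday)
tuesday_increment = make_increment(monday, tuesday)
wednesday_increment = make_increment(tuesday, wednesday)

print(monday_increment)     # one changed block in File 1
print(tuesday_increment)    # two appended blocks in File 2
print(wednesday_increment)  # only a deletion record for File 3
```

Positions in the increments are zero-based, so the Monday change to block 1 appears at position 0.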
Let's go back to the example of 10 VMs with Windows Server 2016. First, you need to make an initial full backup, which takes 320 GB of space in total. If we assume that daily changes on each machine average 1 GB, that adds up to 10 GB of data daily, or 70 GB per week. That is a lot less than the 2 TB that a week of daily full backups would take, right?
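The arithmetic behind this comparison is easy to check. The sketch below uses only the figures from the example; the 1 GB of daily changes per VM is the same assumption as in the text, and the variable names are ours.

```python
VM_COUNT = 10
FULL_SIZE_GB = 32      # clean Windows Server 2016 install, per the example
DAILY_CHANGE_GB = 1    # assumed average daily change per VM
DAYS_PER_WEEK = 7

# Daily full backups: every VM is copied in full every day.
full_daily_gb = VM_COUNT * FULL_SIZE_GB
full_weekly_gb = full_daily_gb * DAYS_PER_WEEK

# Incremental backups: one initial full backup, then only the changes.
initial_full_gb = VM_COUNT * FULL_SIZE_GB
incremental_daily_gb = VM_COUNT * DAILY_CHANGE_GB
incremental_weekly_gb = incremental_daily_gb * DAYS_PER_WEEK

print(f"Daily full backups: {full_weekly_gb} GB per week")
print(f"Initial full backup: {initial_full_gb} GB, once")
print(f"Increments: {incremental_weekly_gb} GB per week")
```

Even counting the one-time 320 GB full backup, the first week of incremental backups takes 390 GB instead of 2,240 GB.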
How exactly are the changes on VMs tracked? Both VMware and Hyper-V hypervisors have their own technologies to do that: Changed Block Tracking (CBT) in VMware and Resilient Change Tracking (RCT) in Hyper-V. These native technologies provide the quickest way to identify changes, which results in faster backups. If for some reason you cannot use CBT or RCT, backup solutions can fall back on their vendors' own change tracking methods.
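The general idea behind change tracking can be illustrated with a toy model: note which blocks are written as the writes happen, so that the backup does not have to scan the whole disk to find changes. The `TrackedDisk` class below is hypothetical. It is not the CBT or RCT API, and it makes no claim about how either technology is implemented internally.

```python
class TrackedDisk:
    """Toy virtual disk that remembers which blocks changed since the last backup."""

    def __init__(self, block_count):
        self.blocks = [b""] * block_count
        self.dirty = set()  # indexes of blocks written since the last backup

    def write(self, index, data):
        self.blocks[index] = data
        self.dirty.add(index)  # tracking happens at write time

    def take_changes(self):
        """Return the changed blocks and start a new tracking period."""
        changes = {index: self.blocks[index] for index in sorted(self.dirty)}
        self.dirty.clear()
        return changes


disk = TrackedDisk(block_count=8)
disk.write(0, b"boot")
disk.write(3, b"data")
first_backup = disk.take_changes()   # blocks 0 and 3

disk.write(3, b"data v2")
second_backup = disk.take_changes()  # only block 3

print(first_backup)
print(second_backup)
```

The backup application only asks for the changed blocks; it never has to read and compare the six blocks that were not touched.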
However, the title of this article contains the term 'forever-incremental'. What’s the difference? Well, forever means forever. After you make an initial full backup, all subsequent ones will be incremental only, period. So, you will always occupy the least possible space in the backup repository, and backups will run (almost) lightning-fast.
The downside of forever-incremental backup is the time necessary to reconstruct the original VM. During recovery, a backup application must first restore the initial full backup and then "replay" all increments up to the date of recovery. Looking at the example above, a backup application would first need to restore three files with all their data blocks, then replace block 1 with block 5 in File 1, then add blocks 6 and 7 to File 2, and finally delete File 3 (which it first took time to restore, hmm…). Some legacy backup solutions still use this approach. Others take the pain of transforming increment data in the backup repository into synthetic full backups, which again takes time and imposes an extra load on your environment.
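The replay described above can be sketched as follows. The increment layout and the `apply_increment` and `restore` functions are hypothetical, chosen only to mirror the Sunday to Wednesday example; appended blocks are assumed to arrive in order with no gaps.

```python
import copy


def apply_increment(state, increment):
    """Replay one increment on top of the restored state, in place."""
    for name, changes in increment["changed"].items():
        blocks = state.setdefault(name, [])
        for position, block in sorted(changes.items()):
            if position < len(blocks):
                blocks[position] = block  # replace an existing block
            else:
                blocks.append(block)      # block added at the end of the file
    for name in increment["deleted"]:
        state.pop(name, None)


def restore(full_backup, increments):
    """Restore the full backup first, then replay every increment in order."""
    state = copy.deepcopy(full_backup)
    for increment in increments:
        apply_increment(state, increment)
    return state


full_backup = {name: [1, 2, 3, 4] for name in ("File 1", "File 2", "File 3")}
increments = [
    {"changed": {"File 1": {0: 5}}, "deleted": []},        # Monday
    {"changed": {"File 2": {4: 6, 5: 7}}, "deleted": []},  # Tuesday
    {"changed": {}, "deleted": ["File 3"]},                # Wednesday
]

restored = restore(full_backup, increments)
print(restored)
```

Note that File 3 is fully restored and then thrown away, and that the work grows with every increment added to the chain.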
The way to avoid such a long recovery time and unnecessary data manipulation is to use the full synthetic data storage mode. In a nutshell, the full synthetic mode stores each data block only once and keeps references to the data blocks that are necessary to recover a VM as of a particular moment in time. When this mode is used, the backup solution already knows which data blocks constitute the VM at each recovery point, so nothing has to be replayed.
With the full synthetic data storage mode, recovery takes about the same time as recovery from a full backup. So, speaking of the total time of backup and recovery, the combination of forever-incremental backup and full synthetic data storage mode provides the best results.
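A minimal sketch of this idea, under our own assumptions: the repository keeps every block once, and each recovery point is just a list of references to those blocks. The `SyntheticRepository` class, its methods, and the string block labels are all hypothetical and do not describe any specific product's storage format.

```python
class SyntheticRepository:
    """Toy repository: blocks are stored once, recovery points only reference them."""

    def __init__(self):
        self.blocks = {}           # reference -> block data
        self.recovery_points = {}  # label -> {file name: [reference, ...]}
        self._next_ref = 0

    def _store(self, data):
        ref = self._next_ref
        self._next_ref += 1
        self.blocks[ref] = data
        return ref

    def full_backup(self, label, files):
        self.recovery_points[label] = {
            name: [self._store(data) for data in blocks]
            for name, blocks in files.items()
        }

    def incremental_backup(self, label, parent, changed, deleted=()):
        # Start from the parent's references; no block data is copied.
        manifest = {
            name: list(refs)
            for name, refs in self.recovery_points[parent].items()
        }
        for name, changes in changed.items():
            refs = manifest.setdefault(name, [])
            for position, data in sorted(changes.items()):
                ref = self._store(data)  # only new data is written
                if position < len(refs):
                    refs[position] = ref
                else:
                    refs.append(ref)
        for name in deleted:
            manifest.pop(name, None)
        self.recovery_points[label] = manifest

    def restore(self, label):
        # One lookup per block, regardless of how many increments came before.
        manifest = self.recovery_points[label]
        return {
            name: [self.blocks[ref] for ref in refs]
            for name, refs in manifest.items()
        }


repo = SyntheticRepository()
repo.full_backup("Sunday", {
    name: ["b1", "b2", "b3", "b4"] for name in ("File 1", "File 2", "File 3")
})
repo.incremental_backup("Monday", "Sunday", {"File 1": {0: "b5"}})
repo.incremental_backup("Tuesday", "Monday", {"File 2": {4: "b6", 5: "b7"}})
repo.incremental_backup("Wednesday", "Tuesday", {}, deleted=["File 3"])

print(len(repo.blocks))           # blocks physically stored
print(repo.restore("Monday"))
print(repo.restore("Wednesday"))
```

The repository holds 15 blocks in total (12 from the full backup plus 3 from the increments), yet any of the four recovery points can be restored directly from its own list of references.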
- Incremental backup tracks and copies only changes made since the last backup
- VMware has the CBT technology to identify changes, and Hyper-V has had RCT since Hyper-V Server 2016
- Backup vendors' proprietary methods track changes as well, but they work slower
- Forever-incremental backup needs only the initial full backup; after that, only changes are copied
- The downside of forever-incremental backup is that it is necessary to “replay” all changes during recovery, and this may take a significant amount of time
- It is possible to avoid this by using the full synthetic data storage mode