Backup Deduplication Explained
Michael Bose, posted on May 22, 2017
Large virtual infrastructures generate vast amounts of backup data, which drives up spending on storage infrastructure: both the storage appliances themselves and their maintenance. Network administrators are therefore looking for ways to save storage space, and one widely used technique is backup deduplication.
The concept of deduplication is almost as old as the computer itself. Its grandparents are the LZ77 and LZ78 compression algorithms, introduced in 1977 and 1978 respectively, which replace repeated data sequences with references to the original ones. This idea influenced other popular compression methods, the best known of which is DEFLATE, used in the PNG image format and the ZIP file format. Here, however, we are more interested in how deduplication works with VM backups and how exactly it helps save storage space and, with it, infrastructure costs.
In a nutshell: during the VM backup, data deduplication checks if new blocks of data are identical to those already available in the backup repository. If there are duplicates, they will not be copied, and a reference to the existing data blocks will be created. That’s it.
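The idea can be sketched in a few lines of Python. This is a minimal illustration, not how any particular backup product works: the `DedupRepository` class, the 4 KB `BLOCK_SIZE`, and the choice of SHA-256 as the block fingerprint are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for illustration


class DedupRepository:
    """Toy backup repository that stores each distinct block only once."""

    def __init__(self):
        self.blocks = {}   # fingerprint -> block bytes (stored once)
        self.backups = {}  # backup name -> list of fingerprints (references)

    def backup(self, name, data):
        refs = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).hexdigest()
            # Copy the block only if the repository does not hold it yet;
            # otherwise the backup just references the existing copy.
            if fingerprint not in self.blocks:
                self.blocks[fingerprint] = block
            refs.append(fingerprint)
        self.backups[name] = refs

    def restore(self, name):
        # Follow the references to rebuild the original data.
        return b"".join(self.blocks[fp] for fp in self.backups[name])

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


repo = DedupRepository()
image = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE
repo.backup("vm1", image)
repo.backup("vm2", image + b"C" * BLOCK_SIZE)  # shares two blocks with vm1
print(repo.stored_bytes())  # three distinct blocks are stored, not five
```

Both backups can still be restored in full, because every backup keeps an ordered list of references to the blocks it needs.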
How much space can data deduplication save? Here’s an example: the minimum system requirements for Windows Server 2016 state that you need at least 32 GB of free disk space to install the system. If you have ten VMs running this OS, their backups will total at least 320 GB, and this is just a clean operating system, without any applications or databases on it. The odds are that if you need to deploy more than one VM with the same system, you will use a template, which means that initially you will have ten identical machines, and therefore ten sets of duplicate data blocks. In this example, the deduplicated backup needs roughly one tenth of the original space, that is, a 10:1 deduplication ratio. In general, ratios ranging from 5:1 to 10:1 are considered good.
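The arithmetic behind this example can be checked with a short Python snippet. It is an idealized sketch: it assumes the ten VMs are byte-for-byte identical and ignores the (small) space needed for references and metadata. The function name `dedup_ratio` is made up for the example.

```python
def dedup_ratio(logical_gb, stored_gb):
    """How many times smaller the stored data is than the logical data."""
    return logical_gb / stored_gb


vm_count = 10
os_size_gb = 32                       # minimum disk space per VM from the example

logical_gb = vm_count * os_size_gb    # 320 GB of backup data before deduplication
stored_gb = os_size_gb                # ideal case: one copy of each block is kept
ratio = dedup_ratio(logical_gb, stored_gb)
saved_gb = logical_gb - stored_gb

print(f"{logical_gb} GB -> {stored_gb} GB, ratio {ratio:.0f}x, saved {saved_gb} GB")
```

In practice the VMs diverge as soon as applications and data are added, so the real figure depends on how much the machines still have in common.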
Backup Deduplication Techniques
Backup deduplication techniques can be categorized along the following dimensions:
- Where deduplication is performed
- When deduplication is performed
- How duplicates are identified
Backup deduplication can be performed on either the source or the target side; these techniques are called source-side deduplication and target-side deduplication, respectively.
Source-side deduplication decreases network load because less data is transferred during the backup. However, it requires a deduplication agent to be installed on each VM. Another drawback is that source-side deduplication may slow down the VMs because of the calculations required to identify duplicate data blocks.
Target-side deduplication first transfers the data to the backup repository and then performs deduplication there. The heavy computing tasks are performed by the deduplication software on the target side rather than on the VMs being backed up.
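The trade-off between the two can be illustrated with a small Python sketch that counts how many bytes cross the network in each case. It is a simplified model under stated assumptions: the repository is a plain dictionary, blocks are a fixed 4 KB, SHA-256 is the fingerprint, and the few bytes needed to send the fingerprints themselves are ignored. All function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for the example


def split_blocks(data):
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def fingerprint(block):
    return hashlib.sha256(block).hexdigest()


def source_side_backup(data, repository):
    """Hash on the source; send only blocks the repository does not have."""
    sent = 0
    for block in split_blocks(data):
        fp = fingerprint(block)        # CPU cost is paid on the source VM
        if fp not in repository:
            repository[fp] = block     # only new blocks cross the network
            sent += len(block)
    return sent


def target_side_backup(data, repository):
    """Send everything; the target hashes and deduplicates."""
    sent = 0
    for block in split_blocks(data):
        sent += len(block)             # every block crosses the network
        fp = fingerprint(block)        # CPU cost is paid on the target
        repository.setdefault(fp, block)
    return sent


data = b"x" * (3 * BLOCK_SIZE) + b"y" * BLOCK_SIZE  # 4 blocks, 2 distinct
source_repo, target_repo = {}, {}
print(source_side_backup(data, source_repo))  # bytes sent with source-side dedup
print(target_side_backup(data, target_repo))  # bytes sent with target-side dedup
```

Both repositories end up holding the same two blocks; the difference is where the hashing happens and how much data travels over the network.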
Backup deduplication can be inline or post-processing. Inline deduplication checks for duplicates before the data is written to the backup repository. This technique requires less storage in the repository because it removes redundancies from the backup data stream, but it results in longer backup times because deduplication happens during the backup job.
Post-processing deduplication processes the data after it has been written to the backup repository. This approach requires more free space in the repository, but backups run faster because all deduplication work is done afterwards.
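The difference in storage requirements can be sketched in Python. This is a simplified model, not a real backup pipeline: blocks are passed in as a list, SHA-256 serves as the fingerprint, and "peak" only measures the space needed to land the backup before any cleanup. The function names are assumptions made for the example.

```python
import hashlib


def fp(block):
    return hashlib.sha256(block).digest()


def inline_backup(blocks):
    """Deduplicate each block before it is written to the repository."""
    store = {}
    for block in blocks:
        store.setdefault(fp(block), block)   # hashing happens during the backup
    peak_bytes = sum(len(b) for b in store.values())
    return store, peak_bytes


def post_process_backup(blocks):
    """Write everything first, then deduplicate in a second pass."""
    landing_area = list(blocks)              # raw write, no hashing during backup
    peak_bytes = sum(len(b) for b in landing_area)
    store = {}
    for block in landing_area:               # deduplication runs afterwards
        store.setdefault(fp(block), block)
    return store, peak_bytes


blocks = [b"a" * 1024, b"b" * 1024, b"a" * 1024, b"a" * 1024]
inline_store, inline_peak = inline_backup(blocks)
post_store, post_peak = post_process_backup(blocks)
print(inline_peak, post_peak)  # space needed to land the backup in each mode
```

The final contents of the repository are the same in both modes; what changes is when the hashing cost is paid and how much space is needed in the meantime.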
The most common methods of identifying duplicates are the hash-based and the modified hash-based ones. With the hash-based method, the deduplication software divides data into blocks of fixed or variable length and calculates a hash for each of them using a cryptographic hash function such as MD5, SHA-1, or SHA-256. Each of these functions yields a fingerprint that is, for practical purposes, unique to the contents of a block, so blocks with identical hashes are considered identical. The drawback of this method is that it may require significant computing resources, especially for large backups.
The modified hash-based method uses simpler checksum algorithms such as CRC, which in its 16-bit variant produces only 16 bits (compared to 256 bits for SHA-256). Because such short checksums collide often, blocks with identical checksums are then compared byte by byte, and only if they match completely are they considered identical. This method is a bit slower than the hash-based one, but it requires fewer computing resources.
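The modified hash-based approach can be sketched as follows. This is an illustration under stated assumptions, not the implementation of any specific product: it uses `binascii.crc_hqx` from the Python standard library as the 16-bit CRC, keeps all blocks in memory, and the class name `WeakHashDedup` is made up for the example.

```python
import binascii
from collections import defaultdict


class WeakHashDedup:
    """Duplicate detection with a 16-bit CRC plus byte-by-byte confirmation."""

    def __init__(self):
        self.buckets = defaultdict(list)  # CRC value -> distinct blocks with that CRC
        self.unique_blocks = 0

    def add(self, block):
        """Return True if the block is new, False if it is a duplicate."""
        crc = binascii.crc_hqx(block, 0)  # cheap 16-bit checksum
        for candidate in self.buckets[crc]:
            # Same checksum is only a hint; confirm with a full comparison.
            if candidate == block:
                return False
        self.buckets[crc].append(block)
        self.unique_blocks += 1
        return True


dedup = WeakHashDedup()
print(dedup.add(b"hello"))  # first time this block is seen
print(dedup.add(b"hello"))  # same bytes again
print(dedup.add(b"world"))  # different bytes
```

A 16-bit checksum has only 65,536 possible values, so different blocks regularly share a checksum. The byte-by-byte comparison is what keeps such collisions from being treated as duplicates.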
NAKIVO Backup & Replication uses target-side, post-processing deduplication with modified hash-based duplicate detection. Depending on the size and structure of the virtual environment, this can reduce backup size by up to ten times.