June 2, 2017
GFS Retention Policy Explained
First, let’s recap what a retention policy is and why we need it. Ideally, the goal of backup is the ability to recover data from any point in time. The straightforward way to do that is to keep periodic backups, usually daily, and never delete them. However, even with space-saving techniques like forever-incremental backups, synthetic backups, data compression, and deduplication, this approach requires ever-growing storage capacity, and hardly any company can afford that. That is why backup retention policies, or backup rotation schemes, exist.
A backup retention policy pursues two goals: minimizing storage space and maximizing the number of recovery points. Simply put, the task is to get the most recovery points using the least storage space.
There are several backup rotation schemes of differing complexity and efficiency. The simplest one is 'first in, first out' (FIFO): when the backup media runs out of space, the oldest backup is deleted, and the new one is written in its place. FIFO’s merit is its simplicity; its biggest drawback is that it can hold only a finite number of backups. Depending on how frequently you back up your VMs and how big your backup repository is, your backups cover a relatively short time interval. However, this time interval is covered in full.
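The FIFO behaviour described above can be sketched in a few lines of Python. This is only an illustration: the function name `fifo_rotate`, the `capacity` parameter, and the `day-N` labels are assumptions made for the example, and a real repository is limited by bytes rather than by a fixed number of backups.

```python
from collections import deque

def fifo_rotate(backups, capacity):
    """Return the backups still stored after writing `backups` in order
    to a repository that holds at most `capacity` of them (hypothetical)."""
    # A deque with maxlen silently drops its oldest item when a new one
    # is appended to a full deque, which is exactly FIFO rotation.
    store = deque(maxlen=capacity)
    for backup in backups:
        store.append(backup)
    return list(store)

# Ten daily backups written to a repository with room for seven.
kept = fifo_rotate([f"day-{n}" for n in range(1, 11)], capacity=7)
print(kept)
```

Only the seven most recent backups survive, but within that window every day is covered.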
Do all companies need such complete backup coverage? Of course, there are some, like financial or government institutions, where even a small data loss can be extremely costly. No one would like their bank account or social security record to be wiped out because of a hardware failure in the datacenter. That is why such organizations spend large sums on backup storage, tape archives, and so on.
However, most businesses are not so sensitive to data loss, so they can implement a backup rotation scheme that does not require storing every daily backup for a whole year. Such a backup retention policy strikes a sensible balance between data recoverability and the cost of the backup infrastructure. One of the most commonly used is the Grandfather-Father-Son (GFS) rotation scheme.
Who are all these relatives? As in a human family, a son is the youngest, a father is older, and a grandfather is the oldest. In the backup world, a son is the most recent backup relative to a given moment, and a grandfather is the most distant. Usually, a son is a daily backup, a father is a weekly one, and a grandfather is a monthly one. However, you can add more 'relatives', like hourly, quarterly, or annual backups. For example, Apple’s macOS has a built-in Time Machine backup utility that uses a GFS-like rotation scheme: it keeps hourly backups for the past 24 hours, daily backups for the past month, and weekly backups for all earlier months, so the 'son' is an hourly backup and the oldest tier is a weekly one.
The classic GFS scheme uses daily backups as 'sons', weekly backups as 'fathers', and monthly backups as 'grandfathers'. The initial full backup made on Monday becomes the first 'father', and the incremental daily backups that follow are 'sons'. The last backup of each week becomes the next 'father'.
The 'sons' are rotated according to the FIFO scheme: the oldest 'son' is replaced with the new incremental backup, and the cycle repeats. The last backup of the month becomes a 'grandfather'. After that, the 'fathers' begin to rotate according to the FIFO scheme as well.
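To make the rotation rules concrete, here is a minimal Python sketch of which restore points a GFS policy would keep. It rests on several assumptions that are not part of the scheme itself: one backup runs every day, Sunday's backup closes the week, the backup on the last calendar day closes the month, and the retention counts (7 'sons', 4 'fathers', 12 'grandfathers') are illustrative defaults. It also ignores the difference between full and incremental backups and does not treat the initial Monday backup as a 'father'.

```python
from datetime import date, timedelta

def gfs_retained(backup_dates, sons=7, fathers=4, grandfathers=12):
    """Map each retained backup date to the tier that keeps it.

    Hypothetical helper: assumes one backup per day, Sunday as the last
    backup of the week, and the last calendar day as the last of the month.
    """
    dates = sorted(backup_dates, reverse=True)           # newest first
    weekly = [d for d in dates if d.weekday() == 6]      # Sundays
    monthly = [d for d in dates
               if (d + timedelta(days=1)).month != d.month]  # month ends
    retained = {}
    # Each tier keeps only its newest N backups (FIFO within the tier).
    # Older tiers are applied last, so their label wins on overlap.
    for tier, candidates, keep in (("son", dates, sons),
                                   ("father", weekly, fathers),
                                   ("grandfather", monthly, grandfathers)):
        for d in candidates[:keep]:
            retained[d] = tier
    return retained

# Daily backups from Monday, April 3, 2017 through Friday, June 30, 2017.
start, end = date(2017, 4, 3), date(2017, 6, 30)
all_backups = [start + timedelta(days=n) for n in range((end - start).days + 1)]
retained = gfs_retained(all_backups)
for d in sorted(retained):
    print(d.isoformat(), retained[d])
```

Under these assumptions, 89 daily backups shrink to 12 restore points at the end of June: the last seven days, the four most recent Sundays, and the month-end backups of April, May, and June (some dates fall into more than one tier).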
In the figure below, you can see which backups are available as of the end of June if we started backing up a VM in April: the blue items represent the backups that are available, and the gray ones represent those that are not.
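One of the disadvantages of the GFS scheme is that older backups become less granular. For instance, if you created a file on, say, the Monday of the second week of June and deleted it the next day, then by the end of June it is lost irretrievably: the daily backups from that week have been rotated out, and the weekly backup that remains was taken after the deletion.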
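The loss of granularity can be checked with a short Python sketch. The helper `recoverable` and the list of restore points are hypothetical: the list assumes an end-of-June 2017 state with seven daily backups, Sunday weekly backups, and month-end monthly backups, and each backup is assumed to capture the state at the end of its day.

```python
from datetime import date

def recoverable(created, deleted, restore_points):
    """True if some restore point was taken while the file existed.

    Assumes a backup dated D captures the state at the end of day D, so a
    file created on day C and deleted on day X appears in backups C..X-1.
    """
    return any(created <= point < deleted for point in restore_points)

# Hypothetical restore points left at the end of June 2017.
restore_points = (
    [date(2017, 6, day) for day in range(24, 31)]          # daily 'sons'
    + [date(2017, 6, 4), date(2017, 6, 11), date(2017, 6, 18)]  # weekly
    + [date(2017, 4, 30), date(2017, 5, 31)]               # monthly
)

# Created on Monday, June 12 and deleted the next day: no point covers it.
print(recoverable(date(2017, 6, 12), date(2017, 6, 13), restore_points))
# The same file kept until June 20 is caught by the June 18 weekly backup.
print(recoverable(date(2017, 6, 12), date(2017, 6, 20), restore_points))
```

The first check prints `False` and the second prints `True`: a file has to survive until the next retained weekly or monthly backup to remain recoverable once the daily backups have rotated out.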
Depending on your organization's data protection policy, you can add hourly, quarterly, or annual backups to the GFS rotation scheme. Combined with other space-saving techniques like forever-incremental backups, synthetic backups, and backup repository compression and deduplication, GFS provides reasonable data protection without spending a fortune on backup storage infrastructure.