Disaster Recovery vs. High Availability vs. Fault Tolerance

When it comes to keeping an organization’s IT infrastructure up and running 24/7, there is still some confusion between the three core terms used in this domain: high availability (HA), fault tolerance (FT), and disaster recovery (DR). The terms are often used interchangeably since, on the surface, all of them are aimed at achieving IT system continuity. It is important to note, however, that each of these terms has its own specific definition, methodology, and role.

In this blog post, we define what high availability, fault tolerance, and disaster recovery mean in practice, explore how the terms overlap, and explain why they are important to implement.

High Availability

High availability is a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher-than-normal period.

HA is a purely technology-centric concept: it is achieved through system design rather than through policies or procedures. The goal of an HA design is to deliver 99.999% operational uptime, the so-called “five nines”. Nevertheless, it is important to emphasize that HA is not expected to deliver 100% uptime: a small amount of downtime (up to 5.26 minutes per year) is acceptable.
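
To see where the 5.26-minute figure comes from, here is a quick back-of-the-envelope calculation in Python (a minimal sketch; the function name is ours for illustration, not part of any tool):

# Annual downtime budget for a given availability target
minutes_per_year = 365.25 * 24 * 60  # ~525,960 minutes

def allowed_downtime_minutes(availability):
    """Return the annual downtime budget for an availability fraction."""
    return minutes_per_year * (1 - availability)

print(allowed_downtime_minutes(0.99999))  # "five nines" -> ~5.26 minutes
print(allowed_downtime_minutes(0.999))    # "three nines" -> ~526 minutes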

How does high availability work?

The “five nines” objective is achieved by eliminating single points of failure in a system. To do this, you implement redundant components and failover mechanisms that are configured to take over workloads without human intervention whenever a primary component fails.

In virtualization, high availability can be designed with the help of clustering technologies. For example, when one of the hosts or virtual machines (VMs) within a cluster fails, the affected workload is taken over by another VM, maintaining the proper performance of the system.
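
To illustrate the failover principle, here is a simplified, hypothetical sketch in Python. The node dictionaries and the check_health() probe are assumptions for illustration only; real clusters rely on hypervisor-level heartbeats:

import time

def check_health(node):
    """Placeholder health probe; a real cluster would use network
    heartbeats (assumption for illustration)."""
    return node.get("healthy", False)

def monitor_cluster(primary, standby, interval=5):
    """Promote the standby VM if the primary stops responding."""
    while True:
        if not check_health(primary):
            print(f"Primary {primary['name']} failed; "
                  f"failing over to {standby['name']}")
            standby["active"] = True   # standby takes over the workload
            break
        time.sleep(interval)           # wait before the next heartbeat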

When is high availability important?

A well-thought-out HA architecture is important for any business that strives to minimize downtime. According to 2017 statistics, 24% of businesses worldwide reported that one hour of downtime cost them between 301 and 400 thousand USD. At that rate, even the acceptable amount of annual downtime, 5.26 minutes, costs a business up to 35 thousand USD.
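
The arithmetic behind that estimate, as a quick sketch:

hourly_cost = 400_000      # upper bound of the reported range, USD
downtime_minutes = 5.26    # "five nines" annual downtime budget
print(hourly_cost / 60 * downtime_minutes)  # ~35,067 USD per year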

Besides significant financial losses, downtime can have other serious implications, such as productivity loss, an inability to deliver services on time, and a damaged business reputation. Highly available systems help avoid such scenarios by handling failures automatically and promptly.

What makes a system highly available?

While redundant components are a necessary condition for high availability, redundancy alone is not enough for a system to be considered highly available. A highly available system is one that combines redundant components with mechanisms for failure detection and workload redirection, such as a load balancer or a hypervisor.
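
As a rough illustration of the failure-detection and redirection role a load balancer plays, here is a toy round-robin sketch in Python (the data structures are assumptions for illustration, not any product’s API):

import itertools

def round_robin(backends):
    """Yield healthy backends in rotation, skipping failed ones.
    Assumes at least one backend is healthy."""
    for backend in itertools.cycle(backends):
        if backend["healthy"]:
            yield backend

servers = [{"name": "web1", "healthy": True},
           {"name": "web2", "healthy": False},  # detected as failed
           {"name": "web3", "healthy": True}]

picker = round_robin(servers)
for _ in range(4):
    print(next(picker)["name"])  # web1, web3, web1, web3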

Fault Tolerance

Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of one or more of its components.

In simple terms, fault tolerance is a stricter version of high availability. HA focuses on delivering the minimal possible downtime, while FT goes further by aiming for zero downtime. In the fault-tolerant model, however, a system’s ability to deliver high performance during a failure is not the top priority: the system is expected to remain operational, possibly at a reduced performance level.

How does fault tolerance work?

Like high availability, fault tolerance works on the principle of redundancy. Such redundancy can be achieved by running the same application simultaneously on two servers, which enables one server to instantly take over if the other fails.

In virtualization, this redundancy is achieved by keeping and running an identical copy of a given virtual machine on a separate host. Any change or input that takes place on the primary VM is duplicated on the secondary VM. This way, if the primary VM fails or becomes corrupted, fault tolerance is ensured through the instant transfer of workloads from the primary VM to its copy.
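
To make the mirroring idea concrete, here is a toy Python sketch of lockstep duplication (an illustrative model only; real hypervisor FT mirrors VM state at a much lower level):

class MirroredStore:
    """Toy model of FT-style redundancy: every write is applied to the
    primary and duplicated on the secondary, so the secondary can take
    over instantly with no lost state."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}

    def write(self, key, value):
        self.primary[key] = value
        self.secondary[key] = value  # duplicated in lockstep

    def read(self, key, primary_failed=False):
        source = self.secondary if primary_failed else self.primary
        return source[key]

store = MirroredStore()
store.write("orders", 42)
print(store.read("orders", primary_failed=True))  # 42, no data lost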

When is fault tolerance important?

A fault-tolerant design is crucial to implement if your IT system cannot tolerate any downtime. If critical applications support your business operations, and even the slightest downtime translates into irrecoverable losses, you should configure your IT components with FT in mind.

What is a fault tolerant system?

A fault-tolerant system is a system that includes two tightly coupled components that mirror each other, providing redundancy. This way, if the primary component goes down, the secondary one is always in sync and immediately ready to take over.

Disaster Recovery

Disaster recovery involves a set of policies, tools, and procedures that enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster.

What is disaster recovery?

Unlike high availability and fault tolerance, disaster recovery deals not with single-component failures but with catastrophic events that render entire IT infrastructures unavailable. Since DR is both data- and technology-centric, its main objective is to recover data and bring infrastructure components back up within the shortest possible time frame after an unpredicted event.

How does disaster recovery work?

Normally, DR requires a secondary location where you can restore your critical data and workloads (whether entirely or partially) in order to resume business operations after a disruptive event. To transfer workloads to the remote location, you need a proper disaster recovery solution: one that can perform the failover operation in a timely manner and with little to no input on your part, allowing you to meet your designated recovery time objectives (RTOs).
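
To illustrate how an orchestrated failover relates to the RTO, here is a hypothetical Python sketch; the step functions are placeholders for illustration, not the API of any DR product:

import time

def run_dr_failover(steps, rto_minutes):
    """Execute a DR runbook step by step and compare the elapsed
    recovery time against the recovery time objective (RTO)."""
    start = time.monotonic()
    for step in steps:
        print(f"Running step: {step.__name__}")
        step()
    elapsed = (time.monotonic() - start) / 60
    status = "met" if elapsed <= rto_minutes else "missed"
    print(f"Recovered in {elapsed:.1f} min (RTO of {rto_minutes} min {status})")

def stop_primary_site(): pass   # placeholder actions, assumed
def start_replica_vms(): pass   # for illustration only
def redirect_dns(): pass

run_dr_failover([stop_primary_site, start_replica_vms, redirect_dns],
                rto_minutes=15)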

What are the components of disaster recovery?

Unlike HA and FT, disaster recovery is a much broader and more complex concept: it refers to a strategy with a comprehensive set of components, including risk assessment, planning, dependency analysis, remote site configuration, staff training, testing, and automation setup. Disaster recovery goes beyond high availability and fault tolerance, but it can and should include both in its technological design.

When is disaster recovery important?

The term “disaster” refers not only to a natural catastrophe but to any kind of disruptive event that leads to significant downtime, such as a cyberattack, power outage, human error, or software failure. Such events can take place anywhere at any time, making organizations of all types and sizes potential victims. While in most cases disasters are impossible to predict or avoid, organizations can and should take measures to strengthen their disaster recovery preparedness and regularly optimize their DR strategies.

NAKIVO Backup & Replication for Disaster Recovery

NAKIVO Backup & Replication is a fast, reliable, and affordable solution that combines data protection and disaster recovery functionality in a single piece of software. Its Site Recovery functionality was designed to simplify and automate disaster recovery operations.

If you have a remote site configured, as DR best practices require, you can fully rely on NAKIVO Backup & Replication as your disaster recovery tool. The Site Recovery functionality is easy to use and configure, yet it allows you to build complex recovery workflows.

You can combine up to 200 actions in one workflow (job) to fit different disaster scenarios and serve different purposes, including monitoring, data center migration, emergency failover, planned failover, and failback. In the event of a disaster, any of the created workflows can be put into action immediately, in a single click, allowing businesses to achieve the shortest possible time to recovery.

With Site Recovery in place, you can perform automated, non-disruptive DR testing. This way, you can verify that your site recovery workflows are valid and reflect all recent changes in your IT infrastructure, excluding any possible weaknesses before an actual disaster hits.

Statistics show that the majority of IT professionals regard modern DR solutions as unaffordable luxuries rather than a necessary element of their data protection and recovery strategy. NAKIVO has made robust disaster recovery affordable for many businesses by offering NAKIVO Backup & Replication with Site Recovery at a fraction of the cost of competing products.

Conclusion

While high availability and fault tolerance are exclusively technology-centric, disaster recovery encompasses much more than just software and hardware. HA and FT focus on addressing isolated failures in an IT system. DR, by contrast, deals with failures of a much bigger scope, as well as the consequences of those failures. Incorporating high availability or fault tolerance cannot, on its own, ensure protection from disasters, but both can efficiently complement a disaster recovery strategy.

NAKIVO Backup & Replication with Site Recovery is a turn-key solution that provides integrated protection against data loss. By incorporating the solution into your environment, you can ensure quick recovery across multiple sites under all circumstances.
