Disaster Recovery

Category: General
Created: 2021-03-10

What is a Disaster Recovery Plan?

A disaster recovery plan is concrete preparation for an event you hope will never occur. If it does occur, you are at least as well prepared as possible (you will have enough other problems anyway).

What the plan looks like cannot be stated in general terms; it depends on the individual risks. We therefore focus below on typical hosting and web applications and provide some insight into our own disaster recovery plan.

Creating a disaster recovery plan

1. Identify your assets

Create a list of the assets that are business-critical for the operation of your services. Think about which service your customers are paying for and how long an outage would be tolerable. This is not just about physical objects (e.g. servers) but also about domains, external services and data.

Consider indirect dependencies as well as direct ones. For example, how long could you continue to operate your business if, for technical reasons, you could not receive payments for a while?

Then sort this list by priority - usually according to the tolerable downtime.
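
As an illustration, such a prioritized list can be kept in a small script. The following is only a minimal sketch: the asset names and downtime values are invented examples, and the names Asset and by_priority are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    tolerable_downtime_h: float  # hours of outage the business can absorb


# Hypothetical example entries; replace them with your own inventory.
ASSETS = [
    Asset("customer web servers", 4),
    Asset("payment processing", 48),
    Asset("company domains", 1),
    Asset("internal wiki", 168),
]


def by_priority(assets):
    """Return the assets ordered from least to most tolerable downtime."""
    return sorted(assets, key=lambda a: a.tolerable_downtime_h)


if __name__ == "__main__":
    for asset in by_priority(ASSETS):
        print(f"{asset.tolerable_downtime_h:>6.0f} h  {asset.name}")
```

The assets with the shortest tolerable downtime end up at the top of the list and should be addressed first in the following steps.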

2. Identify your locations / suppliers

For each item on the list, name the location or the supplier. This makes it easier to see how badly you are affected if a location or supplier fails.
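
To see at a glance what is lost when a location or supplier fails, the list can be inverted. This is a sketch under assumptions: all asset and supplier names are hypothetical, and affected_by is a helper name chosen for this example.

```python
from collections import defaultdict

# Hypothetical mapping of asset -> location or supplier.
ASSET_SUPPLIER = {
    "customer web servers": "datacenter-a",
    "on-site backups": "datacenter-a",
    "off-site backups": "datacenter-b",
    "company domains": "registrar",
    "payment processing": "payment-provider",
}


def affected_by(asset_supplier):
    """Invert the mapping: supplier -> sorted list of assets lost if it fails."""
    result = defaultdict(list)
    for asset, supplier in asset_supplier.items():
        result[supplier].append(asset)
    return {supplier: sorted(assets) for supplier, assets in result.items()}


if __name__ == "__main__":
    for supplier, assets in sorted(affected_by(ASSET_SUPPLIER).items()):
        print(f"{supplier}: {', '.join(assets)}")
```

In this invented example, a failure of datacenter-a would take out both the web servers and the on-site backups, which is exactly the kind of coupling this step is meant to reveal.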

3. Outline failure scenarios

Now work out the worst possible scenario for each item. And: think big. Do not plan for individual server failures, but for the complete failure of a location or supplier:

Major event: direct (fire/explosion/…) and indirect (the neighboring chemical plant/nuclear power plant/… burns)
Bankruptcy: a supplier becomes insolvent, all services are stopped immediately
Seizure: a customer is investigated and all systems are seized

These cases are unquestionably dramatic. What matters, however, is not how probable they are but that they are possible in principle.

4. Develop measures

This is the most important step: which preparatory measures would let you recover quickly from each possible incident? Prevention is only of secondary importance: every data center has a fire protection system, but it can still burn down.

Think about the practical consequences too. In many cases, the provider bears no liability, or only limited liability, for its customers’ data (if only because the cost of IT liability insurance would otherwise rise immeasurably). But after a data loss, would all customers really be able to simply re-upload their WordPress websites, or is it more likely that they would cancel and switch providers?

Practical recommendations

Off-Site Backup

It is always a good idea to keep the most critical business data secure in a separate location. The following applies:

the data must be secure (for heaven’s sake no unencrypted backups on external hard drives!)
the backup must be automated
the backup must be monitored

Also think of the documentation requirements under the GDPR, including: where is the data stored, who has access to it, and how is it ensured that data is also deleted from backups?

Infrastructure documentation

For each system, document how it is configured and which infrastructure requirements must be taken into account. Depending on the application, this can also be automated (Ansible / Chef / Puppet / SaltStack).

Develop products

At some point, the question arises as to who should bear the costs of the additional effort (e.g. off-site backup, standby systems).

If you factor this into your own prices, then you can also advertise it accordingly. Alternatively, you can develop separate products (e.g. off-site backup, managed servers with mirrored standby servers, geo-redundant storage, …) and sell them as add-ons for a surcharge.

Our Disaster Recovery Plans

We use on-site backups (i.e. at the same location) to be able to quickly restore large amounts of data in the event of a hardware failure. Additional off-site backups are automatically saved to an external location.

For the backups themselves, we use Borg. The backups are of course all encrypted, and access to the backup servers is only possible with two-factor authentication.

The backup intervals vary by system: the development environment and code management have completely different requirements than the test server.

A monitoring system (Nagios / Icinga) checks the age of the backups and warns us if a backup has not run successfully.
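
A backup age check of this kind could look roughly like the following. This is a minimal sketch, not our actual check: the function name check_backup_age and the thresholds are assumptions, and how the timestamp of the last backup is obtained is left open. The only fixed convention used is the Nagios / Icinga plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
from datetime import datetime, timedelta, timezone

# Standard Nagios / Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_backup_age(last_backup, now, warn_after, crit_after):
    """Return (exit_code, message) depending on the age of the last backup."""
    age = now - last_backup
    hours = age.total_seconds() / 3600
    if age >= crit_after:
        return CRITICAL, f"CRITICAL - last backup is {hours:.1f} h old"
    if age >= warn_after:
        return WARNING, f"WARNING - last backup is {hours:.1f} h old"
    return OK, f"OK - last backup is {hours:.1f} h old"


if __name__ == "__main__":
    # Hypothetical values: a daily backup that last ran 30 hours ago.
    now = datetime.now(timezone.utc)
    code, message = check_backup_age(
        last_backup=now - timedelta(hours=30),
        now=now,
        warn_after=timedelta(hours=26),
        crit_after=timedelta(hours=50),
    )
    print(code, message)
```

A real plugin would print the message and exit with the returned code so that the monitoring system can evaluate it. The thresholds should follow the backup interval of the respective system, with some slack for the runtime of the backup itself.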

Last but not least, we regularly try to run through a complete disaster recovery process: a randomly selected system must be restored, without advance notice, exclusively from the backup.
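
The random selection for such a drill can be as simple as the following sketch. The system names are hypothetical, and pick_restore_candidate is a name invented for this example; the optional rng parameter only exists so that the selection can be reproduced in tests.

```python
import random

# Hypothetical list of systems that are covered by backups.
SYSTEMS = ["web-01", "db-01", "mail-01", "git-01"]


def pick_restore_candidate(systems, rng=None):
    """Pick one system at random for the next restore drill."""
    if rng is None:
        rng = random.Random()
    return rng.choice(list(systems))


if __name__ == "__main__":
    print("Restore drill target:", pick_restore_candidate(SYSTEMS))
```

The point of drawing the system at random is that nobody can prepare for the drill in advance, so the restore is tested under realistic conditions.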