With the transition to a new operating system and a whole new backup technology, we need to review and codify our backup policy.
All files that are created by us and our users, or that are otherwise both valuable and hard to replace, should be backed up.
Our primary goal for backups is to defend against catastrophic total loss of one or more servers. An important additional goal is short-term file restoration, i.e. "I deleted an important file; can I get it back?"
Policies, and technological adaptations to those policies, should be simple so people can remember them. However, inevitably there will be special needs, and we need to balance functionality in daily use, ease of backing up the files, and ease of restoring them when needed.
Backup copies must survive for an agreed-upon time, and then go away on schedule. For legal reasons it's important for us to be able to say truthfully that we have backups to this date and no farther. And it's important that we document and publish our backup retention schedule.
Backups have two phases, both of which must be considered carefully: getting the files onto the backup media, and getting them from the media onto a new server in case of a blown disc. The latter step, we hope, will be very rare, but it has to work with high reliability when needed.
There is a lot of material that should not be backed up, either because it is easily reinstalled or because it is not valuable. We should work hard to keep backed-up and non-backed-up files separate, and to define policies that help us do that.
Traditionally, backups have been done of entire filesystems, because that's how the UNIX dump program works. Amanda under Linux can, and in our case does, dump named directories independently of filesystem boundaries.
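For example, an entry in Amanda's disklist names a host and a path rather than a whole device; the dumptype below comes from Amanda's sample configuration and is illustrative, not our actual setup:

    # Illustrative disklist entries: host, directory, dumptype.
    sunset   /h1      comp-user-tar
    malibu   /m1/h1   comp-user-tar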
There are two classes of backed-up files: large directories containing only backed-up files (example: a home directory partition), and relatively small special cases which, at least currently, are hiding among much larger volumes of non-backed-up files (though in a few cases the tail has started to wag the dog). Let's call these outliers. Examples of outliers are /etc (configuration files) on each machine, the web directories on special-purpose web servers, and e-mail directories. Strategies for backing up outliers need to be different, effective, and easy to administer.
This is jimc's proposed policy for backups:
We back up directories named /h[0-9]. On homedir servers these will be entire filesystems filled with home directories. When a non-server has a /h directory that we consider valuable, we back that up too. This does not include non-backed-up private data areas on workstations. We'll back up a directory not named /h[0-9] only after major arm-twisting.
On machines that are not homedir servers, it's common for /m1/h1 to exist. To the extent feasible we put outliers there, and we back it up without backing up all of /m1. It is normally reached under the name /h1 via a symlink or bind mount, and it counts as a /h[0-9] directory, although this policy does not specify whether it's dumped under the /h1 or the /m1/h1 name.
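For reference, either form of aliasing is a one-liner; a minimal sketch (the fstab bind syntax assumes a Linux 2.4 or later kernel):

    # bind mount, via a line in /etc/fstab:
    /m1/h1   /h1   none   bind   0 0
    # or the symlink equivalent:
    ln -s /m1/h1 /h1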
Outliers not already in a /h directory are copied, as a tgz file, to a /h directory, which is backed up. On non-homedir servers, a suitable subdirectory within /m1/h1 is the recommended place for these tgz files. As far as possible, outlier backups are automated by a special script which runs just before the /h directories are backed up.
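The copying step for one outlier amounts to this (a sketch; the destination subdirectory name is illustrative):

    # Copy one outlier into the backed-up area as a tgz file:
    tar czf /m1/h1/outliers/etc.tgz -C / etc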
At EOQ (end of quarter) we also back up some /m directories: /m1 on Sunset and Malibu, and possibly other hard-to-install software as valuable material appears. It's a matter for debate how sanguine we're going to be about software installation directories, and how we can dump them without having to dump all of the /m's every day.
Every directory that should be backed up will have a file called BACKMEUP; every /h[0-9] directory that should not be backed up will have a file called DONTBACKUP. Outlier directories that should be compressed and copied to a /h directory will have a file called BACKUPTAR (which signals DONTBACKUP also; you need one or the other). The contents of these files are not a policy issue, but in the case of BACKUPTAR the file is a convenient place for a list of filenames to exclude from the dump.
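A minimal sketch of how the marker files could drive the outlier script (the search roots and the destination are illustrative):

    #!/bin/sh
    # Tar up every directory flagged BACKUPTAR into the backed-up area.
    DEST=/m1/h1/outliers                   # assumed destination
    for marker in /etc/BACKUPTAR /m1/*/BACKUPTAR; do
        [ -f "$marker" ] || continue
        dir=$(dirname "$marker")
        name=$(echo "$dir" | tr / _ | sed 's/^_//')
        # Per the policy above, BACKUPTAR doubles as the exclude list.
        tar czf "$DEST/$name.tgz" --exclude-from="$marker" -C / "${dir#/}"
    done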
Currently, mail is backed up in two ways: as an outlier in a directory sunset:/m1/mailbackup, and by direct backups of /m1 on each homedir server. For political reasons we can't put either the system mailbox or the backup copy on the same filesystem as the user's homedir, i.e. /h1, where it would count towards the user's disc quota. Nor do we (jimc) want to back up /m1 just to get the mail. Therefore we will continue to back up the mail across the net to Sunset, but the backup directory will be on sunset:/h1 so it actually gets copied to the backup media, and the mail backup will be done just before filesystem backups. Mail acts as an outlier, but the manner of backing it up is special.
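A sketch of that nightly copy, assuming rsync over ssh; the spool path and the per-host subdirectory are assumptions:

    # Run on each homedir server just before dumps:
    rsync -a -e ssh --delete /var/spool/mail/ sunset:/h1/mailbackup/$(hostname)/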
Currently, /etc is backed up in two ways: as an outlier in a directory sunset:/h1/rootbackup, and by direct backups of the root (servers only). The issues are the same as for mail: we don't need to back up 3 to 5 GB of root just to catch /etc and possibly other outliers in /var.
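One plausible form of that copy (a sketch; the compression and file naming are illustrative):

    # Push /etc to the central root backup area:
    tar czf - -C / etc | ssh sunset "cat > /h1/rootbackup/$(hostname)-etc.tgz"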
Additional outliers identified so far are /m1/custom on all hosts (it should be moved to /h1 or /m1/h1), the namedaemon directory on slave servers (Sunset's is in /h1), and the Certificate Authority, which needs a special procedure.
Let's add a step to dump execution in which the scripts /usr/math/lib/daily.d/P* are executed on all hosts just before dumps occur (initiated from Ulanda). These P-series scripts do the outlier backups: /etc on all hosts, mail on homedir servers, and additional outliers where needed.
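A sketch of the driver for that step, as run from Ulanda (the host list file and the use of ssh are assumptions):

    #!/bin/sh
    # Run the P-series pre-dump scripts on every backed-up host.
    for host in $(cat /usr/math/lib/backup-hosts); do   # assumed host list
        ssh "$host" 'for p in /usr/math/lib/daily.d/P*; do
            [ -x "$p" ] && "$p"
        done'
    done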
To avoid the inevitable falling through the cracks, we need an automated auditing script to check that every /h directory has the appropriate file controlling backups, and that the requested backups are actually happening.
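A sketch of the first half of that audit (verifying that the backups actually happen would mean querying Amanda's logs, which is omitted here):

    #!/bin/sh
    # Verify that every /h directory carries at least one control file.
    for dir in /h[0-9] /m1/h1; do
        [ -d "$dir" ] || continue
        if [ ! -e "$dir/BACKMEUP" ] && [ ! -e "$dir/DONTBACKUP" ] \
                && [ ! -e "$dir/BACKUPTAR" ]; then
            echo "WARNING: $dir lacks BACKMEUP/DONTBACKUP/BACKUPTAR"
        fi
    done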
We will avoid bind mounts of /m1/h1 onto /h1, unless there is a real operational need. At present these bind mounts are in use:
On Laguna and Malibu, /h1/m1 exists and /m1 is a symlink to it. When time permits (i.e. at EOQ) we'll create an actual /m1 partition on these machines.
We need, and may not currently have, a backed-up copy of the partition table of every machine.
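sfdisk can produce a dump that it can later re-read; a sketch, with illustrative device and destination names:

    # Save the partition table where it will be backed up:
    sfdisk -d /dev/sda > /h1/sysbackup/sda.partitions
    # Restore it onto a replacement disc:
    sfdisk /dev/sda < /h1/sysbackup/sda.partitions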
Here's an outline of the procedure for restoring a server after a total loss:
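(The full procedure remains to be written up and tested; what follows is a plausible sketch assuming Amanda's standard restore tools and the saved partition tables above. Device, filesystem, and config names are illustrative.)

    # 1. Boot the replacement machine from rescue media.
    # 2. Recreate the partitions and filesystems:
    sfdisk /dev/sda < sda.partitions
    mkfs -t ext3 /dev/sda1                # filesystem type is illustrative
    # 3. Reinstall the operating system and the Amanda client.
    # 4. Recover each backed-up directory from the backup media:
    amrecover DailySet1                   # Amanda config name is illustrative
    # 5. Unpack the outlier tgz copies (e.g. /etc) from the /h directories.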
We haven't discussed retention periods, what goes to tape (versus sitting on Ulanda's RAID), or when this happens. We need to deal with those issues.